This repository demonstrates an end-to-end protein workflow that integrates NVIDIA BioNeMo Framework, NIMs (NVIDIA Inference Microservices), and Valohai MLOps for seamless model training, inference, evaluation, and deployment.
This project shows how to:
- Preprocess and load biological datasets for BioNeMo models
- Run ESM2 inference for protein property prediction
- Generate novel protein sequences with ProtGPT2
- Evaluate generated sequences with custom metrics
- Visualize protein similarity embeddings
- Deploy models into a NIM container for production inference
Repository scripts:
- `prepare-data.py` → Preprocess and prepare datasets for BioNeMo models
- `predict-properties.py` → Run property prediction using BioNeMo's pretrained ESM2 models
- `generate-proteins.py` → Generate protein sequences with ProtGPT2
- `evaluate-generated.py` → Evaluate the quality and diversity of generated proteins
- `protein-similarity-visualization.py` → Visualize similarity between embeddings
- `convert-model.py` → Convert Hugging Face models into a NIM-compatible format (safetensors layout)
- `valohai.yaml` → Defines all pipeline steps for Valohai execution
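To illustrate the kind of checks `evaluate-generated.py` performs, validity and diversity of generated sequences can be measured roughly like this (a minimal sketch; the function names and exact metrics are assumptions, not the repository's implementation):

```python
# Hypothetical sketch of sequence-quality metrics; the repository's actual
# metrics in evaluate-generated.py may differ.
VALID_AA = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 canonical amino acids

def validity(seqs):
    """Fraction of sequences composed only of canonical amino acids."""
    ok = sum(1 for s in seqs if s and set(s) <= VALID_AA)
    return ok / len(seqs)

def diversity(seqs):
    """Fraction of unique sequences (a simple duplicate-based measure)."""
    return len(set(seqs)) / len(seqs)
```

Richer diversity measures (e.g. pairwise sequence identity) follow the same pattern but compare sequences against each other rather than against a duplicate set.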
The Valohai pipeline automates the complete workflow:
- Uses the BioNeMo framework container (`nvcr.io/nvidia/clara/bionemo-framework:1.3`)
- Converts the UniRef50 dataset to FASTA
- Splits the dataset for training, validation, and testing
- Runs ESM2 property prediction with the BioNeMo framework (nightly image)
  - Configurable parameters: `num_gpus` (default: 1), `precision` (`fp16`, `fp32`), `micro-batch-size`
- Runs ProtGPT2 sequence generation on the CUDA runtime
  - Generation parameters: `max_length`, `top_k`, `repetition_penalty`, `num_return_sequences`, `eos_token_id`
- Evaluates generated sequences for validity and diversity, taking the sequences from the previous step as input
- Visualizes embedding-based similarity
  - Parameters: `query_idx` (query sequence index), `topk` (number of similar sequences retrieved)
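The `query_idx`/`topk` retrieval behind the visualization step can be sketched as cosine-similarity search over the embeddings (an assumed approach; the repository's script may use a different distance or library):

```python
import numpy as np

def top_k_similar(embeddings, query_idx, topk=5):
    """Return the indices of the topk embeddings most similar to the query
    (cosine similarity on L2-normalized vectors), excluding the query itself."""
    emb = np.asarray(embeddings, dtype=float)
    norms = np.linalg.norm(emb, axis=1, keepdims=True)
    unit = emb / np.clip(norms, 1e-12, None)   # guard against zero vectors
    sims = unit @ unit[query_idx]              # cosine similarity to the query
    sims[query_idx] = -np.inf                  # never return the query itself
    return np.argsort(-sims)[:topk].tolist()
```

The returned indices can then be fed to a 2-D projection (e.g. with `matplotlib`, which the project already lists) for plotting.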
Prerequisites: set up a GPU instance and create a `restart_nim.sh` script on it. An example of the one used for this project:
```bash
#!/bin/bash
set -e

CONTAINER_NAME="protgpt2-nim"

echo "[INFO] Stopping old container (if any)..."
docker rm -f "$CONTAINER_NAME" 2>/dev/null || true

echo "[INFO] Starting new ProtGPT2 NIM..."
docker run -d --rm --name="$CONTAINER_NAME" \
  --runtime=nvidia --gpus all \
  -p 8000:8000 \
  -v /home/ec2-user/models/protgpt2:/models/protgpt2 \
  -e NIM_MODEL_NAME=/models/protgpt2 \
  nvcr.io/nim/nvidia/llm-nim:1.13.0

echo "[INFO] Waiting for NIM to become healthy..."
for i in {1..150}; do
  if curl -fsS http://localhost:8000/v1/metadata >/dev/null; then
    echo "[INFO] NIM is healthy!"
    exit 0
  fi
  if ! docker ps --format '{{.Names}}' | grep -q "^$CONTAINER_NAME$"; then
    echo "[ERROR] Container exited unexpectedly"
    docker logs --tail=200 "$CONTAINER_NAME" || true
    exit 1
  fi
  sleep 2  # polling interval (restored; the original loop tail was truncated)
done

echo "[ERROR] Timed out waiting for NIM to become healthy"
exit 1
```
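Once the NIM reports healthy, it can be queried over HTTP. A minimal sketch, assuming an OpenAI-compatible `/v1/completions` endpoint on the container started above (the exact path and payload field names are assumptions, not confirmed by this repository):

```python
import json
import urllib.request

NIM_URL = "http://localhost:8000"  # port published by restart_nim.sh

def build_completion_payload(prompt, max_tokens=128, temperature=1.0):
    """Build a completions-style request body (field names assumed)."""
    return {
        "model": "/models/protgpt2",  # matches NIM_MODEL_NAME in restart_nim.sh
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }

def generate(prompt, **kwargs):
    """POST the payload to the running NIM (requires the container to be up)."""
    body = json.dumps(build_completion_payload(prompt, **kwargs)).encode()
    req = urllib.request.Request(
        f"{NIM_URL}/v1/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())
```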
The `deploy-NIM` step:
- Converts models into the NIM safetensors layout
- Copies them securely to a remote NIM host (e.g. an EC2 GPU instance)
- Restarts the NIM container to load the new model
- Requires an SSH private key as a Valohai input
- Parameters: `host_name` (target instance host)
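The copy-and-restart portion of the step can be sketched as building `scp`/`ssh` command lines (a hypothetical helper; the actual paths, script name, and invocation in the repository may differ):

```python
import subprocess  # used only in the optional runner below

def build_deploy_commands(host_name, key_path, local_model_dir,
                          remote_model_dir="/home/ec2-user/models/protgpt2"):
    """Return the scp/ssh argv lists a deploy step could run (paths assumed)."""
    scp = ["scp", "-i", key_path, "-r", local_model_dir,
           f"{host_name}:{remote_model_dir}"]
    ssh = ["ssh", "-i", key_path, host_name, "bash", "restart_nim.sh"]
    return scp, ssh

def deploy(host_name, key_path, local_model_dir):
    """Copy the model, then restart the NIM container on the remote host."""
    for cmd in build_deploy_commands(host_name, key_path, local_model_dir):
        subprocess.run(cmd, check=True)
```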
To execute the full workflow:

```bash
vh pipeline run bionemo_end_to_end
```

This will:
- Load and preprocess the dataset
- Predict protein properties with BioNeMo
- Generate new protein sequences
- Evaluate sequence quality
- Visualize similarities
- Deploy models into NIM for production use (Requires user approval)
- NVIDIA BioNeMo Framework
- NVIDIA CUDA 11.8 + PyTorch 2.7.1
- Valohai MLOps platform
- Hugging Face Transformers (for ProtGPT2 conversion)
- Python packages: `torch`, `pandas`, `scikit-learn`, `matplotlib`, `valohai-utils`
Install Python dependencies locally with:

```bash
pip install -r requirements.txt
```

NIMs require models in a specific safetensors layout:

```
config.json
model.safetensors
tokenizer.json
tokenizer_config.json
special_tokens_map.json
vocab.json
```
The step deploy-NIM automatically handles this conversion and deployment.
- You must provide your SSH private key as a Valohai input for deployment.
- For NVIDIA NGC images, set up authentication in the Valohai project registry:
  - image pattern: `nvcr.io/*`
  - username: `$oauthtoken`
  - password: your `NGC_API_KEY`
- NVIDIA BioNeMo – Foundation models for proteins, chemistry, and biology
- NVIDIA NIMs – Optimized inference microservices for deploying AI models
- Valohai – End-to-end machine learning automation platform