# Bio2Token

Bio2Token is a deep learning-based autoencoder for quantizing any biomolecular structure into a discrete token format. This is the official repository for the paper *Bio2Token: All-atom tokenization of any biomolecular structure with Mamba*.
- Setup Guide
- MLflow Server
- Model Architecture
- Datasets
- Training
- Testing
- Testing on Raw PDB Files
- License
## Setup Guide

To manage environments efficiently, we use `uv`. It simplifies managing dependencies and executing scripts.
- Install `uv` and synchronize your environment:

  ```bash
  pip install uv
  uv sync
  ```
## MLflow Server

We utilize MLflow for tracking model metrics, experiments, and parameters. To start the MLflow server, use the command below and access the monitoring interface at http://localhost:8080:
```bash
mlflow server --host 127.0.0.1 --port 8080
```

Update the MLflow server configuration as needed in `configs/trainer.yaml`:

```yaml
mlflow:
  experiment_name: bio2token
  tracking_server_host: 127.0.0.1
  tracking_server_port: 8080
```

## Model Architecture

The autoencoder consists of Mamba layers and an FSQ quantizer layer. The model implementation can be found in `src/bio2token/models/autoencoder.py`, and you can customize the model configuration via `configs/model.yaml`. Both RMSD and inter-atom distance are used as loss functions, with TM-score available for monitoring performance.
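For intuition about the quantizer, below is a minimal, self-contained sketch of finite scalar quantization (FSQ) in NumPy. It is illustrative only, not the repo's implementation, and the level counts in `levels` are made-up values:

```python
import numpy as np

# Illustrative FSQ sketch (not the repo's implementation): each latent
# dimension is squashed with tanh and rounded to a fixed number of
# levels; the per-dimension level indices are then packed into a single
# integer token id via mixed-radix encoding.
def fsq_tokenize(z, levels=(8, 8, 8, 5, 5, 5)):
    levels = np.asarray(levels)
    half = (levels - 1) / 2.0
    # tanh bounds each dim to (-1, 1); scaling and shifting maps it to
    # (0, levels - 1), which rounds to a valid level index.
    indices = np.round(np.tanh(z) * half + half).astype(int)
    token = 0
    for idx, n in zip(indices, levels):
        token = token * int(n) + int(idx)
    return token

rng = np.random.default_rng(0)
print(fsq_tokenize(rng.standard_normal(6)))  # one discrete token per latent vector
```

In the actual model, a straight-through estimator lets gradients flow through the rounding step; see the paper and `src/bio2token/models/autoencoder.py` for the real details.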
## Datasets

We provide pre-processed datasets for download from Zenodo. Unzip the directory into the top level of the repo, at `bio2token/data`. The `data/` folder contains:
- CATH dataset
- CASP 14 and 15 datasets (testing only)
- NablaDFT dataset
- RNA3DB dataset
Pre-processed AlphaFoldDB data is also available upon request. Adjust the data configuration in `configs/data.yaml`:

```yaml
data:
  ds_name: [cath, nabladft, rna3db]
  ds_name_val: cath
  batch_size_per_gpu: 8
  batch_size_per_gpu_val: 8
  num_workers: 4
  dataset:
    cath:
      train_split: train
      val_split: validation
      max_length: 4096
    nabladft:
      train_split: train_100k
      max_data: 10000
    rna3db:
      train_split: train_set
      max_length: 4096
      max_data: 10000
```
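If you want to sanity-check the config programmatically, here is a small, purely illustrative sketch that loads `configs/data.yaml` with PyYAML and prints the per-dataset options (the repo's scripts do their own config handling):

```python
import yaml

# Illustrative only: read configs/data.yaml (the file shown above) and
# inspect the per-dataset options.
with open("configs/data.yaml") as f:
    cfg = yaml.safe_load(f)["data"]

print(cfg["ds_name"])  # ['cath', 'nabladft', 'rna3db']
for name in cfg["ds_name"]:
    print(name, cfg["dataset"].get(name, {}))
```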
## Training

Train the Bio2Token model with the following command:
```bash
uv run scripts/train.py --config train.yaml
```

Configuration details are available in `configs/train.yaml`, which imports `model.yaml`, `data.yaml`, and `trainer.yaml`. The model weights are saved under `checkpoints/${experiment_name}/${run_id}`.
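To peek inside a saved checkpoint, here is a minimal sketch with PyTorch; the exact file name written under the checkpoint directory is not documented here, so the path below is a placeholder to adjust for your run:

```python
import torch

# Illustrative only: the file name under
# checkpoints/${experiment_name}/${run_id} is a placeholder; substitute
# whatever scripts/train.py writes for your run.
ckpt = torch.load("checkpoints/bio2token/<run_id>/checkpoint.ckpt", map_location="cpu")
print(sorted(ckpt.keys()) if isinstance(ckpt, dict) else type(ckpt))
```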
## Testing

To test the model on a pre-processed dataset, use:
```bash
uv run scripts/test.py --config test.yaml
```

Configure the dataset and model weights in `configs/test.yaml`:

```yaml
infer:
  experiment_name: bio2token
  run_id: bio2token_pretrained
data:
  ds_name: rna3db
```

Results are saved under `results/${experiment_name}/${run_id}/${checkpoint_name}/${ds_name}/${split_name}/test_outputs.parquet`.
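To explore the results afterwards, a minimal sketch with pandas; the bracketed path segments come from your run, and the output columns are not documented here, so the sketch just prints them:

```python
import pandas as pd

# Illustrative only: substitute the bracketed segments with the values
# from your run; the column names vary, so we simply list them.
df = pd.read_parquet(
    "results/bio2token/bio2token_pretrained/<checkpoint_name>/rna3db/<split_name>/test_outputs.parquet"
)
print(df.shape)
print(df.columns.tolist())
```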
## Testing on Raw PDB Files

Test on raw PDB files using:
```bash
uv run scripts/test_pdb.py --config test_pdb.yaml
```

Specify the PDB file and model weights in `configs/test_pdb.yaml`:

```yaml
infer:
  experiment_name: bio2token
  run_id: bio2token_pretrained
data:
  ds_name: examples/pdbs/107l.pdb
```

Results are saved under `results/pdb/${pdb_id}/${run_id}/${checkpoint_name}/`. The ground-truth and reconstructed structures, together with the losses and estimated tokens, are saved as `gt.pdb`, `rec.pdb`, and `outputs.json`.
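As a quick sanity check on an output directory, here is a small standard-library sketch that computes an all-atom RMSD between `gt.pdb` and `rec.pdb`. It assumes both files list the same atoms in the same order and share a coordinate frame, which a reconstruction should satisfy:

```python
import json
import math

# Illustrative only: parse fixed-column PDB coordinates and compute an
# unaligned all-atom RMSD between ground truth and reconstruction.
def coords(path):
    xyz = []
    with open(path) as f:
        for line in f:
            if line.startswith(("ATOM", "HETATM")):
                xyz.append((float(line[30:38]), float(line[38:46]), float(line[46:54])))
    return xyz

gt, rec = coords("gt.pdb"), coords("rec.pdb")
sq = sum((a - b) ** 2 for p, q in zip(gt, rec) for a, b in zip(p, q))
print(f"all-atom RMSD: {math.sqrt(sq / len(gt)):.3f} Å")

with open("outputs.json") as f:
    print(json.load(f).keys())  # losses and estimated tokens, per the text above
```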
## License

This project's code is licensed under the MIT License. See LICENSE for more details. The dataset licenses are specified on the download page.


