
Bio2Token: Deep Learning Autoencoder for Biological Tokenization

Bio2Token is a deep-learning autoencoder that quantizes arbitrary biomolecular structures into a discrete token representation. This is the official repository for the paper Bio2Token: All-atom tokenization of any biomolecular structure with Mamba.

[Figure: Bio2Token overview]

Table of Contents

  1. Setup Guide
  2. MLflow Server
  3. Model Architecture
  4. Datasets
  5. Training
  6. Testing
  7. Testing on Raw PDB Files
  8. License

Setup Guide

We use uv to manage environments; it simplifies installing dependencies and running scripts.

  • Install uv and synchronize your environment:
    pip install uv
    uv sync
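
To sanity-check the setup, you can try importing the package from the synced environment (assuming the package is importable as bio2token; adjust if the import name differs):

uv run python -c "import bio2token"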

MLflow Server

We use MLflow to track experiments, metrics, and parameters. Start the MLflow server with the command below and open the monitoring interface at http://localhost:8080:

mlflow server --host 127.0.0.1 --port 8080

Update the MLflow server configuration as needed in configs/trainer.yaml:

mlflow:
  experiment_name: bio2token
  tracking_server_host: 127.0.0.1
  tracking_server_port: 8080
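
For reference, this is how a script points the standard MLflow client at that server (a generic MLflow usage sketch, not code from this repository):

import mlflow

mlflow.set_tracking_uri("http://127.0.0.1:8080")  # host/port from trainer.yaml
mlflow.set_experiment("bio2token")                # experiment_name from trainer.yaml
with mlflow.start_run():
    mlflow.log_metric("rmsd", 0.42)               # dummy value, for illustration only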

Model Architecture

The autoencoder consists of Mamba layers and an FSQ (finite scalar quantization) layer. The model implementation can be found in src/bio2token/models/autoencoder.py, and you can customize the model configuration via configs/model.yaml. RMSD and inter-atom distance are used as training losses, with TM-score available for monitoring reconstruction quality.
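
As a rough illustration of how FSQ turns continuous latents into discrete tokens, here is a minimal forward-pass sketch. This is not the repository's implementation, and the per-dimension level counts are made-up assumptions:

import numpy as np

def fsq_quantize(z, levels=(8, 8, 8, 5, 5, 5)):
    # Bound each latent dimension, snap it to a fixed integer grid,
    # then pack the per-dimension codes into a single token index.
    L = np.asarray(levels, dtype=np.float64)
    half = (L - 1) / 2.0
    z_bounded = np.tanh(z) * half              # squash into [-half, half]
    codes = np.round(z_bounded) + half         # integer codes in [0, L-1]
    token = np.zeros(z.shape[:-1])
    for d, n_levels in enumerate(levels):      # mixed-radix packing
        token = token * n_levels + codes[..., d]
    return token.astype(np.int64)

tokens = fsq_quantize(np.random.randn(4, 6))   # 4 latents, 6 dims each
print(tokens)                                  # 4 discrete token ids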

[Figure: model architecture]

Datasets

We provide pre-processed datasets for download from Zenodo. Unzip the archive into the top level of the repo, so the data lands in bio2token/data; the data/ folder then contains the datasets referenced in configs/data.yaml (cath, nabladft, rna3db). Pre-processed AlphaFoldDB data is also available upon request.

Adjust the data configuration in configs/data.yaml:

data:
  ds_name: [cath, nabladft, rna3db]
  ds_name_val: cath
  batch_size_per_gpu: 8
  batch_size_per_gpu_val: 8
  num_workers: 4
  dataset:
    cath:
      train_split: train
      val_split: validation
      max_length: 4096
    nabladft:
      train_split: train_100k
      max_data: 10000
    rna3db:
      train_split: train_set
      max_length: 4096
      max_data: 10000
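
To inspect or tweak this config programmatically, plain YAML tooling is enough (a sketch; the repository's own config loader may differ):

import yaml

with open("configs/data.yaml") as f:
    cfg = yaml.safe_load(f)
print(cfg["data"]["ds_name"])   # ['cath', 'nabladft', 'rna3db']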

Training

Train the Bio2Token model with the following command:

uv run scripts/train.py --config train.yaml

Configuration details are available in configs/train.yaml, which imports model.yaml, data.yaml, and trainer.yaml. The model weights are saved under checkpoints/${experiment_name}/${run_id}.
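
For example, after a run completes you can list its saved weights (run_id here is a placeholder assigned at training time):

ls checkpoints/bio2token/<run_id>/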

Testing

To test the model on a pre-processed dataset, use:

uv run scripts/test.py --config test.yaml

Configure the dataset and model weights in configs/test.yaml:

infer:
  experiment_name: bio2token
  run_id: bio2token_pretrained
data:
  ds_name: rna3db

Results are saved under results/${experiment_name}/${run_id}/${checkpoint_name}/${ds_name}/${split_name}/test_outputs.parquet.
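
Assuming each row of the parquet corresponds to one test structure, a quick way to inspect the results is with pandas (a generic sketch; the column names are not documented here, so print the schema first):

import pandas as pd

path = "results/bio2token/bio2token_pretrained/<checkpoint_name>/rna3db/<split_name>/test_outputs.parquet"
df = pd.read_parquet(path)
print(df.columns.tolist())   # discover which losses/metrics are stored
print(df.head())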

[Figure: RMSD results]

Testing on Raw PDB Files

Test on raw PDB files using:

uv run scripts/test_pdb.py --config test_pdb.yaml

Specify the PDB file and model weights in configs/test_pdb.yaml:

infer:
  experiment_name: bio2token
  run_id: bio2token_pretrained
data:
  ds_name: examples/pdbs/107l.pdb

Results are saved under results/pdb/${pdb_id}/${run_id}/${checkpoint_name}/. The ground-truth and reconstructed structures are saved as gt.pdb and rec.pdb, and the losses and estimated tokens as outputs.json.
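
To inspect the saved losses and tokens programmatically, the JSON can be loaded directly (the keys are not documented here, so print them first):

import json

path = "results/pdb/107l/bio2token_pretrained/<checkpoint_name>/outputs.json"
with open(path) as f:
    outputs = json.load(f)
print(list(outputs.keys()))   # see which losses and tokens were saved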

License

This project's code is licensed under the MIT license. See LICENSE for more details. The dataset licenses are specified on the download page.
