align-plm

Get Started

Environment

conda env create -f env.yml
conda activate align

Data

Download the following from Tranception:

MSA_weights (download and unzip MSA_weights.zip)
MSA_files (download and unzip MSA_ProteinGym.zip)
Tranception checkpoints (Small and Large); place them under ckpt/

Download the DMS data from ProteinGym and place it under proteingym/

Split the dataset.

python split_dataset.py

Resource constraints

Memory usage depends on the length of the protein, which varies widely across assays. Batch size is therefore determined by the functions in core/utils/get_batch.py. The values there are hard-coded for a GPU with 24 GB of memory, so adjust them to fit your own hardware.
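To illustrate the idea of a length-dependent batch size, here is a minimal sketch. It is not the code in core/utils/get_batch.py: the function name, the token budget of 16384 per 24 GB, the cap of 64, and the linear scaling with GPU memory are all assumptions made for illustration, and real memory use may grow faster than linearly with sequence length.

```python
def batch_size_for_length(
    seq_len: int,
    gpu_mem_gb: float = 24.0,
    tokens_per_24gb: int = 16384,  # hypothetical budget, tune for your model
    max_batch: int = 64,           # hypothetical upper bound
) -> int:
    """Pick a batch size so that batch_size * seq_len stays within a token budget.

    The budget is scaled linearly with available GPU memory, which is a
    simplifying assumption (model weights take a fixed share of memory).
    """
    if seq_len <= 0:
        raise ValueError("seq_len must be positive")
    budget = int(tokens_per_24gb * gpu_mem_gb / 24.0)
    # Always return at least 1 so very long proteins are still processed.
    return max(1, min(max_batch, budget // seq_len))


if __name__ == "__main__":
    for length in (100, 512, 2000):
        print(length, batch_size_for_length(length, gpu_mem_gb=12.0))
```

If you hit out-of-memory errors on long proteins, lowering the budget (or the corresponding hard-coded values in get_batch.py) is the first thing to try.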

Fitness Prediction in ProteinGym

Pipeline for Single DMS assay

python pipeline.py --DMS_id IF1_ECOLI_Kelsic_2016

Run for all ProteinGym benchmarks

python 1_sft_all.py
python 2_ref_dist.py
python 3_generate_pairs.py
python 4_align_score.py

The scoring files will be saved under dms-results.

python performance.py --input_scoring_files_folder dms-results/$exp_name$ --performance_by_depth

This evaluates the experiment in terms of correlation metrics. Replace $exp_name$ with the name of your experiment folder under dms-results.
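For reference, Spearman's rho (the metric reported in total.csv) can be computed as the Pearson correlation of the ranks of the two score lists. The sketch below uses only the standard library and is not taken from performance.py; the function names are hypothetical, and performance.py may compute additional metrics or handle ties differently.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Plain Pearson correlation coefficient."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def spearman_rho(predicted: Sequence[float], measured: Sequence[float]) -> float:
    """Spearman's rho between model scores and measured DMS_score values."""
    if len(predicted) != len(measured) or len(predicted) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    return pearson(average_ranks(predicted), average_ranks(measured))
```

Because only the ranks matter, the metric rewards a model for ordering mutants correctly even when its scores are on a different scale from DMS_score.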

Fitness Optimization in AAV and ACE2

See tasks/

Pipeline for Custom DMS assay

See tasks/pipeline.py. To run it on your own assay:

Set the target_seq value to the wild-type sequence
Set the preference argument to the root directory containing the dataset
Set the DMS_id argument to the directory containing your DMS assay

Each dataset file should contain the columns mutant, mutated_sequence, and DMS_score. mutant should follow the conventional format, e.g. R42Q, with multiple mutations joined by the : symbol. Because of how the pandas join functions in the Tranception scoring utils behave, I recommend deleting all other columns.

The expected directory layout is:

$preference$/
   $DMS_id$/
      dms_train.csv
      dms_test.csv
      dms_val.csv
      sft.csv
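Before running the pipeline on a custom assay, it can help to check that each mutant string is consistent with the wild-type sequence. The sketch below parses the format described above (e.g. R42Q, with multiple mutations joined by :) and applies it to target_seq. The function names are hypothetical, and 1-based positions are an assumption; pass a different offset if your assay numbers residues differently.

```python
import re
from typing import List, Tuple

# One substitution: wild-type residue, position, mutant residue (e.g. R42Q).
_MUTATION = re.compile(r"^([A-Z])(\d+)([A-Z])$")


def parse_mutant(mutant: str) -> List[Tuple[str, int, str]]:
    """Split a mutant string such as 'R42Q:A10G' into (wt, position, mt) tuples."""
    parsed = []
    for token in mutant.split(":"):
        match = _MUTATION.match(token)
        if match is None:
            raise ValueError(f"unrecognized mutation: {token!r}")
        parsed.append((match.group(1), int(match.group(2)), match.group(3)))
    return parsed


def apply_mutant(target_seq: str, mutant: str, offset: int = 1) -> str:
    """Apply the substitutions in `mutant` to the wild-type `target_seq`.

    `offset` is the position number of the first residue (assumed to be 1).
    """
    seq = list(target_seq)
    for wt, pos, mt in parse_mutant(mutant):
        idx = pos - offset
        if not 0 <= idx < len(seq):
            raise ValueError(f"position {pos} is outside the sequence")
        if target_seq[idx] != wt:
            raise ValueError(
                f"expected {wt} at position {pos}, found {target_seq[idx]}"
            )
        seq[idx] = mt
    return "".join(seq)


if __name__ == "__main__":
    print(apply_mutant("MKRA", "M1A:A4G"))
```

Comparing apply_mutant(target_seq, mutant) against the mutated_sequence column for every row is a quick way to catch off-by-one numbering or a wrong wild-type sequence.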

Reproduce the Figures

See draw_figure.ipynb. Note that total.csv contains the Spearman's rho correlations of Ours, Alpha-Missense, Tranception Large, Tranception Large (no retrieval), EVE, and ESM1v on the ProteinGym benchmark.

TODO

Support LoRA

haewonc/align-plm
