conda env create -f env.yml
conda activate align
Download from Tranception
- MSA_weights (download & unzip MSA_weights.zip)
- MSA_files (download & unzip MSA_ProteinGym.zip)
- Tranception checkpoints (Small and Large)

then place under ckpt/
Download the DMS data from ProteinGym, then place it under proteingym/
Split the dataset.
python split_dataset.py
Memory usage depends on protein length, which varies widely across proteins. Batch size is therefore handled by the functions in core/utils/get_batch.py. The values are hard-coded for a GPU with 24GB of memory, so you may need to change them to fit your resource constraints.
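As an illustration of the idea only, a length-aware batch size could be sketched as below. This is not the code in core/utils/get_batch.py: the function name, the reference length, the reference batch size, and the linear scaling rule are all assumptions made for the example.

```python
def batch_size_for_length(seq_len: int,
                          gpu_mem_gb: float = 24.0,
                          ref_len: int = 512,
                          ref_batch: int = 8,
                          ref_mem_gb: float = 24.0) -> int:
    """Hypothetical heuristic, not the repository's implementation.

    Assumes a batch of `ref_batch` sequences of length `ref_len` fits on a
    GPU with `ref_mem_gb` of memory, then scales that batch size linearly
    with available memory and inversely with sequence length.
    """
    if seq_len <= 0:
        raise ValueError("seq_len must be positive")
    scale = (gpu_mem_gb / ref_mem_gb) * (ref_len / seq_len)
    # Always process at least one sequence per batch.
    return max(1, int(ref_batch * scale))


if __name__ == "__main__":
    for length in (256, 512, 1024, 4096):
        print(length, batch_size_for_length(length))
```

On a smaller GPU, passing a lower gpu_mem_gb shrinks every batch proportionally. The real memory footprint may grow faster than linearly with length, so treat any such rule as a starting point and verify it against actual out-of-memory behavior.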
python pipeline.py --DMS_id IF1_ECOLI_Kelsic_2016
python 1_sft_all.py
python 2_ref_dist.py
python 3_generate_pairs.py
python 4_align_score.py
The scoring files will be saved under dms-results.
python performance.py --input_scoring_files_folder dms-results/$exp_name$ --performance_by_depth
This will evaluate the experiment in terms of correlation metrics.
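For readers who want to sanity-check a single scoring file by hand, the sketch below computes Spearman's rho between predicted scores and DMS scores using only the standard library. It is an independent illustration, not the code in performance.py, and the function names are chosen for this example.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (vx * vy) ** 0.5


def spearman_rho(predicted, observed):
    """Spearman's rho: Pearson correlation of the rank-transformed values."""
    if len(predicted) != len(observed) or len(predicted) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(predicted), average_ranks(observed))
```

Calling spearman_rho on the model score column and the DMS_score column of one assay gives a value between -1 and 1, with higher values meaning the model ranks variants more like the experiment does.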
See tasks/
See tasks/pipeline.py. You may
- Set target_seq value as the wild-type sequence
- Set preference argument as the root directory containing the dataset
- Set DMS_id argument as the directory containing your DMS assay

All datasets should consist of the columns mutant, mutated_sequence, and DMS_score. mutant should be in the conventional format, e.g. R42Q, with multiple mutations joined by the : symbol. Due to the behavior of the pandas join functions in the Tranception scoring utils, I recommend deleting all other columns.
$preference$/
$DMS_id$/
dms_train.csv
dms_test.csv
dms_val.csv
sft.csv
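When preparing your own CSV files, a small check like the following may help confirm that each mutant string agrees with the wild-type sequence and yields the expected mutated_sequence. It is a sketch, not part of the repository: the function name is hypothetical, and it assumes positions are 1-indexed relative to the sequence passed as target_seq.

```python
import re

# One substitution such as "R42Q": wild-type residue, position, new residue.
MUTATION_RE = re.compile(r"^([A-Z])(\d+)([A-Z])$")

# Columns the dataset files are expected to contain.
REQUIRED_COLUMNS = ("mutant", "mutated_sequence", "DMS_score")


def apply_mutant(wild_type: str, mutant: str) -> str:
    """Apply a mutant string like "R42Q" or "R42Q:A10G" to `wild_type`.

    Assumes 1-indexed positions. Raises ValueError if a token is malformed,
    out of range, or names a residue that differs from the wild type.
    """
    seq = list(wild_type)
    for token in mutant.split(":"):
        match = MUTATION_RE.match(token)
        if match is None:
            raise ValueError(f"malformed mutation: {token!r}")
        ref, pos, alt = match.group(1), int(match.group(2)), match.group(3)
        if not 1 <= pos <= len(seq):
            raise ValueError(f"position out of range: {token!r}")
        if wild_type[pos - 1] != ref:
            raise ValueError(
                f"{token!r} expects {ref} at position {pos}, "
                f"found {wild_type[pos - 1]}"
            )
        seq[pos - 1] = alt
    return "".join(seq)


if __name__ == "__main__":
    # Toy wild-type sequence, for illustration only.
    print(apply_mutant("MKRLA", "R3Q"))
    print(apply_mutant("MKRLA", "K2A:A5G"))
```

Comparing apply_mutant(target_seq, row["mutant"]) against row["mutated_sequence"] for every row is a quick way to catch off-by-one position offsets before running the pipeline.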
See draw_figure.ipynb. Note that total.csv contains the Spearman's rho correlations of Ours, AlphaMissense, Tranception Large, Tranception Large (no retrieval), EVE, and ESM1v on the ProteinGym benchmark.
- Support LoRA