The official GitHub repository for SPMM, a multi-modal molecular pre-trained model for a synergistic comprehension of molecular structure and properties. Details can be found in the following paper: "Bidirectional Generation of Structure and Properties Through a Single Molecular Foundation Model" (Nature Communications, 2024).
Molecular structures are given as SMILES, and we use 53 simple chemical properties to build the property vector (PV) of a molecule.
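As a toy illustration of what a PV contains (not the actual 53-property routine; see `calc_property.py` for that), a few common RDKit descriptors can be collected into a small vector:

```python
# Toy example only: a handful of RDKit descriptors standing in for the 53
# properties computed by calc_property.py (see that file for the real list).
from rdkit import Chem
from rdkit.Chem import Descriptors

def toy_property_vector(smiles: str):
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"invalid SMILES: {smiles}")
    return [
        Descriptors.MolWt(mol),          # molecular weight
        Descriptors.MolLogP(mol),        # Wildman-Crippen logP
        Descriptors.TPSA(mol),           # topological polar surface area
        Descriptors.NumHDonors(mol),     # H-bond donors
        Descriptors.NumHAcceptors(mol),  # H-bond acceptors
    ]

print(toy_property_vector("CC(=O)Oc1ccccc1C(=O)O"))  # aspirin
```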
The model checkpoint and data are too large to be included in this repository; they can be found here.
- `data/`: Contains the data used for the experiments in the paper. (You have to make this folder and put the data that you downloaded from the link above into it.)
- `Pretrain/`: Contains the checkpoint of the pre-trained SPMM. (You have to make this folder and put the checkpoint that you downloaded from the link above into it.)
- `vocab_bpe_300.txt`: Contains the SMILES tokens for the SMILES tokenizer.
- `property_name.txt`: Contains the names of the 53 chemical properties.
- `normalize.pkl`: Contains the mean and standard deviation of the 53 chemical properties that we used for the PV.
- `calc_property.py`: Contains the code to calculate the 53 chemical properties and build a PV for a given SMILES. Modify this code accordingly to utilize SPMM pre-training for your custom PVs.
- `SPMM_models.py`: Contains the SPMM model and its pre-training code.
- `SPMM_pretrain.py`: Runs SPMM pre-training.
- `d_*.py`: Code for the downstream tasks.
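Since `normalize.pkl` stores per-property means and standard deviations, a custom PV is presumably standardized (z-scored) with those statistics before use. A minimal sketch, taking the mean and std arrays as already loaded (the on-disk layout of `normalize.pkl` is not restated here, so inspect the file itself):

```python
# Minimal sketch of PV standardization with the statistics from normalize.pkl.
# `mean` and `std` are assumed to be 53-element arrays ordered as in
# property_name.txt; check normalize.pkl for the actual layout.
import numpy as np

def normalize_pv(pv, mean, std):
    """z-score a raw 53-dimensional property vector."""
    return (np.asarray(pv, dtype=float) - np.asarray(mean)) / np.asarray(std)

def denormalize_pv(pv_norm, mean, std):
    """Map a normalized PV back to the original property scale."""
    return np.asarray(pv_norm, dtype=float) * np.asarray(std) + np.asarray(mean)
```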
Run pip install -r requirements.txt to install the required packages.
Arguments can be passed on the command line, or edited manually in the corresponding script.
- Pre-training

  python SPMM_pretrain.py --data_path './data/pretrain.txt'
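If you pre-train on your own corpus, the sketch below shows one way to prepare it. It assumes the pretraining file is plain text with one SMILES per line (an assumption; verify against the downloaded `pretrain.txt`), and the file names `raw_smiles.txt` and `pretrain_custom.txt` are placeholders.

```python
# Sketch: clean a raw SMILES list into a pretraining corpus.
# Assumes one SMILES per line; verify against the downloaded pretrain.txt
# before relying on this.
from rdkit import Chem

def clean_smiles_file(src="raw_smiles.txt", dst="./data/pretrain_custom.txt"):
    seen, kept = set(), []
    with open(src) as f:
        for line in f:
            smi = line.strip()
            if not smi:
                continue
            mol = Chem.MolFromSmiles(smi)
            if mol is None:                  # drop unparsable SMILES
                continue
            can = Chem.MolToSmiles(mol)      # canonical form for deduplication
            if can not in seen:
                seen.add(can)
                kept.append(can)
    with open(dst, "w") as f:
        f.write("\n".join(kept) + "\n")
    return len(kept)

print(clean_smiles_file(), "molecules written")
```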
- PV-to-SMILES generation
  - batched: The model takes the PVs of the molecules in `input_file` and generates molecules with those PVs using k-beam search. The generated molecules will be written in `generated_molecules.txt`.

    python d_pv2smiles_batched.py --checkpoint './Pretrain/checkpoint_SPMM.ckpt' --input_file './data/pubchem_1k_unseen.txt' --k 2

  - single: The model takes one query PV and generates `n_generate` molecules with that PV using k-beam search. The generated molecules will be written in `generated_molecules.txt`. Here, you need to build your input PV in the file `p2s_input.csv`. Check the four examples that we included, and see the sketch after this list.

    python d_pv2smiles_single.py --checkpoint './Pretrain/checkpoint_SPMM.ckpt' --n_generate 1000 --stochastic True --k 2
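The authoritative format of `p2s_input.csv` is defined by the four bundled examples; the pandas sketch below only illustrates the general idea of pairing the 53 property names from `property_name.txt` with target values. The chosen properties, values, and output file name are all hypothetical.

```python
# Sketch only: follow the four examples shipped with the repo for the real
# p2s_input.csv format. This just pairs the 53 property names from
# property_name.txt with (hypothetical) target values in a one-row CSV.
import math
import pandas as pd

with open("property_name.txt") as f:
    names = [line.strip() for line in f if line.strip()]  # 53 property names

# Leave most properties unspecified (NaN) and set a few hypothetical targets.
targets = {name: math.nan for name in names}
targets[names[0]] = 350.0   # hypothetical target for the first listed property
targets[names[1]] = 2.5     # hypothetical target for the second listed property

pd.DataFrame([targets]).to_csv("p2s_input_example.csv", index=False)
```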
- SMILES-to-PV generation

  The model takes the query molecules in `input_file` and generates their PVs.

  python d_smiles2pv.py --checkpoint './Pretrain/checkpoint_SPMM.ckpt' --input_file './data/pubchem_1k_unseen.txt'
- MoleculeNet + DILI prediction tasks

  `d_regression.py`, `d_classification.py`, and `d_classification_multilabel.py` perform regression, binary classification, and multi-label classification tasks, respectively.

  python d_regression.py --checkpoint './Pretrain/checkpoint_SPMM.ckpt' --name 'bace'
  python d_classification.py --checkpoint './Pretrain/checkpoint_SPMM.ckpt' --name 'bbbp'
  python d_classification_multilabel.py --checkpoint './Pretrain/checkpoint_SPMM.ckpt' --name 'clintox'
- Forward/retro-reaction prediction tasks

  `d_rxn_prediction.py` performs both forward and retro reaction prediction tasks on the USPTO-480k and USPTO-50k datasets.

  e.g. forward reaction prediction, no beam search:
  python d_rxn_prediction.py --checkpoint './Pretrain/checkpoint_SPMM.ckpt' --mode 'forward' --n_beam 1

  e.g. retro reaction prediction, beam search with k=3:
  python d_rxn_prediction.py --checkpoint './Pretrain/checkpoint_SPMM.ckpt' --mode 'retro' --n_beam 3
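Reaction prediction is commonly scored by exact match after SMILES canonicalization. The helper below is a generic RDKit illustration of that comparison, not the evaluation built into `d_rxn_prediction.py`:

```python
# Generic exact-match check: canonicalize both SMILES with RDKit and compare.
from rdkit import Chem

def smiles_match(pred: str, truth: str) -> bool:
    mp, mt = Chem.MolFromSmiles(pred), Chem.MolFromSmiles(truth)
    if mp is None or mt is None:
        return False
    return Chem.MolToSmiles(mp) == Chem.MolToSmiles(mt)

print(smiles_match("OC(=O)c1ccccc1", "c1ccccc1C(O)=O"))  # True: same molecule
```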
- The code for BERT with cross-attention layers (`xbert.py`) and the schedulers are modified from those in ALBEF.
- The code for SMILES augmentation is taken from pysmilesutils.
