The official GitHub repository for SPMM, a multi-modal molecular pre-trained model for a synergistic comprehension of molecular structure and properties. Details can be found in the following paper: "Bidirectional Generation of Structure and Properties Through a Single Molecular Foundation Model" (Nature Communications, 2024).
Molecular structures are given as SMILES, and we use 53 simple chemical properties to build the property vector (PV) of a molecule.
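As a toy illustration of what a PV contains (not the actual 53-property routine; see `calc_property.py` for that), a few common RDKit descriptors can be collected into a small vector:

```python
# Toy example only: a handful of RDKit descriptors standing in for the 53
# properties computed by calc_property.py (see that file for the real list).
from rdkit import Chem
from rdkit.Chem import Descriptors

def toy_property_vector(smiles: str):
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"invalid SMILES: {smiles}")
    return [
        Descriptors.MolWt(mol),          # molecular weight
        Descriptors.MolLogP(mol),        # Wildman-Crippen logP
        Descriptors.TPSA(mol),           # topological polar surface area
        Descriptors.NumHDonors(mol),     # H-bond donors
        Descriptors.NumHAcceptors(mol),  # H-bond acceptors
    ]

print(toy_property_vector("CC(=O)Oc1ccccc1C(=O)O"))  # aspirin
```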
The model checkpoint and data are too large to be included in this repository; they can be found here.
- `data/`: Contains the data used for the experiments in the paper. (You have to make this folder and put the data that you downloaded from the link above into it.)
- `Pretrain/`: Contains the checkpoint of the pre-trained SPMM. (You have to make this folder and put the checkpoint that you downloaded from the link above into it.)
- `vocab_bpe_300.txt`: Contains the SMILES tokens for the SMILES tokenizer.
- `property_name.txt`: Contains the names of the 53 chemical properties.
- `normalize.pkl`: Contains the mean and standard deviation of the 53 chemical properties that we used for the PV.
- `calc_property.py`: Contains the code to calculate the 53 chemical properties and build a PV for a given SMILES. Modify this code accordingly to utilize SPMM pre-training for your custom PVs.
- `SPMM_models.py`: Contains the SPMM model and its pre-training code.
- `SPMM_pretrain.py`: Runs SPMM pre-training.
- `d_*.py`: Code for the downstream tasks.
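Since `normalize.pkl` stores per-property means and standard deviations, a custom PV is presumably standardized (z-scored) with those statistics before use. A minimal sketch, taking the mean and std arrays as already loaded (the on-disk layout of `normalize.pkl` is not restated here, so inspect the file itself):

```python
# Minimal sketch of PV standardization with the statistics from normalize.pkl.
# `mean` and `std` are assumed to be 53-element arrays ordered as in
# property_name.txt; check normalize.pkl for the actual layout.
import numpy as np

def normalize_pv(pv, mean, std):
    """z-score a raw 53-dimensional property vector."""
    return (np.asarray(pv, dtype=float) - np.asarray(mean)) / np.asarray(std)

def denormalize_pv(pv_norm, mean, std):
    """Map a normalized PV back to the original property scale."""
    return np.asarray(pv_norm, dtype=float) * np.asarray(std) + np.asarray(mean)
```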
Run pip install -r requirements.txt to install the required packages.
Arguments can be passed on the command line, or edited manually in the corresponding script.
- Pre-training

  python SPMM_pretrain.py --data_path './data/pretrain.txt'
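If you pre-train on your own corpus, the sketch below shows one way to prepare it. It assumes the pretraining file is plain text with one SMILES per line (an assumption; verify against the downloaded `pretrain.txt`), and the file names `raw_smiles.txt` and `pretrain_custom.txt` are placeholders.

```python
# Sketch: clean a raw SMILES list into a pretraining corpus.
# Assumes one SMILES per line; verify against the downloaded pretrain.txt
# before relying on this.
from rdkit import Chem

def clean_smiles_file(src="raw_smiles.txt", dst="./data/pretrain_custom.txt"):
    seen, kept = set(), []
    with open(src) as f:
        for line in f:
            smi = line.strip()
            if not smi:
                continue
            mol = Chem.MolFromSmiles(smi)
            if mol is None:                  # drop unparsable SMILES
                continue
            can = Chem.MolToSmiles(mol)      # canonical form for deduplication
            if can not in seen:
                seen.add(can)
                kept.append(can)
    with open(dst, "w") as f:
        f.write("\n".join(kept) + "\n")
    return len(kept)

print(clean_smiles_file(), "molecules written")
```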
- PV-to-SMILES generation
  - batched: The model takes the PVs of the molecules in `input_file` and generates molecules with those PVs using k-beam search. The generated molecules will be written in `generated_molecules.txt`.

    python d_pv2smiles_batched.py --checkpoint './Pretrain/checkpoint_SPMM.ckpt' --input_file './data/pubchem_1k_unseen.txt' --k 2

  - single: The model takes one query PV and generates `n_generate` molecules with that PV using k-beam search. The generated molecules will be written in `generated_molecules.txt`. Here, you need to build your input PV in the file `p2s_input.csv`. Check the four examples that we included, and see the sketch after this list.

    python d_pv2smiles_single.py --checkpoint './Pretrain/checkpoint_SPMM.ckpt' --n_generate 1000 --stochastic True --k 2
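The authoritative format of `p2s_input.csv` is defined by the four bundled examples; the pandas sketch below only illustrates the general idea of pairing the 53 property names from `property_name.txt` with target values. The chosen properties, values, and output file name are all hypothetical.

```python
# Sketch only: follow the four examples shipped with the repo for the real
# p2s_input.csv format. This just pairs the 53 property names from
# property_name.txt with (hypothetical) target values in a one-row CSV.
import math
import pandas as pd

with open("property_name.txt") as f:
    names = [line.strip() for line in f if line.strip()]  # 53 property names

# Leave most properties unspecified (NaN) and set a few hypothetical targets.
targets = {name: math.nan for name in names}
targets[names[0]] = 350.0   # hypothetical target for the first listed property
targets[names[1]] = 2.5     # hypothetical target for the second listed property

pd.DataFrame([targets]).to_csv("p2s_input_example.csv", index=False)
```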
- SMILES-to-PV generation

  The model takes the query molecules in `input_file` and generates their PVs.

  python d_smiles2pv.py --checkpoint './Pretrain/checkpoint_SPMM.ckpt' --input_file './data/pubchem_1k_unseen.txt'
- MoleculeNet + DILI prediction tasks

  `d_regression.py`, `d_classification.py`, and `d_classification_multilabel.py` perform regression, binary classification, and multi-label classification tasks, respectively.

  python d_regression.py --checkpoint './Pretrain/checkpoint_SPMM.ckpt' --name 'bace'
  python d_classification.py --checkpoint './Pretrain/checkpoint_SPMM.ckpt' --name 'bbbp'
  python d_classification_multilabel.py --checkpoint './Pretrain/checkpoint_SPMM.ckpt' --name 'clintox'
- Forward/retro-reaction prediction tasks

  `d_rxn_prediction.py` performs both forward and retro reaction prediction tasks on the USPTO-480k and USPTO-50k datasets.

  e.g. forward reaction prediction, no beam search:
  python d_rxn_prediction.py --checkpoint './Pretrain/checkpoint_SPMM.ckpt' --mode 'forward' --n_beam 1

  e.g. retro reaction prediction, beam search with k=3:
  python d_rxn_prediction.py --checkpoint './Pretrain/checkpoint_SPMM.ckpt' --mode 'retro' --n_beam 3
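Reaction prediction is commonly scored by exact match after SMILES canonicalization. The helper below is a generic RDKit illustration of that comparison, not the evaluation built into `d_rxn_prediction.py`:

```python
# Generic exact-match check: canonicalize both SMILES with RDKit and compare.
from rdkit import Chem

def smiles_match(pred: str, truth: str) -> bool:
    mp, mt = Chem.MolFromSmiles(pred), Chem.MolFromSmiles(truth)
    if mp is None or mt is None:
        return False
    return Chem.MolToSmiles(mp) == Chem.MolToSmiles(mt)

print(smiles_match("OC(=O)c1ccccc1", "c1ccccc1C(O)=O"))  # True: same molecule
```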
- The code for BERT with cross-attention layers (`xbert.py`) and the schedulers are modified from those in ALBEF.
- The code for SMILES augmentation is taken from pysmilesutils.
