Data Management
ChEBai accesses the ChEBI ontology data from the following URL: http://purl.obolibrary.org/obo/chebi/{version}/chebi.obo.
You can find more information on the ChEBI ontology here: https://www.ebi.ac.uk/chebi
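As a minimal sketch of how the versioned download URL above is formed, the snippet below substitutes a release number into the template. The helper name `chebi_obo_url` is hypothetical and is not taken from the ChEBai code base.

```python
# URL template as given in the text; {version} is the ChEBI release number.
CHEBI_URL_TEMPLATE = "http://purl.obolibrary.org/obo/chebi/{version}/chebi.obo"


def chebi_obo_url(version: int) -> str:
    """Return the .obo download URL for a ChEBI release (hypothetical helper)."""
    return CHEBI_URL_TEMPLATE.format(version=version)


if __name__ == "__main__":
    # 200 is the default version mentioned in this document.
    print(chebi_obo_url(200))
```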
Change the ChEBI version used for all sets (default: 200):
--data.init_args.chebi_version=VERSION
To change the version of the train and validation sets independently of the test set, use:
--data.init_args.chebi_version_train=VERSION
After loading the ontology data, ChEBai preprocesses it: it extracts the class hierarchy and splits the data into train, validation, and test sets. During preprocessing, a filter keeps only those chemical entities that have at least a minimum number of subclasses (e.g., 50 or 100) annotated with SMILES (Simplified Molecular Input Line Entry System) strings.
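The filtering step can be illustrated with a small, self-contained sketch. It is not ChEBai's actual implementation: the toy hierarchy, the function names, and the choice to count all transitive subclasses (excluding the class itself) are assumptions made for illustration only.

```python
from typing import Dict, List, Optional, Set


def descendants(children: Dict[str, List[str]], node: str) -> Set[str]:
    """Collect all transitive subclasses of `node`, excluding `node` itself."""
    seen: Set[str] = set()
    stack = list(children.get(node, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(children.get(current, []))
    return seen


def select_classes(
    children: Dict[str, List[str]],
    smiles: Dict[str, Optional[str]],
    threshold: int,
) -> List[str]:
    """Keep classes with at least `threshold` SMILES-annotated subclasses."""
    selected = []
    for node in sorted(smiles):
        n_annotated = sum(
            1 for d in descendants(children, node) if smiles.get(d) is not None
        )
        if n_annotated >= threshold:
            selected.append(node)
    return selected


# Toy hierarchy (hypothetical identifiers): parent -> direct subclasses.
children = {"A": ["B", "C"], "B": ["D", "E"], "C": ["F"]}
# SMILES annotation per entity; None means "no SMILES available".
smiles = {"A": None, "B": None, "C": "CC", "D": "C", "E": "CO", "F": None}

if __name__ == "__main__":
    print(select_classes(children, smiles, threshold=2))
```

With real data the threshold would be a value such as 50 or 100, as mentioned above; the toy example uses a small threshold so the effect is visible.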
Data is organized within the following directory structure:
Contains the raw ChEBI data (in .obo format), downloaded from the ChEBI website:
data/${chebi_version}/${dataset_name}/raw/
Contains the processed data with SMILES strings and boolean class columns, stored in .pkl format, along with a classes.txt file listing the classes for the data:
data/${chebi_version}/${dataset_name}/processed/
Contains the processed data in .pt format, which is compatible with the torch library:
data/${chebi_version}/${dataset_name}/processed/${reader_name}/
- ${dataset_name} represents the _name attribute of the DataModule used.
- ${chebi_version} refers to the ChEBI version.
- ${reader_name} denotes the name attribute of the associated Reader class.
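The directory layout described above can be sketched with `pathlib`. The function name `data_dirs` and the example values `"ChEBI50"` and `"smiles_token"` are hypothetical placeholders; the real values come from the `_name` attribute of the DataModule and the `name` attribute of the Reader class.

```python
from pathlib import PurePosixPath
from typing import Dict


def data_dirs(
    chebi_version: int, dataset_name: str, reader_name: str, root: str = "data"
) -> Dict[str, PurePosixPath]:
    """Build the three data directories following the layout in this document."""
    base = PurePosixPath(root) / str(chebi_version) / dataset_name
    processed = base / "processed"
    return {
        "raw": base / "raw",                 # raw .obo download
        "processed": processed,              # .pkl data and classes.txt
        "reader": processed / reader_name,   # reader-specific .pt files
    }


if __name__ == "__main__":
    # Example values are assumptions, not names confirmed by the source.
    for key, path in data_dirs(200, "ChEBI50", "smiles_token").items():
        print(f"{key}: {path}")
```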
For cross-validation, the folds are stored as cv_${n_folds}_fold/fold_{fold_index}_train.pkl and cv_${n_folds}_fold/fold_{fold_index}_validation.pkl in the raw directory.
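A short sketch of the fold file naming scheme follows. The helper `fold_files` is hypothetical, and it assumes that fold indices start at 0, which the text does not state.

```python
from pathlib import PurePosixPath
from typing import List, Tuple


def fold_files(
    raw_dir: str, n_folds: int
) -> List[Tuple[PurePosixPath, PurePosixPath]]:
    """Return (train, validation) file paths for each cross-validation fold."""
    cv_dir = PurePosixPath(raw_dir) / f"cv_{n_folds}_fold"
    # Assumption: fold indices run from 0 to n_folds - 1.
    return [
        (cv_dir / f"fold_{i}_train.pkl", cv_dir / f"fold_{i}_validation.pkl")
        for i in range(n_folds)
    ]


if __name__ == "__main__":
    # The raw directory below uses hypothetical example values.
    for train_path, val_path in fold_files("data/200/ChEBI50/raw", 5):
        print(train_path, val_path)
```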