Data Management

Aditya Khedekar edited this page Jun 19, 2024 · 5 revisions

Loading ChEBI Ontology Data

ChEBai accesses the ChEBI ontology data from the following URL: http://purl.obolibrary.org/obo/chebi/{version}/chebi.obo.

You can find more information on the ChEBI ontology here: https://www.ebi.ac.uk/chebi
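As a small illustration of the URL pattern above, the following sketch fills in the `{version}` placeholder for a given release. The helper name `chebi_obo_url` is hypothetical and is not part of ChEBai; only the URL template itself comes from this page.

```python
# URL pattern quoted on this page; {version} is the ChEBI release number.
CHEBI_URL_TEMPLATE = "http://purl.obolibrary.org/obo/chebi/{version}/chebi.obo"


def chebi_obo_url(version: int) -> str:
    """Return the download URL for one ChEBI release (hypothetical helper)."""
    return CHEBI_URL_TEMPLATE.format(version=version)


print(chebi_obo_url(200))
# http://purl.obolibrary.org/obo/chebi/200/chebi.obo
```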

ChEBI versions

To change the ChEBI version used for all sets (default: 200), use:

--data.init_args.chebi_version=VERSION

To change only the version of the train and validation sets, independently of the test set, use:

--data.init_args.chebi_version_train=VERSION
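The relationship between the two options can be pictured with the sketch below. It is not ChEBai's actual code: `resolve_versions` is a hypothetical helper, and the fallback (train and validation use the general version when no separate train version is given) is an assumption based on the description above.

```python
from typing import Dict, Optional


def resolve_versions(
    chebi_version: int = 200,
    chebi_version_train: Optional[int] = None,
) -> Dict[str, int]:
    """Map each split to the ChEBI version it would use (illustrative only)."""
    # Assumption: without an explicit train version, all splits share one version.
    train_version = chebi_version if chebi_version_train is None else chebi_version_train
    return {
        "train": train_version,
        "validation": train_version,
        "test": chebi_version,
    }


print(resolve_versions())                          # all splits on the default version
print(resolve_versions(chebi_version=231, chebi_version_train=200))
```

In the second call, the test set follows `chebi_version` while the train and validation sets follow `chebi_version_train`.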

Data Preprocessing

After loading the ontology data, ChEBai preprocesses it: the class hierarchy is extracted and the data is divided into train, validation, and test sets. During preprocessing, a filter keeps only chemical entities that have at least a minimum number of subclasses (e.g., 50 or 100) annotated with SMILES (Simplified Molecular Input Line Entry System) strings.
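The filtering step can be sketched on toy data as follows. This is a minimal illustration under stated assumptions, not ChEBai's implementation: the function name `select_classes`, the dictionary-based inputs, and the example identifiers are all made up for this sketch.

```python
from typing import Dict, List, Set


def select_classes(
    subclasses: Dict[str, Set[str]],
    smiles: Dict[str, str],
    min_count: int,
) -> List[str]:
    """Keep classes with at least `min_count` subclasses that have a SMILES string."""
    selected = []
    for cls, members in subclasses.items():
        # Count only subclasses that carry a SMILES annotation.
        annotated = sum(1 for member in members if member in smiles)
        if annotated >= min_count:
            selected.append(cls)
    return sorted(selected)


# Toy hierarchy: class -> set of its subclasses (hypothetical identifiers).
toy_subclasses = {"A": {"m1", "m2", "m3"}, "B": {"m1", "m4"}}
# Toy SMILES annotations; "m4" has none.
toy_smiles = {"m1": "C", "m2": "CC", "m3": "CCO"}

print(select_classes(toy_subclasses, toy_smiles, min_count=3))
```

With a threshold of 3, only class `A` is kept, because `B` has a single subclass with a SMILES annotation.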

Data folder structure

Data is organized within the following directory structure:

Contains the raw ChEBI data (in .obo format), downloaded from the ChEBI website:

data/${chebi_version}/${dataset_name}/raw/

Contains the processed data with SMILES strings and boolean class columns, stored in .pkl format, along with a classes.txt file listing the classes in the data:

data/${chebi_version}/${dataset_name}/processed/

Contains the processed data in .pt format, which is compatible with the torch library:

data/${chebi_version}/${dataset_name}/processed/${reader_name}/
  • ${dataset_name} represents the _name attribute of the DataModule used.
  • ${chebi_version} refers to the ChEBI version.
  • ${reader_name} denotes the name attribute of the associated Reader class.
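The layout above can be assembled with `pathlib`, as in this sketch. The helper `data_dirs` and the example values `"my_dataset"` and `"my_reader"` are placeholders chosen for illustration; the real values come from the DataModule's `_name` attribute and the Reader's name attribute.

```python
from pathlib import Path
from typing import Dict


def data_dirs(
    chebi_version: int,
    dataset_name: str,
    reader_name: str,
    base: str = "data",
) -> Dict[str, Path]:
    """Build the three directories described above (hypothetical helper)."""
    root = Path(base) / str(chebi_version) / dataset_name
    return {
        "raw": root / "raw",                                  # .obo download
        "processed": root / "processed",                      # .pkl and classes.txt
        "processed_reader": root / "processed" / reader_name, # .pt files
    }


for name, path in data_dirs(200, "my_dataset", "my_reader").items():
    print(name, path.as_posix())
```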

For cross-validation, the folds are stored as cv_${n_folds}_fold/fold_${fold_index}_train.pkl and cv_${n_folds}_fold/fold_${fold_index}_validation.pkl in the raw directory.
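The fold file names can be enumerated with a short sketch like the one below. The helper `fold_files` is hypothetical, and numbering the folds from 0 is an assumption; this page does not state where the fold index starts.

```python
from pathlib import Path
from typing import List, Tuple


def fold_files(raw_dir: str, n_folds: int) -> List[Tuple[Path, Path]]:
    """List (train, validation) file paths for each fold (illustrative only)."""
    fold_dir = Path(raw_dir) / f"cv_{n_folds}_fold"
    # Assumption: fold indices run from 0 to n_folds - 1.
    return [
        (
            fold_dir / f"fold_{i}_train.pkl",
            fold_dir / f"fold_{i}_validation.pkl",
        )
        for i in range(n_folds)
    ]


for train_path, validation_path in fold_files("data/200/my_dataset/raw", 3):
    print(train_path.as_posix(), validation_path.as_posix())
```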