
Commit 47126b8

Merge pull request #3 from ChEB-AI/fix/error-propagation
lookup, chemlog-extra, error propagation, C3P
2 parents 5d1305e + 0377027 commit 47126b8

23 files changed, +1463 -299 lines changed

.github/workflows/lint.yml

Lines changed: 26 additions & 0 deletions
@@ -0,0 +1,26 @@
+name: Lint
+
+on: [push, pull_request]
+
+jobs:
+  lint:
+    runs-on: ubuntu-latest
+
+    steps:
+      - uses: actions/checkout@v2
+
+      - name: Set up Python
+        uses: actions/setup-python@v4
+        with:
+          python-version: '3.10' # or any version your project uses
+
+      - name: Install dependencies
+        run: |
+          python -m pip install --upgrade pip
+          pip install black==25.1.0 ruff==0.12.2
+
+      - name: Run Black
+        run: black --check .
+
+      - name: Run Ruff (no formatting)
+        run: ruff check . --no-fix

.gitignore

Lines changed: 179 additions & 0 deletions
@@ -0,0 +1,179 @@
+# Byte-compiled / optimized / DLL files
+__pycache__/
+*.py[cod]
+*$py.class
+
+# C extensions
+*.so
+
+# Distribution / packaging
+.Python
+build/
+develop-eggs/
+dist/
+downloads/
+eggs/
+.eggs/
+lib/
+lib64/
+parts/
+sdist/
+var/
+wheels/
+share/python-wheels/
+*.egg-info/
+.installed.cfg
+*.egg
+MANIFEST
+
+# PyInstaller
+# Usually these files are written by a python script from a template
+# before PyInstaller builds the exe, so as to inject date/other infos into it.
+*.manifest
+*.spec
+
+# Installer logs
+pip-log.txt
+pip-delete-this-directory.txt
+
+# Unit test / coverage reports
+htmlcov/
+.tox/
+.nox/
+.coverage
+.coverage.*
+.cache
+nosetests.xml
+coverage.xml
+*.cover
+*.py,cover
+.hypothesis/
+.pytest_cache/
+cover/
+
+# Translations
+*.mo
+*.pot
+
+# Django stuff:
+*.log
+local_settings.py
+db.sqlite3
+db.sqlite3-journal
+
+# Flask stuff:
+instance/
+.webassets-cache
+
+# Scrapy stuff:
+.scrapy
+
+# Sphinx documentation
+docs/_build/
+docs/build/
+
+# PyBuilder
+.pybuilder/
+target/
+
+# Jupyter Notebook
+.ipynb_checkpoints
+
+# IPython
+profile_default/
+ipython_config.py
+
+# pyenv
+# For a library or package, you might want to ignore these files since the code is
+# intended to run in multiple environments; otherwise, check them in:
+# .python-version
+
+# pipenv
+# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
+# However, in case of collaboration, if having platform-specific dependencies or dependencies
+# having no cross-platform support, pipenv may install dependencies that don't work, or not
+# install all needed dependencies.
+#Pipfile.lock
+
+# poetry
+# Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
+# This is especially recommended for binary packages to ensure reproducibility, and is more
+# commonly ignored for libraries.
+# https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
+#poetry.lock
+
+# pdm
+# Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
+#pdm.lock
+# pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it
+# in version control.
+# https://pdm.fming.dev/#use-with-ide
+.pdm.toml
+
+# PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
+__pypackages__/
+
+# Celery stuff
+celerybeat-schedule
+celerybeat.pid
+
+# SageMath parsed files
+*.sage.py
+
+# Environments
+.env
+.venv
+env/
+venv/
+ENV/
+env.bak/
+venv.bak/
+
+# Spyder project settings
+.spyderproject
+.spyproject
+
+# Rope project settings
+.ropeproject
+
+# mkdocs documentation
+/site
+
+# mypy
+.mypy_cache/
+.dmypy.json
+dmypy.json
+
+# Pyre type checker
+.pyre/
+
+# pytype static type analyzer
+.pytype/
+
+# Cython debug symbols
+cython_debug/
+
+# PyCharm
+# JetBrains specific template is maintained in a separate JetBrains.gitignore that can
+# be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
+# and can be added to the global gitignore or merged into this file. For a more nuclear
+# option (not recommended) you can uncomment the following to ignore the entire idea folder.
+#.idea/
+
+# configs/ # commented as new configs can be added as a part of a feature
+
+/.idea
+/data
+/logs
+/results_buffer
+electra_pretrained.ckpt
+
+build
+.virtual_documents
+.jupyter
+chebai.egg-info
+lightning_logs
+logs
+.isort.cfg
+/.vscode
+/api/.cloned_repos

.pre-commit-config.yaml

Lines changed: 31 additions & 0 deletions
@@ -0,0 +1,31 @@
+repos:
+  - repo: https://github.com/psf/black
+    rev: "25.1.0"
+    hooks:
+      - id: black
+      - id: black-jupyter # for formatting jupyter-notebook
+
+  - repo: https://github.com/pycqa/isort
+    rev: 5.13.2
+    hooks:
+      - id: isort
+        name: isort (python)
+        args: ["--profile=black"]
+
+  - repo: https://github.com/asottile/seed-isort-config
+    rev: v2.2.0
+    hooks:
+      - id: seed-isort-config
+
+  - repo: https://github.com/pre-commit/pre-commit-hooks
+    rev: v4.6.0
+    hooks:
+      - id: check-yaml
+      - id: end-of-file-fixer
+      - id: trailing-whitespace
+
+  - repo: https://github.com/astral-sh/ruff-pre-commit
+    rev: v0.12.2
+    hooks:
+      - id: ruff
+        args: [--fix]

README.md

Lines changed: 49 additions & 16 deletions
@@ -1,8 +1,17 @@
 # python-chebifier
-An AI ensemble model for predicting chemical classes in the ChEBI ontology.
+An AI ensemble model for predicting chemical classes in the ChEBI ontology. It integrates deep learning models,
+rule-based models and generative AI-based models.
+
+A web application for the ensemble is available at https://chebifier.hastingslab.org/.
 
 ## Installation
 
+You can get the package from PyPI:
+```bash
+pip install chebifier
+```
+
+or get the latest development version from GitHub:
 ```bash
 # Clone the repository
 git clone https://github.com/yourusername/python-chebifier.git
@@ -12,7 +21,7 @@ cd python-chebifier
 pip install -e .
 ```
 
-Some dependencies of `chebai-graph` cannot be installed automatically. If you want to use Graph Neural Networks, follow
+`chebai-graph` and its dependencies cannot be installed automatically. If you want to use Graph Neural Networks, follow
 the instructions in the [chebai-graph repository](https://github.com/ChEB-AI/python-chebai-graph).
 
 ## Usage
@@ -21,23 +30,25 @@ the instructions in the [chebai-graph repository](https://github.com/ChEB-AI/pyt
 
 The package provides a command-line interface (CLI) for making predictions using an ensemble model.
 
-```bash
-# Get help
-python -m chebifier.cli --help
+The ensemble configuration is given by a configuration file (by default, this is `chebifier/ensemble.yml`). If you
+want to change which models are included in the ensemble or how they are weighted, you can create your own configuration file.
 
-# Make predictions using a configuration file
-python -m chebifier.cli predict configs/example_config.yml --smiles "CC(=O)OC1=CC=CC=C1C(=O)O" "C1=CC=C(C=C1)C(=O)O"
+Model weights for deep learning models are downloaded automatically from [Hugging Face](https://huggingface.co/chebai).
+However, you can also supply your own model checkpoints (see `configs/example_config.yml` for an example).
 
-# Make predictions using SMILES from a file
-python -m chebifier.cli predict configs/example_config.yml --smiles-file smiles.txt
-```
+```bash
+# Make predictions
+python -m chebifier predict --smiles "CC(=O)OC1=CC=CC=C1C(=O)O" --smiles "C1=CC=C(C=C1)C(=O)O"
 
-### Configuration File
+# Make predictions using SMILES from a file
+python -m chebifier predict --smiles-file smiles.txt
 
-The CLI requires a YAML configuration file that defines the ensemble model. An example can be found in `configs/example_config.yml`.
+# Make predictions using a configuration file
+python -m chebifier predict --ensemble-config configs/my_config.yml --smiles-file smiles.txt
 
-The models and other required files are trained / generated by our [chebai](https://github.com/ChEB-AI/python-chebai) package.
-Examples for models can be found on [kaggle](https://www.kaggle.com/datasets/sfluegel/chebai).
+# Get all available options
+python -m chebifier predict --help
+```
 
 ### Python API
 
@@ -67,7 +78,29 @@ for smiles, prediction in zip(smiles_list, predictions):
         print("No predictions")
 ```
 
+### The models
+Currently, the following models are supported:
+
+
+| Model | Description | #Classes | Publication | Repository |
+|-------|-------------|----------|-------------|------------|
+| `electra` | A transformer-based deep learning model trained on ChEBI SMILES strings. | 1522 | [Glauer, Martin, et al., 2024: Chebifier: Automating semantic classification in ChEBI to accelerate data-driven discovery, Digital Discovery 3 (2024) 896-907](https://pubs.rsc.org/en/content/articlehtml/2024/dd/d3dd00238a) | [python-chebai](https://github.com/ChEB-AI/python-chebai) |
+| `resgated` | A Residual Gated Graph Convolutional Network trained on ChEBI molecules. | 1522 | | [python-chebai-graph](https://github.com/ChEB-AI/python-chebai-graph) |
+| `chemlog_peptides` | A rule-based model specialised in peptide classes. | 18 | [Flügel, Simon, et al., 2025: ChemLog: Making MSOL Viable for Ontological Classification and Learning, arXiv](https://arxiv.org/abs/2507.13987) | [chemlog-peptides](https://github.com/sfluegel05/chemlog-peptides) |
+| `chemlog_element`, `chemlog_organox` | Extensions of ChemLog for classes that are defined either by the presence of a specific element or by the presence of an organic bond. | 118 + 37 | | [chemlog-extra](https://github.com/ChEB-AI/chemlog-extra) |
+| `c3p` | A collection of _Chemical Classifier Programs_, generated by LLMs based on the natural language definitions of ChEBI classes. | 338 | [Mungall, Christopher J., et al., 2025: Chemical classification program synthesis using generative artificial intelligence, arXiv](https://arxiv.org/abs/2505.18470) | [c3p](https://github.com/chemkg/c3p) |
+
+In addition, Chebifier includes a ChEBI lookup that automatically retrieves the ChEBI superclasses for a class
+matched by a SMILES string. This is not activated by default, but can be included by adding
+```yaml
+chebi_lookup:
+  type: chebi_lookup
+  model_weight: 10 # optional
+```
+to your configuration file.
+
 ### The ensemble
+<img width="700" alt="ensemble_architecture" src="https://github.com/user-attachments/assets/9275d3cd-ac88-466f-a1e9-27d20d67543b" />
 
 Given a sample (i.e., a SMILES string) and models $m_1, m_2, \ldots, m_n$, the ensemble works as follows:
 1. Get predictions from each model $m_i$ for the sample.
@@ -103,7 +136,7 @@ Trust is based on the model's performance on a validation set. After training, w
 on a validation set for each class. If the `ensemble_type` is set to `wmv-f1`, the trust is calculated as 1 + the F1 score.
 If the `ensemble_type` is set to `mv` (the default), the trust is set to 1 for all models.
 
-### Inconsistency correction
+### Inconsistency resolution
 After a decision has been made for each class independently, the consistency of the predictions with regard to the ChEBI hierarchy
 and disjointness axioms is checked. This is
 done in 3 steps:
@@ -114,7 +147,7 @@ belongs to the direct and indirect superclasses (e.g., primary alcohol, aromatic
 - (2) Next, we check for disjointness. This is not specified directly in ChEBI, but in an additional ChEBI module ([chebi-disjoints.owl](https://ftp.ebi.ac.uk/pub/databases/chebi/ontology/)).
 We have extracted these disjointness axioms into a CSV file and added some more disjointness axioms ourselves (see
 `data>disjoint_chebi.csv` and `data>disjoint_additional.csv`). If two classes $A$ and $B$ are disjoint and we predict
-both, we select one of them randomly (https://github.com/ChEB-AI/python-chebifier/issues/6) and set the other to 0.
+both, we select the one with the higher class score and set the other to 0.
 - (3) Since the second step might have introduced new inconsistencies into the hierarchy, we repeat the first step, but
 with a small change. For a pair of classes $A \subseteq B$ with predictions $1$ and $0$, instead of setting $B$ to $1$,
 we now set $A$ to $0$. This has the advantage that we cannot introduce new disjointness-inconsistencies and don't have
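
The three resolution steps described in this README hunk can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the package's implementation: the function name `resolve_inconsistencies`, the dict-based inputs, and the tie-breaking rule (on equal scores the first class of the pair is kept) are all hypothetical. `superclasses` is assumed to map each class to all of its direct and indirect superclasses.

```python
from typing import Dict, Iterable, Set, Tuple


def resolve_inconsistencies(
    pred: Dict[str, int],
    score: Dict[str, float],
    superclasses: Dict[str, Set[str]],
    disjoint_pairs: Iterable[Tuple[str, str]],
) -> Dict[str, int]:
    """Hypothetical sketch of the three-step resolution described above."""
    pred = dict(pred)  # work on a copy

    # Step 1: a positive class implies all of its superclasses.
    for cls, value in list(pred.items()):
        if value == 1:
            for sup in superclasses.get(cls, ()):
                pred[sup] = 1

    # Step 2: if two disjoint classes are both predicted, keep the one
    # with the higher class score (assumption: ties keep the first class).
    for a, b in disjoint_pairs:
        if pred.get(a) == 1 and pred.get(b) == 1:
            loser = b if score.get(a, 0.0) >= score.get(b, 0.0) else a
            pred[loser] = 0

    # Step 3: repeat the hierarchy check, but now remove a subclass
    # whenever one of its superclasses is negative.
    for cls in list(pred):
        if pred[cls] == 1 and any(
            pred.get(sup, 0) == 0 for sup in superclasses.get(cls, ())
        ):
            pred[cls] = 0

    return pred


# Toy example with made-up class names and scores.
result = resolve_inconsistencies(
    pred={"primary alcohol": 1, "A": 1, "B": 1, "A_child": 1},
    score={"primary alcohol": 0.8, "A": 0.6, "B": 0.9, "A_child": 0.7},
    superclasses={"primary alcohol": {"alcohol"}, "A_child": {"A"}},
    disjoint_pairs=[("A", "B")],
)
print(result)
```

In the toy example, `A` loses against the disjoint class `B` in step 2, so its subclass `A_child` is removed in step 3, while `alcohol` is added in step 1.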

chebifier/__main__.py

Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
+from chebifier.cli import cli
+
+if __name__ == "__main__":
+    cli()

chebifier/check_env.py

Lines changed: 30 additions & 0 deletions
@@ -0,0 +1,30 @@
+import subprocess
+import sys
+
+
+def get_current_environment() -> str:
+    """
+    Return the path of the Python executable for the current environment.
+    """
+    return sys.executable
+
+
+def check_package_installed(package_name: str) -> None:
+    """
+    Check if the given package is installed in the current Python environment.
+    """
+    python_exec = get_current_environment()
+    try:
+        subprocess.check_output(
+            [python_exec, "-m", "pip", "show", package_name], stderr=subprocess.DEVNULL
+        )
+        print(f"✅ Package '{package_name}' is already installed.")
+    except subprocess.CalledProcessError:
+        # Raising a bare string is a TypeError in Python 3; raise a real exception.
+        raise RuntimeError(
+            f"❌ Please install '{package_name}' into your environment: {python_exec}"
+        )
+
+
+if __name__ == "__main__":
+    print(f"🔍 Using Python executable: {get_current_environment()}")
+    check_package_installed("numpy")  # Replace with your desired package
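
As an aside, the same kind of presence check can be done without spawning a `pip` subprocess. The sketch below uses the standard library's `importlib.metadata` instead of `pip show`; it is an alternative technique, not what this commit does, and `is_package_installed` is a hypothetical helper name.

```python
from importlib import metadata


def is_package_installed(package_name: str) -> bool:
    """Return True if a distribution with this name is installed.

    Hypothetical helper: looks up the installed distribution's metadata
    in the current interpreter instead of calling `pip show`.
    """
    try:
        metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return False
    return True


if __name__ == "__main__":
    print(is_package_installed("numpy"))
```

Unlike the subprocess variant, this returns a boolean and leaves it to the caller to decide whether a missing package should raise an error.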
