43 changes: 29 additions & 14 deletions README.md
@@ -8,9 +8,9 @@ A web application for the ensemble is available at https://chebifier.hastingslab.

Not all models can be installed automatically at the moment:
- `chebai-graph` and its dependencies. To install them, follow
  the instructions in the [chebai-graph repository](https://github.com/ChEB-AI/python-chebai-graph).
- `chemlog-extra` can be installed with `pip install git+https://github.com/ChEB-AI/chemlog-extra.git`
- The automatically installed version of `c3p` may not work under Windows. If you want to run chebifier on Windows, we
recommend using this forked version: `pip install git+https://github.com/sfluegel05/c3p.git`


@@ -38,11 +38,26 @@ The package provides a command-line interface (CLI) for making predictions using
The ensemble configuration is given by a configuration file (by default, this is `chebifier/ensemble.yml`). If you
want to change which models are included in the ensemble or how they are weighted, you can create your own configuration file.

Model weights for deep learning models are automatically downloaded from [Hugging Face](https://huggingface.co/chebai).
To use specific model weights from Hugging Face, add the `load_model` key in your configuration file. For example:

```yaml
my_electra:
type: electra
load_model: "electra_chebi50_v241"
```

### Available model weights

* `electra_chebi50_v241`
* `resgated_chebi50_v241`
* `c3p_with_weights`


However, you can also supply your own model checkpoints (see `configs/example_config.yml` for an example).

```bash
# Make predictions
python -m chebifier predict --smiles "CC(=O)OC1=CC=CC=C1C(=O)O" --smiles "C1=CC=C(C=C1)C(=O)O"

# Make predictions using SMILES from a file
Expand Down Expand Up @@ -96,7 +111,7 @@ Currently, the following models are supported:
| `c3p` | A collection of _Chemical Classifier Programs_, generated by LLMs based on the natural language definitions of ChEBI classes. | 338 | [Mungall, Christopher J., et al., 2025: Chemical classification program synthesis using generative artificial intelligence, arXiv](https://arxiv.org/abs/2505.18470) | [c3p](https://github.com/chemkg/c3p) |

In addition, Chebifier also includes a ChEBI lookup that automatically retrieves the ChEBI superclasses for a class
matched by a SMILES string. This is not activated by default, but can be included by adding
```yaml
chebi_lookup:
type: chebi_lookup
@@ -109,15 +124,15 @@ to your configuration file.

Given a sample (i.e., a SMILES string) and models $m_1, m_2, \ldots, m_n$, the ensemble works as follows:
1. Get predictions from each model $m_i$ for the sample.
2. For each class $c$, aggregate predictions $p_c^{m_i}$ from all models that made a prediction for that class.
The aggregation happens separately for all positive predictions (i.e., $p_c^{m_i} \geq 0.5$) and all negative predictions
($p_c^{m_i} < 0.5$). If the aggregated value is larger for the positive predictions than for the negative predictions,
the ensemble makes a positive prediction for class $c$:

<img width="2297" height="114" alt="image" src="https://github.com/user-attachments/assets/2f0263ae-83ac-41ea-938a-c71b46082c22" />
<!-- For some reason, this formula does not render in GitHub markdown. Therefore, I rendered it locally and added it as an image. The rendered formula is:
$$
\text{ensemble}(c) = \begin{cases}
1 & \text{if } \sum_{i: p_c^{m_i} \geq 0.5} [\text{confidence}_c^{m_i} \cdot \text{model_weight}_{m_i} \cdot \text{trust}_c^{m_i}] > \sum_{i: p_c^{m_i} < 0.5} [\text{confidence}_c^{m_i} \cdot \text{model_weight}_{m_i} \cdot \text{trust}_c^{m_i}] \\
0 & \text{otherwise}
\end{cases}
$$
-->
@@ -135,25 +150,25 @@ Therefore, if in doubt, we are more confident in the negative prediction.

Confidence can be disabled via the `use_confidence` parameter of the `predict` method (default: `True`).
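The aggregation rule above can be sketched in a few lines of Python. This is an illustration only: the function name, arguments, and the exact confidence definition (here, distance from the 0.5 decision boundary) are our assumptions, not the actual Chebifier API.

```python
# Sketch of the weighted-majority aggregation for a single class c.
# All names are illustrative; they are not the Chebifier internals.

def aggregate_class(predictions, model_weights, trusts, use_confidence=True):
    """predictions: dict model_name -> p_c (predicted probability for class c).

    Returns 1 if the weighted positive evidence outweighs the negative evidence.
    """
    pos, neg = 0.0, 0.0
    for name, p in predictions.items():
        # One plausible confidence: how far the prediction is from the boundary.
        confidence = 2 * abs(p - 0.5) if use_confidence else 1.0
        score = confidence * model_weights.get(name, 1) * trusts.get(name, 1)
        if p >= 0.5:
            pos += score
        else:
            neg += score
    return 1 if pos > neg else 0

# Two hesitant positive votes lose against one confident negative vote:
print(aggregate_class({"m1": 0.6, "m2": 0.55, "m3": 0.1}, {}, {}))  # → 0
```

Note that a tie (including the all-default case with confidence disabled) resolves to 0, matching the rule that the strict inequality is required for a positive prediction.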

The model_weight can be set for each model in the configuration file (default: 1). This is used to favor a certain
model independently of a given class.
Trust is based on the model's performance on a validation set. After training, we evaluate the machine learning models
on a validation set for each class. If the `ensemble_type` is set to `wmv-f1`, the trust is calculated as 1 + the F1 score.
If the `ensemble_type` is set to `mv` (the default), the trust is set to 1 for all models.
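The two trust variants can be expressed compactly. This is a sketch with hypothetical names (`trust`, `f1_scores`); Chebifier's real implementation may differ.

```python
# Sketch of the per-class trust described above; names are illustrative.
# f1_scores maps (model_name, chebi_class) -> validation F1 score.

def trust(model, chebi_class, ensemble_type, f1_scores):
    if ensemble_type == "wmv-f1":
        # Weighted majority voting: trust = 1 + validation F1 for this class.
        return 1 + f1_scores.get((model, chebi_class), 0.0)
    # Plain majority voting ("mv", the default): every model counts equally.
    return 1

scores = {("my_electra", "CHEBI:33822"): 0.9}
print(trust("my_electra", "CHEBI:33822", "wmv-f1", scores))  # → 1.9
print(trust("my_electra", "CHEBI:33822", "mv", scores))      # → 1
```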

### Inconsistency resolution
After a decision has been made for each class independently, the consistency of the predictions with regard to the ChEBI hierarchy
and disjointness axioms is checked. This is
done in 3 steps:
- (1) First, the hierarchy is corrected. For each pair of classes $A$ and $B$ where $A$ is a subclass of $B$ (following
the is-a relation in ChEBI), we set the ensemble prediction of $B$ to 1 if the prediction of $A$ is 1. Intuitively
speaking, if we have determined that a molecule belongs to a specific class (e.g., aromatic primary alcohol), it also
belongs to the direct and indirect superclasses (e.g., primary alcohol, aromatic alcohol, alcohol).
- (2) Next, we check for disjointness. This is not specified directly in ChEBI, but in an additional ChEBI module ([chebi-disjoints.owl](https://ftp.ebi.ac.uk/pub/databases/chebi/ontology/)).
We have extracted these disjointness axioms into a CSV file and added some more disjointness axioms ourselves (see
`data>disjoint_chebi.csv` and `data>disjoint_additional.csv`). If two classes $A$ and $B$ are disjoint and we predict
both, we keep the one with the higher class score and set the other to 0.
- (3) Since the second step might have introduced new inconsistencies into the hierarchy, we repeat the first step, but
with a small change. For a pair of classes $A \subseteq B$ with predictions $1$ and $0$, instead of setting $B$ to $1$,
we now set $A$ to $0$. This has the advantage that we cannot introduce new disjointness-inconsistencies and don't have
to repeat step 2.
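The three repair steps can be sketched as follows. This is a minimal illustration under the assumption that `subclass_pairs` contains the transitive closure of the is-a relation (direct and indirect pairs), so a single pass per step suffices; the names and data structures are ours, not the Chebifier internals.

```python
# Sketch of the three-step consistency repair described above.
# Names and data structures are illustrative, not the Chebifier internals.

def repair(preds, scores, subclass_pairs, disjoint_pairs):
    """preds: dict class -> 0/1; scores: dict class -> class score.

    subclass_pairs: (A, B) pairs with A subclass-of B (transitive closure).
    disjoint_pairs: (A, B) pairs that must not both be predicted as 1.
    """
    # (1) Propagate positives upwards: a positive subclass implies all
    #     of its (direct and indirect) superclasses.
    for a, b in subclass_pairs:
        if preds.get(a) == 1:
            preds[b] = 1
    # (2) Resolve disjointness: keep the class with the higher score,
    #     set the weaker one to 0.
    for a, b in disjoint_pairs:
        if preds.get(a) == 1 and preds.get(b) == 1:
            preds[a if scores[a] < scores[b] else b] = 0
    # (3) Re-check the hierarchy, now pushing zeros downwards instead:
    #     A subclass-of B with A = 1 and B = 0 is repaired by setting A to 0,
    #     which cannot introduce new disjointness violations.
    for a, b in subclass_pairs:
        if preds.get(a) == 1 and preds.get(b) == 0:
            preds[a] = 0
    return preds

# Step (1) propagates the positive prediction to both superclasses:
fixed = repair(
    {"aromatic primary alcohol": 1, "primary alcohol": 0, "alcohol": 0},
    {"aromatic primary alcohol": 0.8, "primary alcohol": 0.3, "alcohol": 0.2},
    [("aromatic primary alcohol", "primary alcohol"),
     ("aromatic primary alcohol", "alcohol"),
     ("primary alcohol", "alcohol")],
    [],
)
```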
8 changes: 3 additions & 5 deletions chebifier/ensemble/base_ensemble.py
@@ -3,15 +3,15 @@

import torch
import tqdm

from chebifier.check_env import check_package_installed
from chebifier.hugging_face import download_model_files
from chebifier.inconsistency_resolution import PredictionSmoother
from chebifier.prediction_models.base_predictor import BasePredictor
from chebifier.utils import get_disjoint_files, load_chebi_graph


class BaseEnsemble:

def __init__(
self,
model_configs: dict,
@@ -29,8 +29,6 @@ def __init__(
for model_name, model_config in model_configs.items():
model_cls = MODEL_TYPES[model_config["type"]]
if "hugging_face" in model_config:
hugging_face_kwargs = download_model_files(model_config["hugging_face"])
else:
hugging_face_kwargs = {}
2 changes: 1 addition & 1 deletion configs/example_config.yml
@@ -1,6 +1,6 @@

chemlog_peptides:
type: chemlog_peptides
model_weight: 100 # if chemlog is available, it always gets chosen
my_resgated:
type: resgated
22 changes: 0 additions & 22 deletions configs/huggingface_config.yml

This file was deleted.