This repository trains binary classifiers on low-level jet information where one of the two training samples is an impure mixture: signal vs. (signal + background). The models use the transformer architecture with modern techniques (FlashAttention, LayerScale, ContextModulation, etc.).
This repo uses PyTorch and Lightning to facilitate training and evaluation of the model, Hydra for configuration management, Weights & Biases for logging, and Snakemake for workflow management.
Additionally, some models require FlashAttention to operate, which necessitates running on Nvidia Ampere (or later) GPUs.
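Whether a GPU meets this requirement can be checked from its CUDA compute capability: Ampere corresponds to major version 8. A minimal sketch of that check; the helper name is illustrative, and the PyTorch query is shown in a comment since it assumes a CUDA-enabled install:

```python
def supports_flash_attention(capability):
    """FlashAttention needs Nvidia compute capability >= 8.0 (Ampere).

    `capability` is a (major, minor) tuple, as returned by
    torch.cuda.get_device_capability().
    """
    major, _minor = capability
    return major >= 8

# With a CUDA-enabled PyTorch install, query the current device like so:
#   import torch
#   supports_flash_attention(torch.cuda.get_device_capability(0))

print(supports_flash_attention((8, 0)))  # A100 (Ampere) -> True
print(supports_flash_attention((7, 5)))  # T4 (Turing)   -> False
```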
├── configs/ # Hydra configuration files
├── docker/ # Docker configuration
├── mltools/ # Local ML utilities package
├── plots/ # Generated plots directory
├── scripts/ # Utility scripts
├── src/ # Main source code
│   ├── data/ # Data loading and preprocessing
│   ├── models/ # Model definitions
│   └── ...
└── workflow/ # Snakemake workflow files
To run this project, first clone the repository.
This project relies on a custom submodule called mltools, stored here on CERN GitLab. It is a collection of useful functions, layers, and networks for deep learning developed by the RODEM group at UNIGE. If you didn't clone the project with the --recursive flag, you can pull the submodule using:
git submodule update --init --recursive
This project is set up to use the CERN GitLab CI/CD to automatically build a Docker image based on docker/Dockerfile and requirements.txt whenever a commit is pushed. The latest images can be found here.
Alternatively, you can build the Docker image locally using docker/Dockerfile.
To install the packages manually, use the requirements.txt file. Note that installing FlashAttention requires the packaging and ninja packages to already be installed:
pip install packaging ninja
pip install -r requirements.txt
The dataset is based on the LHCO Anomaly Detection Challenge dataset. Specifically, we use the:
- Pythia-generated mixture (signal and background), which can be found here under the name events_anomalydetection_v2.h5
- Herwig-generated QCD dataset (background only), which can be found here under the name events_LHCO2020_BlackBox2.h5

These datasets must first be split using scripts/sort_split.py and clustered using scripts/cluster.py.
You can find details for these steps in the Workflow section.
To test whether these steps have been done correctly, you can use the scripts/plot_data.py script to inspect the data.
Two models are trained in this repo:
- Classifier: The binary CWoLA classifier
- SSFM: A self-supervised model which can be finetuned to the CWoLA task. See the paper here for more details.
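In the CWoLA setup the classifier never sees true signal/background labels; it is trained to separate a signal-enriched region from sidebands. A minimal sketch of that labelling idea, with purely illustrative function name, masses, and mass-window values:

```python
def make_cwola_labels(masses, sr_low, sr_high):
    """Label events inside the signal region (SR) as 1 and sideband
    events as 0; true signal/background labels are never used."""
    return [1 if sr_low <= m <= sr_high else 0 for m in masses]

# Illustrative dijet masses (TeV) and SR window
masses = [2.9, 3.4, 3.6, 4.1]
print(make_cwola_labels(masses, sr_low=3.3, sr_high=3.7))  # [0, 1, 1, 0]
```

A classifier trained on these proxy labels learns to score how signal-like an event is, because signal is more abundant in the SR than in the sidebands.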
Training configuration is done through Hydra, which not only allows one to configure a class, but also to choose which class to use via Hydra's instantiation mechanism.
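The idea behind this class selection is that a config block carries a `_target_` key naming the class to build, with the remaining keys passed as constructor arguments. A simplified sketch of the mechanism (not Hydra's actual implementation):

```python
import importlib

def instantiate(cfg):
    """Build the object named by `_target_`, passing the remaining
    config keys as keyword arguments (simplified Hydra-style)."""
    cfg = dict(cfg)
    module_path, _, cls_name = cfg.pop("_target_").rpartition(".")
    cls = getattr(importlib.import_module(module_path), cls_name)
    return cls(**cfg)

# Swapping the `_target_` string swaps which class gets built
delta = instantiate({"_target_": "datetime.timedelta", "seconds": 90})
print(delta.total_seconds())  # 90.0
```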
The main configuration file that composes training is train.yaml:
- This file sets up all the paths to save the network, the trainer and logger
- It also imports additional yaml files from the other folders:
  - model: Chooses which model to train
  - callbacks: Chooses which callbacks to use
  - datamodule: Chooses which datamodule to use
  - experiment: Overwrites any of the config values before composition
The workflow directory contains a snakemake file that covers all of the following steps:
- Data preprocessing
- SSFM pretraining
- CWoLA finetuning
- Combining scores across folds
- Plotting
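The "combining scores across folds" step can be pictured as averaging each event's classifier score over the k CWoLA folds. A minimal sketch; the repository's actual combination scheme may differ:

```python
def combine_fold_scores(fold_scores):
    """Average per-event classifier scores over folds.

    `fold_scores` is a list of k lists, each holding one score per event.
    """
    n_folds = len(fold_scores)
    return [sum(scores) / n_folds for scores in zip(*fold_scores)]

# Two folds, two events
print(combine_fold_scores([[0.25, 0.75], [0.75, 0.25]]))  # [0.5, 0.5]
```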
The pipeline configuration is found in configs/pipeline.yaml, and the workflow profile is found in workflow/config.yaml.
Specific commands in workflow/config.yaml are designed explicitly for the UNIGE HPC cluster and will need to be modified for other systems:
- It uses SLURM as the executor plugin and requires an apptainer image to have been built
- Snakemake releases are not backwards compatible, so specific versions are required to run the workflow
Install the required packages using:
pip install snakemake-executor-plugin-slurm==0.4.1 snakemake==8.4.1
Then run the workflow using:
snakemake --snakefile workflow/example.smk --workflow-profile workflow/
To build only the DAG without running the workflow, append the following to the above command:
... -e dryrun --dag | dot -Tpng > workflow/example.png
This creates the DAG image at workflow/example.png.
This project is licensed under the MIT License. See the LICENSE file for details.