A pipeline that generates error-correcting-code datasets and then trains a deep learning model to distinguish Goppa and other structured codes from random codes.
## Table of Contents
- Requirements
- Quick Start
- Directory Layout
- Dataset Generation
- Training
- Environment & Dependencies
- Citation
- License
## Requirements

- macOS or Linux
- At least one NVIDIA GPU (recommended; training can also run on CPU)
- Python >= 3.10
- PyTorch preinstalled
- SageMath (see [Environment & Dependencies](#environment--dependencies) for installation)
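A quick sanity check for the Python and PyTorch requirements above (a minimal sketch; it only reports versions, not whether the CUDA setup is usable for training):

```python
import sys

# Requirement from above: Python >= 3.10.
ok = sys.version_info >= (3, 10)
print(f"Python {sys.version.split()[0]}: {'OK' if ok else 'too old, need >= 3.10'}")

# PyTorch must be preinstalled for training; report availability without failing
# if it is missing at this point.
try:
    import torch
    print("PyTorch", torch.__version__, "| CUDA available:", torch.cuda.is_available())
except ImportError:
    print("PyTorch not installed yet")
```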
## Quick Start

```bash
# 1. Create the environment (Conda + uv pip)
$ conda create -n ai4code python=3.10 -y
$ conda activate ai4code

# 2. Install dependencies
$ pip install -I uv
$ uv pip install -r requirements.txt
```

If you don't have Sage installed, follow [Environment & Dependencies](#environment--dependencies).

Now you can run the project by first generating some data, then training on it.
Generate 10k samples of binary Goppa codes of code length 32:

```bash
# 3. Generate the Goppa and Random datasets (less than a minute on an 8-core laptop)
$ ./scripts/gen_dataset.sh
```

Now, let's launch the training on GPU 0:
```bash
# 4. Train on the Goppa dataset (GPU 0)
$ ./scripts/train.sh   # if no GPU is available, change the device to cpu
```

You should get a final log after about 1m08s:

```
INFO - 07/07/25 19:43:56 - 0:01:08 - {'eval/0/accuracy': 0.995, 'eval/0/recall_1': 1.0, 'eval/0/recall_0': 0.990, 'eval/eval_loss': 0.030}
INFO - 07/07/25 19:43:56 - 0:01:08 - Finishing Training
```

✅ Training finished.
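After collection, each dataset lives in a single HDF5 file (e.g. `dataset_10K.h5`). A minimal sketch of inspecting such a file with `h5py` — the key names below (`matrices`, `labels`) are illustrative assumptions, not the pipeline's actual layout; list the real file's keys the same way. The toy shapes follow the defaults (a binary Goppa code with n = 32, m = 5, t = 2 has a 10 × 32 parity-check matrix):

```python
import h5py
import numpy as np

# Build a toy HDF5 file mimicking a collected dataset. The key names
# ("matrices", "labels") are assumptions for illustration only.
with h5py.File("toy_dataset.h5", "w") as f:
    f.create_dataset(
        "matrices",
        data=np.random.randint(0, 2, size=(100, 10, 32), dtype=np.uint8),
    )
    f.create_dataset("labels", data=np.ones(100, dtype=np.int64))

# Inspect the file exactly as you would the real dataset_10K.h5.
with h5py.File("toy_dataset.h5", "r") as f:
    print("keys:", sorted(f.keys()))
    print("matrices shape:", f["matrices"].shape)
```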
## Directory Layout

```
.
├── scripts/
│   ├── gen_dataset.sh          # data pipeline (generate + collect)
│   ├── train.sh                # training wrapper
│   ├── generate_data.py        # parallelizable script to generate a lot of data
│   └── collect_data.py         # assembles data generated by many workers into one H5 file
├── data/                       # generated datasets land here
│   ├── dataset_goppa_32_H5/    # example of a generated Goppa dataset
│   └── dataset_random_32_H5/   # example of a generated random dataset
├── train.py                    # main training entry point
├── src/                        # source code
│   ├── trainer.py              # Trainer
│   ├── data/                   # Dataset, Tokenizer, Generator, and DataSource classes
│   ├── model/                  # different models depending on task
│   └── ...                     # optim, logger, metrics, etc.
├── README.md                   # this file
├── requirements.txt            # environment dependencies
└── notebooks/
```
## Dataset Generation

Key parameters are exposed at the top of the Bash script `./scripts/gen_dataset.sh`:

```bash
NUM_WORKERS=10   # parallel processes
N_SAMPLES=10000  # samples per code family
CODE_LEN=32      # code length
T_ALT=2          # error-correction capability
M_ALT=5          # extension degree
```

Modify them in one place and rerun.
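As a sanity check on these defaults: a binary Goppa code with length n = 32, extension degree m = 5, and correction capability t = 2 has a binary parity-check matrix with m·t = 10 rows, so the code dimension is at least k = n − m·t = 22, and n ≤ 2^m must hold. A quick sketch:

```python
# Default parameters from scripts/gen_dataset.sh
n, m, t = 32, 5, 2   # CODE_LEN, M_ALT, T_ALT

# The code length cannot exceed the size of the extension field GF(2^m).
assert n <= 2**m, "code length must not exceed the field size 2^m"

rows = m * t          # rows of the binary parity-check matrix
k_min = n - rows      # lower bound on the code dimension
print(f"parity-check matrix: {rows} x {n}, dimension k >= {k_min}")
# -> parity-check matrix: 10 x 32, dimension k >= 22
```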
Under the hood, those scripts run the following commands for data generation:

1. `scripts.generate_data` for Goppa codes
2. `scripts.collect_data` on the generated shard
3. `scripts.generate_data` for Random codes
4. `scripts.collect_data` on the second shard

```bash
$ python -m scripts.generate_data --code goppa --num_workers 10 --n_samples 10000 --code_len 32 --t_alt 2 --save_every 1000 --m_alt 5 --dump_path ./data --exp_name dataset_goppa_32
$ python -m scripts.collect_data --data_path ./data/dataset_goppa_32 --n_samples 10000 --code goppa --code_len 32 --m_alt 5 --t_alt 2 --Q 2
$ python -m scripts.generate_data --code random --num_workers 10 --n_samples 10000 --code_len 32 --t_alt 2 --save_every 1000 --m_alt 5 --dump_path ./data --exp_name dataset_random_32
$ python -m scripts.collect_data --data_path ./data/dataset_random_32 --n_samples 10000 --code random --code_len 32 --m_alt 5 --t_alt 2 --Q 2
```

## Training

`./scripts/train.sh` starts distinguisher-model training on GPU 0 by default.
All hyper‑parameters (batch sizes, validation cadence, etc.) are defined at the top of the script so you can version‑control them easily.
To run DeepDistinguisher on Goppa codes (versus random codes by default), use `task=code-dist-goppa` and provide the `data_path` for the Goppa dataset. (The random-code dataset is expected at `data_path.replace('goppa', 'random')`.)
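Conceptually, the distinguisher is a binary classifier over parity-check matrices: label 1 for Goppa, 0 for random. A minimal sketch of what a labeled batch might look like, with shapes following the n = 32, m = 5, t = 2 defaults — the Goppa matrices here are uniform-random stand-ins (real ones come from the H5 dataset), and the actual batching/tokenization is handled by the project's data classes:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, t = 32, 5, 2
batch = 8

# Stand-in for Goppa parity-check matrices (real ones are loaded from the
# H5 dataset); random codes are simply uniform binary m*t x n matrices.
goppa_like = rng.integers(0, 2, size=(batch, m * t, n), dtype=np.uint8)
random_mats = rng.integers(0, 2, size=(batch, m * t, n), dtype=np.uint8)

X = np.concatenate([goppa_like, random_mats])          # (16, 10, 32)
y = np.concatenate([np.ones(batch), np.zeros(batch)])  # 1 = goppa, 0 = random
print(X.shape, y.shape)
```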
```bash
$ CUDA_VISIBLE_DEVICES=0 python train.py --task 'code-dist-goppa' --train_batch_size 32 --val_batch_size 1000 --eval_samples 1000 --train_samples 20000 --val_every 500 --log_every 10 --code_len 32 --data_path './data/dataset_goppa_32_H5/goppa_nmt_32_5_2/dataset_10K.h5' --m_alt 5 --t_alt 2 --tqdm True --Q 2
```

To run DeepRecover on the Goppa dataset, use `task=code-complete-goppa` and specify the mask size `n_masked`, i.e. how many matrix entries are hidden from the model. This argument can be given as an integer (e.g. `--n_masked 20`).
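To illustrate what `n_masked` means, here is a hedged sketch of hiding 20 entries of a binary parity-check matrix; the project's actual masking lives in its data pipeline, and the `-1` sentinel is an assumption for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, t, n_masked = 32, 5, 2, 20

# Stand-in binary parity-check matrix (m*t x n).
H = rng.integers(0, 2, size=(m * t, n), dtype=np.int8)

# Choose n_masked distinct entries and hide them; -1 is an
# illustrative mask token (real entries are 0/1, so it never collides).
flat = rng.choice(H.size, size=n_masked, replace=False)
masked = H.copy().ravel()
masked[flat] = -1
masked = masked.reshape(H.shape)

print("hidden entries:", int((masked == -1).sum()))  # -> 20
```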
```bash
$ CUDA_VISIBLE_DEVICES=0 python train.py --task 'code-complete-goppa' --train_batch_size 32 --val_batch_size 1000 --eval_samples 1000 --train_samples 10000 --val_every 500 --log_every 10 --code_len 32 --data_path './data/dataset_goppa_32_H5/goppa_nmt_32_5_2/dataset_10K.h5' --m_alt 5 --t_alt 2 --tqdm True --Q 2 --n_masked 20
```

## Environment & Dependencies

The project imports the `sage` module. Choose one of the installation methods below and make sure it is done before you run the generation script:
| Platform | Command |
|---|---|
| Conda (Linux/macOS/Windows) | `conda install -c conda-forge sage` |
| macOS (Homebrew) | `brew install sagemath` |
| Ubuntu / Debian | `sudo apt-get update && sudo apt-get install sagemath` |
| Docker (isolated) | `docker run --rm -it -v $PWD:/work -w /work sagemath/sagemath sage -python train.py ...` |
Test your install:

```bash
sage -python - <<'PY'
import sage.all, sys
print("✅ SageMath available (Python %s)" % sys.version.split()[0])
PY
```

## Citation

If you use this benchmark in your research, please use the following BibTeX entry:
```bibtex
@misc{cryptoeprint:2025/440,
      author = {Mohamed Malhou and Ludovic Perret and Kristin Lauter},
      title = {{AI} for Code-based Cryptography},
      howpublished = {Cryptology {ePrint} Archive, Paper 2025/440},
      year = {2025},
      url = {https://eprint.iacr.org/2025/440}
}
```
## License

This code is made available under CC BY-NC; however, you may have other legal obligations that govern your use of other content, such as the terms of service for third-party models.