A pipeline that generates error-correcting-code datasets and then trains a deep learning model to distinguish Goppa and other structured codes from random codes.
## Table of Contents
- Requirements
- Quick Start
- Directory Layout
- Dataset Generation
- Training
- Environment & Dependencies
- Citation
- License
## Requirements

- macOS or Linux
- At least one NVIDIA GPU (recommended; training can also run on CPU)
- Python >= 3.10
- PyTorch preinstalled
- SageMath (see [Environment & Dependencies](#environment--dependencies) for installation)
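A quick sanity check for the Python and PyTorch requirements above (a minimal sketch; it only reports versions, not whether the CUDA setup is usable for training):

```python
import sys

# Requirement from above: Python >= 3.10.
ok = sys.version_info >= (3, 10)
print(f"Python {sys.version.split()[0]}: {'OK' if ok else 'too old, need >= 3.10'}")

# PyTorch must be preinstalled for training; report availability without failing
# if it is missing at this point.
try:
    import torch
    print("PyTorch", torch.__version__, "| CUDA available:", torch.cuda.is_available())
except ImportError:
    print("PyTorch not installed yet")
```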
## Quick Start

```bash
# 1. Create the environment (Conda + uv pip)
$ conda create -n ai4code python=3.10 -y
$ conda activate ai4code

# 2. Install dependencies
$ pip install -I uv
$ uv pip install -r requirements.txt
```

If you don't have Sage installed, follow [Environment & Dependencies](#environment--dependencies).

Now you can run the project by first generating some data, then training on it.
Generate 10k samples of binary Goppa codes of code length 32:

```bash
# 3. Generate the Goppa and Random datasets (less than a minute on an 8-core laptop)
$ ./scripts/gen_dataset.sh
```

Now, let's launch the training on GPU 0:
```bash
# 4. Train on the Goppa dataset (GPU 0)
$ ./scripts/train.sh   # if no GPU is available, change the device to cpu
```

You should get a final log after about 1m08s:

```
INFO - 07/07/25 19:43:56 - 0:01:08 - {'eval/0/accuracy': 0.995, 'eval/0/recall_1': 1.0, 'eval/0/recall_0': 0.990, 'eval/eval_loss': 0.030}
INFO - 07/07/25 19:43:56 - 0:01:08 - Finishing Training
```

✅ Training finished.
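After collection, each dataset lives in a single HDF5 file (e.g. `dataset_10K.h5`). A minimal sketch of inspecting such a file with `h5py` — the key names below (`matrices`, `labels`) are illustrative assumptions, not the pipeline's actual layout; list the real file's keys the same way. The toy shapes follow the defaults (a binary Goppa code with n = 32, m = 5, t = 2 has a 10 × 32 parity-check matrix):

```python
import h5py
import numpy as np

# Build a toy HDF5 file mimicking a collected dataset. The key names
# ("matrices", "labels") are assumptions for illustration only.
with h5py.File("toy_dataset.h5", "w") as f:
    f.create_dataset(
        "matrices",
        data=np.random.randint(0, 2, size=(100, 10, 32), dtype=np.uint8),
    )
    f.create_dataset("labels", data=np.ones(100, dtype=np.int64))

# Inspect the file exactly as you would the real dataset_10K.h5.
with h5py.File("toy_dataset.h5", "r") as f:
    print("keys:", sorted(f.keys()))
    print("matrices shape:", f["matrices"].shape)
```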
## Directory Layout

```
.
├── scripts/
│   ├── gen_dataset.sh          # data pipeline (generate + collect)
│   ├── train.sh                # training wrapper
│   ├── generate_data.py        # parallelizable script to generate a lot of data
│   └── collect_data.py         # assembles data generated by many workers into one H5 file
├── data/                       # generated datasets land here
│   ├── dataset_goppa_32_H5/    # example of a generated Goppa dataset
│   └── dataset_random_32_H5/   # example of a generated random dataset
├── train.py                    # main training entry point
├── src/                        # source code
│   ├── trainer.py              # Trainer
│   ├── data/                   # Dataset, Tokenizer, Generator, and DataSource classes
│   ├── model/                  # different models depending on task
│   └── ...                     # optim, logger, metrics, etc.
├── README.md                   # this file
├── requirements.txt            # environment dependencies
└── notebooks/
```
## Dataset Generation

Key parameters are exposed at the top of the Bash script `./scripts/gen_dataset.sh`:

```bash
NUM_WORKERS=10   # parallel processes
N_SAMPLES=10000  # samples per code family
CODE_LEN=32      # code length
T_ALT=2          # error-correction capability
M_ALT=5          # extension degree
```

Modify them in one place and rerun.
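As a sanity check on these defaults: a binary Goppa code with length n = 32, extension degree m = 5, and correction capability t = 2 has a binary parity-check matrix with m·t = 10 rows, so the code dimension is at least k = n − m·t = 22, and n ≤ 2^m must hold. A quick sketch:

```python
# Default parameters from scripts/gen_dataset.sh
n, m, t = 32, 5, 2   # CODE_LEN, M_ALT, T_ALT

# The code length cannot exceed the size of the extension field GF(2^m).
assert n <= 2**m, "code length must not exceed the field size 2^m"

rows = m * t          # rows of the binary parity-check matrix
k_min = n - rows      # lower bound on the code dimension
print(f"parity-check matrix: {rows} x {n}, dimension k >= {k_min}")
# -> parity-check matrix: 10 x 32, dimension k >= 22
```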
Under the hood, those scripts run the following commands for data generation:

1. `scripts.generate_data` for Goppa codes
2. `scripts.collect_data` on the generated shard
3. `scripts.generate_data` for Random codes
4. `scripts.collect_data` on the second shard

```bash
$ python -m scripts.generate_data --code goppa --num_workers 10 --n_samples 10000 --code_len 32 --t_alt 2 --save_every 1000 --m_alt 5 --dump_path ./data --exp_name dataset_goppa_32
$ python -m scripts.collect_data --data_path ./data/dataset_goppa_32 --n_samples 10000 --code goppa --code_len 32 --m_alt 5 --t_alt 2 --Q 2
$ python -m scripts.generate_data --code random --num_workers 10 --n_samples 10000 --code_len 32 --t_alt 2 --save_every 1000 --m_alt 5 --dump_path ./data --exp_name dataset_random_32
$ python -m scripts.collect_data --data_path ./data/dataset_random_32 --n_samples 10000 --code random --code_len 32 --m_alt 5 --t_alt 2 --Q 2
```

## Training

`./scripts/train.sh` starts distinguisher-model training on GPU 0 by default.
All hyper‑parameters (batch sizes, validation cadence, etc.) are defined at the top of the script so you can version‑control them easily.
To run DeepDistinguisher on Goppa codes (versus random codes by default), use `task=code-dist-goppa` and provide the `data_path` for the Goppa dataset. (The random-code dataset is expected at `data_path.replace('goppa', 'random')`.)
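Conceptually, the distinguisher is a binary classifier over parity-check matrices: label 1 for Goppa, 0 for random. A minimal sketch of what a labeled batch might look like, with shapes following the n = 32, m = 5, t = 2 defaults — the Goppa matrices here are uniform-random stand-ins (real ones come from the H5 dataset), and the actual batching/tokenization is handled by the project's data classes:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, t = 32, 5, 2
batch = 8

# Stand-in for Goppa parity-check matrices (real ones are loaded from the
# H5 dataset); random codes are simply uniform binary m*t x n matrices.
goppa_like = rng.integers(0, 2, size=(batch, m * t, n), dtype=np.uint8)
random_mats = rng.integers(0, 2, size=(batch, m * t, n), dtype=np.uint8)

X = np.concatenate([goppa_like, random_mats])          # (16, 10, 32)
y = np.concatenate([np.ones(batch), np.zeros(batch)])  # 1 = goppa, 0 = random
print(X.shape, y.shape)
```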
```bash
$ CUDA_VISIBLE_DEVICES=0 python train.py --task 'code-dist-goppa' --train_batch_size 32 --val_batch_size 1000 --eval_samples 1000 --train_samples 20000 --val_every 500 --log_every 10 --code_len 32 --data_path './data/dataset_goppa_32_H5/goppa_nmt_32_5_2/dataset_10K.h5' --m_alt 5 --t_alt 2 --tqdm True --Q 2
```

To run DeepRecover on the Goppa dataset, use `task=code-complete-goppa` and specify the mask size `n_masked`, i.e. how many matrix entries are hidden from the model. This argument can be given as an integer (e.g. `--n_masked 20`).
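To illustrate what `n_masked` means, here is a hedged sketch of hiding 20 entries of a binary parity-check matrix; the project's actual masking lives in its data pipeline, and the `-1` sentinel is an assumption for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, t, n_masked = 32, 5, 2, 20

# Stand-in binary parity-check matrix (m*t x n).
H = rng.integers(0, 2, size=(m * t, n), dtype=np.int8)

# Choose n_masked distinct entries and hide them; -1 is an
# illustrative mask token (real entries are 0/1, so it never collides).
flat = rng.choice(H.size, size=n_masked, replace=False)
masked = H.copy().ravel()
masked[flat] = -1
masked = masked.reshape(H.shape)

print("hidden entries:", int((masked == -1).sum()))  # -> 20
```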
```bash
$ CUDA_VISIBLE_DEVICES=0 python train.py --task 'code-complete-goppa' --train_batch_size 32 --val_batch_size 1000 --eval_samples 1000 --train_samples 10000 --val_every 500 --log_every 10 --code_len 32 --data_path './data/dataset_goppa_32_H5/goppa_nmt_32_5_2/dataset_10K.h5' --m_alt 5 --t_alt 2 --tqdm True --Q 2 --n_masked 20
```

## Environment & Dependencies

The project imports the `sage` module. Choose one of the installation methods below and make sure it is done before you run the generation script:
| Platform | Command |
|---|---|
| Conda (Linux/macOS/Windows) | `conda install -c conda-forge sage` |
| macOS (Homebrew) | `brew install sagemath` |
| Ubuntu / Debian | `sudo apt-get update && sudo apt-get install sagemath` |
| Docker (isolated) | `docker run --rm -it -v $PWD:/work -w /work sagemath/sagemath sage -python train.py ...` |
Test your install:

```bash
sage -python - <<'PY'
import sage.all, sys
print("✅ SageMath available (Python %s)" % sys.version.split()[0])
PY
```

## Citation

If you use this benchmark in your research, please use the following BibTeX entry:
```bibtex
@misc{cryptoeprint:2025/440,
      author = {Mohamed Malhou and Ludovic Perret and Kristin Lauter},
      title = {{AI} for Code-based Cryptography},
      howpublished = {Cryptology {ePrint} Archive, Paper 2025/440},
      year = {2025},
      url = {https://eprint.iacr.org/2025/440}
}
```
## License

This code is made available under CC BY-NC; however, you may have other legal obligations that govern your use of other content, such as the terms of service for third-party models.