AI for Code-Based Cryptography: Code Distinguisher

A pipeline that generates error‑correcting–code datasets and then trains a deep learning model to distinguish Goppa and other structured codes from Random codes.

Table of Contents

  1. Requirements
  2. Quick Start
  3. Directory Layout
  4. Dataset Generation
  5. Training
  6. Environment & Dependencies
  7. Citation
  8. License

Requirements

  • macOS or Linux
  • At least one NVIDIA GPU (for training)
  • Python >= 3.10
  • PyTorch preinstalled
  • SageMath (see Environment & Dependencies for installation)
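A quick sanity check for the Python-side requirements (a sketch; it only covers the interpreter and PyTorch, since SageMath is installed outside pip, see Environment & Dependencies):

```python
import sys

def check_env():
    """Collect basic requirement checks: Python version, PyTorch, GPU."""
    info = {"python_ok": sys.version_info >= (3, 10)}
    try:
        import torch  # PyTorch must be preinstalled
        info["torch"] = torch.__version__
        info["gpu_available"] = torch.cuda.is_available()
    except ImportError:
        info["torch"] = None
        info["gpu_available"] = False
    return info

print(check_env())
```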

Quick Start

# 1. Create environment (Conda + uv pip)
$ conda create -n ai4code python=3.10 -y
$ conda activate ai4code

# 2. Install dependencies
$ pip install -I uv
$ uv pip install -r requirements.txt

If you don't have SageMath installed, follow the instructions in Environment & Dependencies.

Now you can run the project by first generating some data then training on it:

Generate 10k samples of binary Goppa codes of code length $n=32$, extension degree $m=5$, and Goppa polynomial degree $t=2$, plus 10k random codes of the same dimension $k = n - mt$:

# 3. Generate Goppa and Random datasets (less than a minute on 8‑core laptop)
$ ./scripts/gen_dataset.sh
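For the parameters above, the dimension shared by both code families works out to:

```python
# Code dimension k = n - m*t for the Quick Start parameters
n, m, t = 32, 5, 2   # code length, extension degree, polynomial degree
k = n - m * t
print(k)  # 22
```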

Now, let's launch the training on gpu=0:

# 4. Train on the Goppa dataset (GPU 0)
$ ./scripts/train.sh   # if gpu is not available, change device to cpu

You should see a final log after roughly 1m08s:

INFO - 07/07/25 19:43:56 - 0:01:08 - {'eval/0/accuracy': 0.995, 'eval/0/recall_1': 1.0, 'eval/0/recall_0': 0.990, 'eval/eval_loss': 0.030}
INFO - 07/07/25 19:43:56 - 0:01:08 - Finishing Training
✅  Training finished.

Directory Layout

.
├── scripts/
│   ├── gen_dataset.sh              # data pipeline (generate + collect)
│   ├── train.sh                    # training wrapper
│   ├── generate_data.py            # parallelizable data-generation script
│   └── collect_data.py             # assembles data from many workers into one H5 file
├── data/                           # generated datasets land here
│   ├── dataset_goppa_32_H5/        # example of generated goppa dataset
│   └── dataset_random_32_H5/       # example of generated random dataset
├── train.py                        # main training entry‑point
├── src/                            # source code
│   ├── trainer.py                  # Trainer 
│   ├── data/                       # contains Dataset, Tokenizer, Generator, and DataSource classes.
│   ├── model/                      # different models depending on task
│   └── ...                         # optim, logger, metrics, etc.
├── README.md                       # this file
├── requirements.txt                # environment dependencies
└── notebooks/

Dataset Generation

Key parameters are exposed at the top of the Bash script ./scripts/gen_dataset.sh:

NUM_WORKERS=10   # parallel processes
N_SAMPLES=10000  # per‑code family
CODE_LEN=32      # code length
T_ALT=2          # error‑correction capability
M_ALT=5          # extension degree

Modify them in one place and rerun.


Under the hood, the scripts run the following commands for data generation:

  1. scripts.generate_data for Goppa codes
  2. scripts.collect_data on the generated shard
  3. scripts.generate_data for Random codes
  4. scripts.collect_data on the second shard

Spawn 10 workers to generate Goppa codes

$ python -m scripts.generate_data --code goppa --num_workers 10  --n_samples 10000 --code_len 32 --t_alt 2 --save_every 1000 --m_alt 5 --dump_path ./data --exp_name dataset_goppa_32

Gather the generated Goppa data and store it in an H5 file.

$ python -m scripts.collect_data --data_path ./data/dataset_goppa_32 --n_samples 10000 --code goppa --code_len 32  --m_alt 5 --t_alt 2 --Q 2
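Once collected, the resulting file can be inspected with `h5py` (a sketch; the dataset key names inside the file are not documented here, so the helper simply lists whatever is stored):

```python
import h5py

def list_h5_contents(path):
    """Return {name: shape} for every top-level entry in an H5 file
    (shape is None for groups)."""
    entries = {}
    with h5py.File(path, "r") as f:
        for key in f.keys():
            entries[key] = getattr(f[key], "shape", None)
    return entries
```

For the Quick Start run, for example, `list_h5_contents('./data/dataset_goppa_32_H5/goppa_nmt_32_5_2/dataset_10K.h5')`.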

Spawn 10 workers to generate Random codes

$ python -m scripts.generate_data --code random --num_workers 10  --n_samples 10000 --code_len 32 --t_alt 2 --save_every 1000 --m_alt 5 --dump_path ./data --exp_name dataset_random_32

Gather the Random code data and store it in an H5 file.

$ python -m scripts.collect_data --data_path ./data/dataset_random_32 --n_samples 10000 --code random --code_len 32  --m_alt 5 --t_alt 2 --Q 2

Training

./scripts/train.sh starts distinguisher model training on GPU 0 by default.

All hyper‑parameters (batch sizes, validation cadence, etc.) are defined at the top of the script so you can version‑control them easily.

Generator Matrix Distinguisher: DeepDistinguisher

To run DeepDistinguisher on Goppa codes (versus Random codes by default), use task = code-dist-goppa and provide the data_path for the Goppa codes. (The data path for the Random codes is expected at data_path.replace('goppa', 'random').)

$ CUDA_VISIBLE_DEVICES=0 python train.py --task 'code-dist-goppa'  --train_batch_size 32 --val_batch_size 1000 --eval_samples 1000 --train_samples 20000  --val_every 500 --log_every 10 --code_len 32 --data_path './data/dataset_goppa_32_H5/goppa_nmt_32_5_2/dataset_10K.h5' --m_alt 5 --t_alt 2  --tqdm True --Q 2
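Since every 'goppa' substring in the path is replaced, the Random-code dataset for the command above is looked up at:

```python
goppa_path = "./data/dataset_goppa_32_H5/goppa_nmt_32_5_2/dataset_10K.h5"
# str.replace substitutes every occurrence of "goppa"
random_path = goppa_path.replace("goppa", "random")
print(random_path)
# ./data/dataset_random_32_H5/random_nmt_32_5_2/dataset_10K.h5
```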

Generator Matrix Completion: DeepRecover

To run DeepRecover on the Goppa dataset, use task=code-complete-goppa and specify the mask size n_masked, i.e. how many matrix entries are hidden from the model. This argument is either an integer $\geq 1$ or a float $< 1$, in which case it is interpreted as the probability of hiding each entry.

$ CUDA_VISIBLE_DEVICES=0 python train.py --task 'code-complete-goppa'  --train_batch_size 32 --val_batch_size 1000 --eval_samples 1000 --train_samples 10000  --val_every 500 --log_every 10 --code_len 32 --data_path './data/dataset_goppa_32_H5/goppa_nmt_32_5_2/dataset_10K.h5' --m_alt 5 --t_alt 2  --tqdm True --Q 2 --n_masked 20
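The two interpretations of n_masked can be sketched as follows (`mask_entries` is a hypothetical helper illustrating the convention, not the project's actual masking code):

```python
import random

def mask_entries(matrix, n_masked, rng=random):
    """Hide generator-matrix entries: an int >= 1 hides exactly that many
    entries; a float < 1 hides each entry independently with that probability."""
    rows, cols = len(matrix), len(matrix[0])
    masked = [row[:] for row in matrix]
    cells = [(i, j) for i in range(rows) for j in range(cols)]
    if isinstance(n_masked, float) and n_masked < 1:
        hidden = [c for c in cells if rng.random() < n_masked]
    else:
        hidden = rng.sample(cells, int(n_masked))
    for i, j in hidden:
        masked[i][j] = None  # None marks an entry hidden from the model
    return masked
```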

Environment & Dependencies

Installing SageMath (non‑pip dependency)

The project imports the sage module. Choose one of the methods below and make sure it is done before you run the generation script:

| Platform | Command |
| --- | --- |
| Conda (Linux/macOS/Windows) | `conda install -c conda-forge sage` |
| macOS (Homebrew) | `brew install sagemath` |
| Ubuntu / Debian | `sudo apt-get update && sudo apt-get install sagemath` |
| Docker (isolated) | `docker run --rm -it -v $PWD:/work -w /work sagemath/sagemath sage -python train.py ...` |

Test your install:

sage -python - <<'PY'
import sage.all, sys
print("✅ SageMath available (Python %s)" % sys.version.split()[0])
PY

Citation

If you use this benchmark in your research, please use the following BibTeX entry.

@misc{cryptoeprint:2025/440,
      author = {Mohamed Malhou and Ludovic Perret and Kristin Lauter},
      title = {{AI} for Code-based Cryptography},
      howpublished = {Cryptology {ePrint} Archive, Paper 2025/440},
      year = {2025},
      url = {https://eprint.iacr.org/2025/440}
}

License

This code is made available under the CC-BY-NC license; however, you may have other legal obligations governing your use of other content, such as the terms of service for third-party models.

About

Codebase for reproducing the `AI for Code-based Cryptography` paper. It contains code to train a Transformer-based distinguisher and to generate different linear code samples such as Goppa codes, alternant codes, and QC-MDPC codes.
