posttraining-data
is a turnkey, 8-stage pipeline for processing HuggingFace datasets into a training-ready format. It was used to prepare Apertus's post-training data, notably its SFT mixture. More information can be found in the Apertus tech report.
The pipeline consists of the following self-contained stages:
- 01-hf-download: Downloads HuggingFace datasets with metadata tracking → produces HF DatasetDict
- 02-standardisation: Converts datasets to unified chat format → produces HF DatasetDict
- 03-license-based-filtering: Removes samples with licensing restrictions → produces HF DatasetDict
- 04-decontamination: Removes contaminated samples from evaluation sets → produces HF DatasetDict
- 05-annotations: Adds LLM-based classifications and language detection → produces HF DatasetDict
- 06-field-based-filtering: General field analysis and filtering → produces HF DatasetDict
- 07-dataset-aggregation: Combines multiple datasets into training mixtures → produces HF Dataset ready for training
- 08-judge-evaluation: Evaluates datasets with LLM judges.
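As a conceptual illustration of stage 02 (standardisation), converting a flat prompt/response record into a unified chat-message schema might look like the sketch below. This is not the pipeline's actual code; the column names (`prompt`, `response`) and the `messages` schema are assumptions made for the example.

```python
# Minimal sketch of chat-format standardisation (stage 02).
# Field names and schema are illustrative assumptions, not the
# pipeline's actual implementation.

def to_chat(example):
    """Map a flat prompt/response pair to role-tagged chat messages."""
    return {
        "messages": [
            {"role": "user", "content": example["prompt"]},
            {"role": "assistant", "content": example["response"]},
        ]
    }

sample = {"prompt": "What is 2 + 2?", "response": "4"}
print(to_chat(sample)["messages"][1]["role"])  # assistant
```

In the real pipeline this kind of mapping would typically be applied per-sample across a whole HF `DatasetDict` (e.g. via `datasets.Dataset.map`), yielding the unified format consumed by the later stages.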
A few additional run scripts and miscellaneous commands are also provided in `examples`.
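To illustrate the idea behind stage 04 (decontamination), one common approach is to drop training samples that share word-level n-grams with evaluation sets. The sketch below is a hedged example of that technique in general; the actual matching strategy and n-gram size used by the pipeline are not specified here and are assumptions.

```python
# Illustrative n-gram decontamination (stage 04). The n-gram size
# and matching strategy are assumptions for this example only.

def ngrams(text, n=8):
    """Return the set of word-level n-grams in a text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def is_contaminated(sample_text, eval_ngrams, n=8):
    """Flag a training sample that shares any n-gram with the eval sets."""
    return not ngrams(sample_text, n).isdisjoint(eval_ngrams)

eval_set = ["the quick brown fox jumps over the lazy dog today"]
eval_ngrams = set().union(*(ngrams(t) for t in eval_set))
print(is_contaminated("note: the quick brown fox jumps over the lazy dog today",
                      eval_ngrams))  # True
```

Samples flagged this way would be removed before aggregation, so that evaluation benchmarks are not leaked into the training mixture.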
Create a virtual environment and install the dependencies:

```shell
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```