A minimal and customizable implementation for fine-tuning Open Language Models (LLaMA, Qwen, etc.) on reasoning tasks with verifiable rewards, using two approaches:
- GRPO (Group Relative Policy Optimization): Gradient-based RL policy optimization introduced by DeepSeekMath, with support for LoRA adaptation
- ES (Evolution Strategies): Gradient-free evolutionary optimization with full-parameter fine-tuning, based on recent work by Qiu et al. (2025)
This repo currently includes GSM8K (by OpenAI) and MathExpr datasets, and can be adapted to other tasks.
- ✅ Effective, minimal codebase: Easy to understand and modify
- ✅ Two optimization strategies: Compare gradient-based (GRPO) vs gradient-free (ES) approaches
- ✅ LoRA support for GRPO: Parameter-efficient fine-tuning with PEFT
- ✅ Full parameter ES fine-tuning: Direct optimization in billion-parameter spaces
- ✅ Flexible configuration: YAML-based configuration for both algorithms
- ✅ Multiple datasets: GSM8K and MathExpr included, easily extensible
- ✅ Custom rewards: Easily adapt to your own tasks and reward functions
- Clone the repository:
```bash
git clone https://github.com/yourusername/Minimal-GRPO.git
cd Minimal-GRPO
```
- Install dependencies:
```bash
pip install -r requirements.txt
```
- Optional (recommended for efficiency):
```bash
pip install flash-attn --no-build-isolation
```
- Configure your training in `grpo_config.yml` or `es_config.yml`
- Run the corresponding training script:
```bash
python grpo_train.py
```
- Monitor with TensorBoard:
```bash
tensorboard --logdir=grpo_logs
```
To adapt this code to your own dataset and task:
- Implement your dataset in `datasets.py` (follow the GSM8K or MathExpr examples; a rough sketch of a dataset and reward follows this list)
- Define your reward function in `reward.py` to match your task's success criteria
- Adjust the system prompt in the training scripts (`grpo_train.py` or `es_train.py`) to match your task format
- Update the DataLoader in the training script to use your new dataset
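For illustration only, a minimal dataset and reward could look like the sketch below. This is not the repo's actual interface: the class name `MyTaskDataset`, the example fields, and the `#### <answer>` output convention are placeholders you should replace with whatever `datasets.py` and `reward.py` actually expect.

```python
# Hypothetical sketch; match the actual interfaces in datasets.py and reward.py.
from torch.utils.data import Dataset


class MyTaskDataset(Dataset):
    """Minimal prompt/answer dataset, loosely modeled on the GSM8K example."""

    def __init__(self, examples):
        # examples: list of dicts like {"question": str, "answer": str}
        self.examples = examples

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, idx):
        ex = self.examples[idx]
        return {"prompt": ex["question"], "answer": ex["answer"]}


def my_reward(completion: str, answer: str) -> float:
    """Verifiable outcome reward: 1.0 if the final answer matches, else 0.0."""
    predicted = completion.split("####")[-1].strip()  # assumes a '#### <answer>' output format
    return 1.0 if predicted == answer.strip() else 0.0
```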
GRPO is a gradient-based reinforcement learning algorithm that:
- Uses group-based advantage estimation for stable policy updates
- Combines PPO-style clipping with KL divergence regularization
- Maintains a reference model to prevent catastrophic forgetting
- Efficiently leverages gradient information for policy improvement
- Supports LoRA (Low-Rank Adaptation) for parameter-efficient fine-tuning
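As a rough sketch of the core computation (simplified, not the repo's exact code; `clip_eps` and `kl_coef` are illustrative defaults, not the configured values), the group-relative advantages and the clipped objective with a KL penalty can be written as:

```python
import torch


def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: (num_prompts, group_size) rewards for G sampled completions per prompt.

    Each completion's advantage is its reward standardized against the other
    completions in the same group (zero mean, unit std per prompt).
    """
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)


def grpo_token_loss(logp_new, logp_old, logp_ref, advantages,
                    clip_eps: float = 0.2, kl_coef: float = 0.04):
    """Clipped PPO-style surrogate plus a KL penalty against the reference model.

    logp_*: per-token log-probabilities of the sampled completion tokens;
    advantages: per-token advantages (the group advantage broadcast over tokens).
    """
    ratio = (logp_new - logp_old).exp()
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    surrogate = torch.minimum(ratio * advantages, clipped * advantages)
    # Unbiased estimator of KL(pi_new || pi_ref), as used in the GRPO paper
    log_ratio_ref = logp_ref - logp_new
    kl = log_ratio_ref.exp() - log_ratio_ref - 1.0
    return -(surrogate - kl_coef * kl).mean()
```

In the full algorithm, each prompt's sampled completions are scored by the reward function, the resulting advantages are broadcast over each completion's tokens, and gradients of this loss update the policy (the LoRA parameters when LoRA is enabled).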
Recent research demonstrates that ES can successfully fine-tune LLMs with billions of parameters (Qiu et al., 2025), outperforming RL methods in sample efficiency, robustness, and stability, particularly on tasks with sparse, outcome-only rewards. ES is inspired by natural evolution. It:
- Perturbs model parameters with Gaussian noise
- Evaluates fitness using task-specific rewards
- Updates parameters based on relative performance (z-score normalization)
- Requires only forward passes (no backpropagation)
- Is less prone to reward hacking than RL methods
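A toy sketch of a single ES update on a flat parameter vector is shown below. The actual implementation operates on full model weights and may differ in details (e.g. antithetic sampling or seed-based noise reconstruction); `fitness_fn`, `pop_size`, `sigma`, and `lr` here are illustrative placeholders.

```python
import torch


@torch.no_grad()
def es_step(params: torch.Tensor, fitness_fn, pop_size: int = 30,
            sigma: float = 0.01, lr: float = 0.05) -> torch.Tensor:
    """One ES update on a flat parameter vector.

    fitness_fn maps a parameter vector to a scalar reward (float); only forward
    passes are needed, so no gradients flow through the model.
    """
    noise = torch.randn(pop_size, params.numel())
    rewards = torch.tensor([fitness_fn(params + sigma * eps) for eps in noise])
    # z-score normalize so the update depends only on relative performance
    normed = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    grad_estimate = (normed.unsqueeze(1) * noise).mean(dim=0) / sigma
    return params + lr * grad_estimate
```

Because only relative rankings within the population matter, the z-score normalization makes the update invariant to the scale of the reward.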
Contributions are welcome! Please open an issue or submit a pull request.
- DeepSeekMath: Shao et al., "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models", 2024. arXiv:2402.03300
- ES at Scale for LLMs: Qiu et al., "Evolution Strategies at Scale: LLM Fine-Tuning Beyond Reinforcement Learning", 2025. arXiv:2509.24372
- Salimans et al., "Evolution Strategies as a Scalable Alternative to Reinforcement Learning", 2017. arXiv:1703.03864
- Built with PyTorch, Transformers, Accelerate, and models from Hugging Face