A minimal and customizable implementation for fine-tuning Open Language Models (LLaMA, Qwen, etc.) on reasoning tasks with verifiable rewards, using two approaches:
- GRPO (Group Relative Policy Optimization): Gradient-based RL policy optimization introduced by DeepSeekMath, with support for LoRA adaptation
- ES (Evolution Strategies): Gradient-free evolutionary optimization with full-parameter fine-tuning, based on recent work by Qiu et al. (2025)
This repo currently includes GSM8K (by OpenAI) and MathExpr datasets, and can be adapted to other tasks.
- ✅ Effective, minimal codebase: Easy to understand and modify
- ✅ Two optimization strategies: Compare gradient-based (GRPO) vs gradient-free (ES) approaches
- ✅ LoRA support for GRPO: Parameter-efficient fine-tuning with PEFT
- ✅ Full parameter ES fine-tuning: Direct optimization in billion-parameter spaces
- ✅ Flexible configuration: YAML-based configuration for both algorithms
- ✅ Multiple datasets: GSM8K and MathExpr included, easily extensible
- ✅ Custom rewards: Easily adapt to your own tasks and reward functions
- Clone the repository:
```bash
git clone https://github.com/yourusername/Minimal-GRPO.git
cd Minimal-GRPO
```
- Install dependencies:
```bash
pip install -r requirements.txt
```
- Optional (recommended for efficiency):
```bash
pip install flash-attn --no-build-isolation
```
- Configure your training in `grpo_config.yml` or `es_config.yml`
- Run the corresponding training script:
```bash
python grpo_train.py
```
- Monitor with TensorBoard:
```bash
tensorboard --logdir=grpo_logs
```
To adapt this code to your own dataset and task:
- Implement your dataset in `datasets.py` (follow the GSM8K or MathExpr examples; a rough sketch of a dataset and reward follows this list)
- Define your reward function in `reward.py` to match your task's success criteria
- Adjust the system prompt in the training scripts (`grpo_train.py` or `es_train.py`) to match your task format
- Update the DataLoader in the training script to use your new dataset
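For illustration only, a minimal dataset and reward could look like the sketch below. This is not the repo's actual interface: the class name `MyTaskDataset`, the example fields, and the `#### <answer>` output convention are placeholders you should replace with whatever `datasets.py` and `reward.py` actually expect.

```python
# Hypothetical sketch; match the actual interfaces in datasets.py and reward.py.
from torch.utils.data import Dataset


class MyTaskDataset(Dataset):
    """Minimal prompt/answer dataset, loosely modeled on the GSM8K example."""

    def __init__(self, examples):
        # examples: list of dicts like {"question": str, "answer": str}
        self.examples = examples

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, idx):
        ex = self.examples[idx]
        return {"prompt": ex["question"], "answer": ex["answer"]}


def my_reward(completion: str, answer: str) -> float:
    """Verifiable outcome reward: 1.0 if the final answer matches, else 0.0."""
    predicted = completion.split("####")[-1].strip()  # assumes a '#### <answer>' output format
    return 1.0 if predicted == answer.strip() else 0.0
```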
GRPO is a gradient-based reinforcement learning algorithm that:
- Uses group-based advantage estimation for stable policy updates
- Combines PPO-style clipping with KL divergence regularization
- Maintains a reference model to prevent catastrophic forgetting
- Efficiently leverages gradient information for policy improvement
- Supports LoRA (Low-Rank Adaptation) for parameter-efficient fine-tuning
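As a rough sketch of the core computation (simplified, not the repo's exact code; `clip_eps` and `kl_coef` are illustrative defaults, not the configured values), the group-relative advantages and the clipped objective with a KL penalty can be written as:

```python
import torch


def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: (num_prompts, group_size) rewards for G sampled completions per prompt.

    Each completion's advantage is its reward standardized against the other
    completions in the same group (zero mean, unit std per prompt).
    """
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)


def grpo_token_loss(logp_new, logp_old, logp_ref, advantages,
                    clip_eps: float = 0.2, kl_coef: float = 0.04):
    """Clipped PPO-style surrogate plus a KL penalty against the reference model.

    logp_*: per-token log-probabilities of the sampled completion tokens;
    advantages: per-token advantages (the group advantage broadcast over tokens).
    """
    ratio = (logp_new - logp_old).exp()
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    surrogate = torch.minimum(ratio * advantages, clipped * advantages)
    # Unbiased estimator of KL(pi_new || pi_ref), as used in the GRPO paper
    log_ratio_ref = logp_ref - logp_new
    kl = log_ratio_ref.exp() - log_ratio_ref - 1.0
    return -(surrogate - kl_coef * kl).mean()
```

In the full algorithm, each prompt's sampled completions are scored by the reward function, the resulting advantages are broadcast over each completion's tokens, and gradients of this loss update the policy (the LoRA parameters when LoRA is enabled).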
Recent research demonstrates that ES can successfully fine-tune LLMs with billions of parameters (Qiu et al., 2025), outperforming RL methods in sample efficiency, robustness, and stability, particularly on tasks with sparse, outcome-only rewards. ES is inspired by natural evolution. It:
- Perturbs model parameters with Gaussian noise
- Evaluates fitness using task-specific rewards
- Updates parameters based on relative performance (z-score normalization)
- Requires only forward passes (no backpropagation)
- Is less prone to reward hacking than RL methods
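A toy sketch of a single ES update on a flat parameter vector is shown below. The actual implementation operates on full model weights and may differ in details (e.g. antithetic sampling or seed-based noise reconstruction); `fitness_fn`, `pop_size`, `sigma`, and `lr` here are illustrative placeholders.

```python
import torch


@torch.no_grad()
def es_step(params: torch.Tensor, fitness_fn, pop_size: int = 30,
            sigma: float = 0.01, lr: float = 0.05) -> torch.Tensor:
    """One ES update on a flat parameter vector.

    fitness_fn maps a parameter vector to a scalar reward (float); only forward
    passes are needed, so no gradients flow through the model.
    """
    noise = torch.randn(pop_size, params.numel())
    rewards = torch.tensor([fitness_fn(params + sigma * eps) for eps in noise])
    # z-score normalize so the update depends only on relative performance
    normed = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    grad_estimate = (normed.unsqueeze(1) * noise).mean(dim=0) / sigma
    return params + lr * grad_estimate
```

Because only relative rankings within the population matter, the z-score normalization makes the update invariant to the scale of the reward.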
Contributions are welcome! Please open an issue or submit a pull request.
- DeepSeekMath: Shao et al., "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models", 2024. arXiv:2402.03300
- ES at Scale for LLMs: Qiu et al., "Evolution Strategies at Scale: LLM Fine-Tuning Beyond Reinforcement Learning", 2025. arXiv:2509.24372
- Salimans et al., "Evolution Strategies as a Scalable Alternative to Reinforcement Learning", 2017. arXiv:1703.03864
- Built with PyTorch, Transformers, Accelerate, and models from Hugging Face