Minimal-GRPO

A minimal and customizable implementation for fine-tuning Open Language Models (LLaMA, Qwen, etc.) on reasoning tasks with verifiable rewards, using two approaches:

  • GRPO (Group Relative Policy Optimization): Gradient-based RL policy optimization introduced by DeepSeekMath, with support for LoRA adaptation
  • ES (Evolution Strategies): Gradient-free evolutionary optimization with full-parameter fine-tuning, based on recent work by Qiu et al. (2025)

This repo currently includes the GSM8K (by OpenAI) and MathExpr datasets, and can be adapted to other tasks.

Features

  • Effective, minimal codebase: Easy to understand and modify
  • Two optimization strategies: Compare gradient-based (GRPO) vs gradient-free (ES) approaches
  • LoRA support for GRPO: Parameter-efficient fine-tuning with PEFT
  • Full parameter ES fine-tuning: Direct optimization in billion-parameter spaces
  • Flexible configuration: YAML-based configuration for both algorithms
  • Multiple datasets: GSM8K and MathExpr included, easily extensible
  • Custom rewards: Easily adapt to your own tasks and reward functions

Setup

  1. Clone the repository:
git clone https://github.com/Bharath2/Minimal-GRPO.git
cd Minimal-GRPO
  2. Install dependencies:
pip install -r requirements.txt
  3. Optional (recommended for efficiency):
pip install flash-attn --no-build-isolation

Quick Start

  1. Configure your training in grpo_config.yml or es_config.yml (an illustrative config sketch follows these steps)

  2. Run the corresponding training script:

python grpo_train.py
(or python es_train.py for the ES variant)
  3. Monitor with TensorBoard:
tensorboard --logdir=grpo_logs
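
For orientation, a GRPO config might look roughly like the sketch below. Every key and value here is an illustrative guess, not the repo's actual schema; consult the shipped grpo_config.yml for the real keys.

```yaml
# Hypothetical example only — see the repository's grpo_config.yml for real keys.
model_name: Qwen/Qwen2.5-1.5B-Instruct  # any HF causal LM
dataset: gsm8k                          # or mathexpr
group_size: 8                           # completions sampled per prompt
lr: 1.0e-6
kl_coef: 0.04
clip_eps: 0.2
lora:
  enabled: true
  r: 16
  alpha: 32
log_dir: grpo_logs                      # TensorBoard output directory
```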

Adapting to Your Own Tasks

To adapt this code to your own dataset and task:

  1. Implement your dataset in datasets.py (follow the GSM8K or MathExpr examples)
  2. Define your reward function in reward.py to match your task's success criteria (an example sketch follows this list)
  3. Adjust the system prompt in the training scripts (grpo_train.py or es_train.py) to match your task format
  4. Update the DataLoader in the training script to use your new dataset
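
As a concrete illustration of step 2, a verifiable reward for a math task can score an extracted final answer against the ground truth. The sketch below is hypothetical: the function name and signature are not the repo's actual interface, and the "#### <answer>" marker follows the GSM8K answer convention.

```python
import re

def exact_answer_reward(completion: str, target: str) -> float:
    """Hypothetical verifiable reward: small bonus for emitting a parsable
    '#### <answer>' line (GSM8K-style), full credit for a correct answer."""
    match = re.search(r"####\s*(-?[\d.,]+)", completion)
    if match is None:
        return 0.0                        # no parsable final answer
    predicted = match.group(1).replace(",", "")
    reward = 0.1                          # format bonus: marker present
    if predicted == target.strip():
        reward += 1.0                     # exact-match correctness
    return reward
```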

GRPO (Group Relative Policy Optimization)

GRPO is a gradient-based reinforcement learning algorithm that:

  • Uses group-based advantage estimation for stable policy updates
  • Combines PPO-style clipping with KL divergence regularization
  • Maintains a reference model to prevent catastrophic forgetting
  • Efficiently leverages gradient information for policy improvement
  • Supports LoRA (Low-Rank Adaptation) for parameter-efficient fine-tuning
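
Concretely, the core update can be sketched in a few lines of PyTorch. Everything below is illustrative, assuming per-token log-probabilities have already been gathered for a group of completions sampled from the same prompt; names, shapes, and hyperparameter defaults are assumptions, not the repo's actual interface.

```python
import torch

def grpo_loss(logp_new, logp_old, logp_ref, rewards,
              clip_eps=0.2, kl_coef=0.04):
    """Illustrative GRPO loss (padding masks omitted for brevity).

    logp_new / logp_old / logp_ref: (G, T) per-token log-probs for a group
    of G completions from one prompt; rewards: (G,) scalar rewards.
    """
    # Group-relative advantage: z-score the rewards within the group.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)  # (G,)
    adv = adv.unsqueeze(-1)                  # broadcast over the T tokens

    # PPO-style clipped surrogate on the per-token importance ratio.
    ratio = torch.exp(logp_new - logp_old)   # (G, T)
    surrogate = torch.minimum(
        ratio * adv,
        torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv)

    # Per-token KL penalty against the frozen reference model
    # (the unbiased k3 estimator used in the GRPO paper).
    kl = torch.exp(logp_ref - logp_new) - (logp_ref - logp_new) - 1

    return -(surrogate - kl_coef * kl).mean()
```

Because advantages are normalized within each sampled group, no learned value (critic) model is needed; this is GRPO's main simplification over PPO.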

ES (Evolution Strategies)

Recent research demonstrates that ES can successfully fine-tune LLMs with billions of parameters (Qiu et al., 2025), outperforming RL methods in sample efficiency, robustness, and stability, particularly on tasks with sparse, outcome-only rewards. Inspired by natural evolution, ES:

  • Perturbs model parameters with Gaussian noise
  • Evaluates fitness using task-specific rewards
  • Updates parameters based on relative performance (z-score normalization)
  • Requires only forward passes (no backpropagation)
  • Less prone to reward hacking than RL methods
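
As a rough illustration, one ES generation over a flattened parameter vector might look like the sketch below; the evaluate fitness function, population size, and step sizes are placeholders, not the repo's actual code.

```python
import torch

def es_step(params, evaluate, pop_size=32, sigma=1e-3, lr=1e-4):
    """One illustrative ES generation: perturb, evaluate, z-score, recombine.

    params: flat parameter tensor; evaluate(params) -> scalar reward.
    """
    noises, rewards = [], []
    for _ in range(pop_size):
        eps = torch.randn_like(params)                  # Gaussian perturbation
        rewards.append(evaluate(params + sigma * eps))  # forward passes only
        noises.append(eps)
    rewards = torch.tensor(rewards)

    # z-score normalization: the update depends only on relative performance.
    scores = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    # Weighted recombination of the noise directions (no backprop anywhere).
    grad_est = sum(s * n for s, n in zip(scores, noises)) / (pop_size * sigma)
    return params + lr * grad_est
```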

Contributing

Contributions are welcome! Please open an issue or submit a pull request.

References

  • Shao et al., "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models", 2024. arXiv:2402.03300
  • Qiu et al., "Evolution Strategies at Scale: LLM Fine-Tuning Beyond Reinforcement Learning", 2025. arXiv:2509.24372
  • Salimans et al., "Evolution Strategies as a Scalable Alternative to Reinforcement Learning", 2017. arXiv:1703.03864
