A variant of GRPO (Group Relative Policy Optimization) that uses softmax-weighted advantages for smoother and more stable training.
SR-GRPO modifies the standard GRPO algorithm by changing how advantages are computed within each group of completions. Instead of using simple mean-normalized advantages, SR-GRPO applies softmax weighting based on the rewards.
| Aspect | Standard GRPO | SR-GRPO |
|---|---|---|
| Advantage | (reward - mean) / std | Softmax-weighted sum |
| Weighting | Uniform | Higher weight on better completions |
| Temperature | N/A | Configurable via `tau` parameter |
For a group of K completions with rewards $r_1, \dots, r_K$, SR-GRPO proceeds as follows (a code sketch follows the list):

- Normalize rewards within the group:
  $$z_i = \frac{r_i - \mu}{\sigma + \epsilon}$$
- Compute softmax weights:
  $$w_i = \frac{\exp(z_i / \tau)}{\sum_j \exp(z_j / \tau)}$$
- Compute the soft advantage:
  $$A_{soft} = \sum_i w_i \cdot r_i$$
- Broadcast to all samples in the group: all K samples use the same $A_{soft}$ as their advantage.
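These four steps can be sketched in a few lines of PyTorch. This is an illustrative reimplementation of the description above, not the exact code in `sr_grpo_trainer.py`; the function name, the `eps` value, and the assumption that one group's rewards arrive as a 1-D tensor are my own.

```python
import torch

def soft_advantages(rewards: torch.Tensor, tau: float = 0.5, eps: float = 1e-4) -> torch.Tensor:
    """Soft advantage for one group of K completions (illustrative sketch).

    rewards: shape (K,), one scalar reward per completion.
    Returns shape (K,): every entry holds the same A_soft value.
    """
    # 1. Normalize rewards within the group.
    z = (rewards - rewards.mean()) / (rewards.std() + eps)
    # 2. Softmax weights; smaller tau gives sharper weights.
    w = torch.softmax(z / tau, dim=0)
    # 3. Soft advantage: weighted sum of the raw rewards.
    a_soft = (w * rewards).sum()
    # 4. Broadcast the same advantage to every sample in the group.
    return a_soft.expand_as(rewards)
```

Broadcasting the same value to all K samples is the key difference from standard GRPO, where each completion keeps its own normalized advantage.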
The temperature parameter τ controls how sharply the softmax weights concentrate on the best completions (a small numeric illustration follows this list):
- Lower τ (e.g., 0.1): Sharper weights, more focus on the best completions
- Higher τ (e.g., 1.0): Smoother weights, more uniform influence
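The standalone snippet below (with made-up normalized rewards) shows this effect concretely:

```python
import torch

# Made-up normalized rewards for a group of K = 4 completions.
z = torch.tensor([1.2, 0.3, -0.5, -1.0])

for tau in (0.1, 1.0):
    w = torch.softmax(z / tau, dim=0)
    print(f"tau={tau}: weights={[round(v, 3) for v in w.tolist()]}")

# tau=0.1 puts almost all of the weight on the best completion,
# while tau=1.0 spreads it much more evenly across the group.
```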
No additional installation required! This module inherits from the trl library's GRPOTrainer. Just ensure you have:
```bash
pip install trl transformers accelerate
pip install unsloth
```

The repository is laid out as follows:

```
grpo/
├── module/
├── sr_grpo_trainer.py     # SR-GRPO Trainer implementation
├── train_sr_grpo.py       # Example training script
├── eval_gsm8k.py          # GSM8K evaluation script
└── README.md              # This file
```
All parameters from `GRPOConfig` are supported, plus:

| Parameter | Type | Default | Description |
|---|---|---|---|
| `tau` | float | 0.5 | Temperature for softmax weighting |
Commonly used parameters inherited from `GRPOConfig` (see the configuration sketch after the table):

| Parameter | Type | Default | Description |
|---|---|---|---|
| `num_generations` | int | 8 | Number of completions per prompt |
| `beta` | float | 0.1 | KL penalty coefficient |
| `max_prompt_length` | int | 256 | Maximum prompt length |
| `max_completion_length` | int | 512 | Maximum completion length |
| `use_vllm` | bool | False | Use vLLM for fast generation |
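Assuming the config class exported by `sr_grpo_trainer.py` is called `SRGRPOConfig` and subclasses `GRPOConfig` (the class name and import path are guesses; check the module for the actual names), a configuration combining `tau` with the inherited parameters might look like this:

```python
# Hypothetical import path and class name; adjust to match sr_grpo_trainer.py.
from module.sr_grpo_trainer import SRGRPOConfig

config = SRGRPOConfig(
    output_dir="sr-grpo-gsm8k",
    tau=0.5,                    # SR-GRPO-specific: softmax temperature
    num_generations=8,          # completions per prompt
    beta=0.1,                   # KL penalty coefficient
    max_prompt_length=256,
    max_completion_length=512,
    use_vllm=False,             # set True to generate with vLLM
)
```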
SR-GRPO logs additional metrics compared to standard GRPO:
- `soft_advantage_mean`: Mean of computed soft advantages
- `soft_advantage_std`: Standard deviation of soft advantages
- `reward`: Mean reward across all completions
- `reward_std`: Standard deviation of rewards
- `kl`: KL divergence from reference model
- `completion_length`: Average completion length
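Since the trainer inherits from trl's `GRPOTrainer`, which builds on the standard `transformers` `Trainer`, these values end up in `trainer.state.log_history`. The loop below is one way to pull out the SR-GRPO-specific entries after training (a sketch that assumes `trainer` is the trained SR-GRPO trainer instance):

```python
# Assumes `trainer` is the SR-GRPO trainer instance and trainer.train() has finished.
for record in trainer.state.log_history:
    if "soft_advantage_mean" in record:
        print(
            f"step {record.get('step')}: "
            f"soft_advantage_mean={record['soft_advantage_mean']:.4f} "
            f"soft_advantage_std={record.get('soft_advantage_std')} "
            f"reward={record.get('reward')} "
            f"kl={record.get('kl')}"
        )
```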
See train_sr_grpo.py for a complete example training on the GSM8K math dataset.
```bash
python train_sr_grpo.py
```

Guidelines for choosing τ (a small sweep sketch follows this list):

- τ = 0.1 - 0.3: Aggressive weighting, focuses heavily on the best samples. Good when you have clear quality differences.
- τ = 0.5: Balanced weighting (default). Works well for most cases.
- τ = 1.0+: Smooth weighting, closer to standard GRPO. Good when reward signals are noisy.
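When it is unclear which regime applies, a short sweep over τ is a cheap way to decide. The sketch below reuses the hypothetical `SRGRPOConfig`/`SRGRPOTrainer` names from the configuration example above and assumes a reward function and dataset like the ones prepared in `train_sr_grpo.py`:

```python
# Hypothetical class names and arguments; adapt to the actual classes in
# sr_grpo_trainer.py and the reward function / dataset from train_sr_grpo.py.
from module.sr_grpo_trainer import SRGRPOConfig, SRGRPOTrainer

for tau in (0.1, 0.5, 1.0):
    config = SRGRPOConfig(
        output_dir=f"sr-grpo-tau-{tau}",
        tau=tau,
        max_steps=200,                        # short run, just for comparison
    )
    trainer = SRGRPOTrainer(
        model="Qwen/Qwen2.5-0.5B-Instruct",   # placeholder model id
        reward_funcs=reward_fn,               # assumed: reward function from train_sr_grpo.py
        args=config,
        train_dataset=train_dataset,          # assumed: GSM8K split from train_sr_grpo.py
    )
    trainer.train()
    # Compare the logged reward and soft_advantage_* metrics across the three runs.
```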