Principle: Unsloth GRPO Reinforcement Learning (unslothai/unsloth)
| Knowledge Sources | |
|---|---|
| Domains | Reinforcement_Learning, NLP, Optimization |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
A reinforcement learning algorithm that optimizes a language model's policy using group-relative advantage estimates computed from multiple sampled completions per prompt, without requiring a separate value model.
Description
Group Relative Policy Optimization (GRPO) is a variant of policy gradient methods designed specifically for language model training. Unlike PPO, which requires a trained value function (critic) to estimate advantages, GRPO estimates advantages by sampling multiple completions for each prompt and computing relative rewards within the group.
Key characteristics:
- No Value Model: Eliminates the need for a separate critic network, reducing memory and complexity.
- Group Advantage: For each prompt, generate K completions and compute advantages as normalized reward deviations from the group mean.
- KL Penalty: Includes a KL divergence penalty against the reference (initial) policy to prevent reward hacking.
- Memory Efficiency: Unsloth's implementation uses chunked gradient accumulation (unsloth_num_chunks) to process large batches without OOM.
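The group-advantage step above can be sketched in a few lines of plain Python. This is an illustrative sketch, not Unsloth's actual implementation; the function name and the `eps` guard against zero-variance groups are assumptions made here for clarity.

```python
# Sketch of group-relative advantage computation (illustrative only;
# `group_advantages` and `eps` are hypothetical names, not Unsloth API).
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-8):
    """Normalize one prompt's group of rewards to zero mean, unit std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps guards against division by zero when all rewards in the group tie
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Note that when every completion in a group receives the same reward, all advantages collapse to zero and that group contributes no gradient signal.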
GRPO is particularly effective for training reasoning capabilities (mathematical problem-solving, code generation) where correctness can be verified programmatically via reward functions.
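A programmatically verifiable reward can be as simple as an exact-match check on a final answer. The sketch below assumes a GSM8K-style `#### <answer>` convention for the final line; the function name and format are illustrative assumptions, not part of Unsloth.

```python
# Illustrative verifiable reward: 1.0 if the completion's final
# "#### <answer>" matches the reference answer, else 0.0.
# The "####" marker is a GSM8K-style convention assumed for this example.
import re

def correctness_reward(completion: str, answer: str) -> float:
    m = re.search(r"####\s*(-?[\d.,]+)", completion)
    if m is None:
        return 0.0  # no parseable final answer
    pred = m.group(1).replace(",", "")
    return 1.0 if pred == answer else 0.0
```

In practice such a binary correctness reward is often combined with softer shaping rewards (e.g. for output format) so early groups are not all zero.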
Usage
Use GRPO when training models for tasks with verifiable outcomes (math, code, logic puzzles). It requires defining reward functions that can score model completions. GRPO training is typically preceded by an SFT warmup phase and requires vLLM-enabled model loading for fast rollout generation.
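A typical setup pairs Unsloth's fast-inference model loading with TRL's GRPO trainer. The sketch below is a hedged starting point under stated assumptions: exact parameter names vary across TRL/Unsloth versions, the checkpoint name is an example, and `my_reward_fn` and `dataset` are hypothetical user-supplied objects.

```python
# Hedged sketch of a GRPO training setup with Unsloth + TRL.
# Parameter names may differ between library versions; verify against
# the installed versions' documentation before use.
from unsloth import FastLanguageModel
from trl import GRPOConfig, GRPOTrainer

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Llama-3.2-3B-Instruct",  # example checkpoint
    max_seq_length=1024,
    load_in_4bit=True,
    fast_inference=True,  # enables the vLLM backend for fast rollouts
)

trainer = GRPOTrainer(
    model=model,
    reward_funcs=[my_reward_fn],   # hypothetical user-defined reward function
    args=GRPOConfig(
        num_generations=8,         # K completions sampled per prompt
        max_completion_length=512,
        learning_rate=5e-6,
    ),
    train_dataset=dataset,         # prompts only; completions are sampled
)
trainer.train()
```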
Theoretical Basis
The GRPO objective for a prompt $q$ with a group of $K$ completions $\{o_1, \dots, o_K\}$ is:

$$
J_{\mathrm{GRPO}}(\theta) = \mathbb{E}\left[ \frac{1}{K} \sum_{i=1}^{K} \min\!\left( \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i \mid q)} \hat{A}_i,\; \mathrm{clip}\!\left(\frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i \mid q)},\, 1-\epsilon,\, 1+\epsilon\right) \hat{A}_i \right) \right] - \beta\, D_{\mathrm{KL}}\!\left(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\right)
$$

where the group-relative advantage is:

$$
\hat{A}_i = \frac{r_i - \mathrm{mean}(\{r_1, \dots, r_K\})}{\mathrm{std}(\{r_1, \dots, r_K\})}
$$
```python
# Abstract GRPO training step (pseudocode: mean, std, clipped_policy_gradient,
# and kl_penalty stand in for real implementations)
completions = model.fast_generate(prompts, n=num_generations)
rewards = [reward_fn(p, c) for p, c in zip(prompts, completions)]

# Group-relative advantage: normalize each reward against its prompt's group
for group in groups:
    mean_r = mean(group.rewards)
    std_r = std(group.rewards)
    group.advantages = [(r - mean_r) / (std_r + 1e-8) for r in group.rewards]

# Clipped policy gradient plus KL penalty against the reference policy
loss = clipped_policy_gradient(model, completions, advantages) + beta * kl_penalty
```
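The clipped term inside that loss can be made concrete for a single completion. This is a minimal numeric sketch under the assumption that per-sequence log-probabilities under the new and old policies are already available; the function name is hypothetical.

```python
# Minimal sketch of the PPO-style clipped surrogate term for one completion,
# given its log-prob under the new and old policy (names are illustrative).
import math

def clipped_term(logp_new, logp_old, advantage, eps=0.2):
    ratio = math.exp(logp_new - logp_old)          # importance ratio
    clipped = max(min(ratio, 1 + eps), 1 - eps)    # clip to [1-eps, 1+eps]
    # The objective keeps the pessimistic (min) of unclipped vs clipped terms
    return min(ratio * advantage, clipped * advantage)
```

The clip keeps any single update from moving the policy too far from the sampling policy, regardless of how large the advantage is.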