Principle: Unsloth GRPO Reinforcement Learning (unslothai/unsloth)
| Knowledge Sources | |
|---|---|
| Domains | Reinforcement_Learning, NLP, Optimization |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
A reinforcement learning algorithm that optimizes a language model's policy using group-relative advantage estimates computed from multiple sampled completions per prompt, without requiring a separate value model.
Description
Group Relative Policy Optimization (GRPO) is a variant of policy gradient methods designed specifically for language model training. Unlike PPO, which requires a trained value function (critic) to estimate advantages, GRPO estimates advantages by sampling multiple completions for each prompt and computing relative rewards within the group.
Key characteristics:
- No Value Model: Eliminates the need for a separate critic network, reducing memory and complexity.
- Group Advantage: For each prompt, generate K completions and compute advantages as normalized reward deviations from the group mean.
- KL Penalty: Includes a KL divergence penalty against the reference (initial) policy to prevent reward hacking.
- Memory Efficiency: Unsloth's implementation uses chunked gradient accumulation (unsloth_num_chunks) to process large batches without OOM.
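The group-advantage step above can be sketched in a few lines of plain Python. This is an illustrative sketch, not Unsloth's actual implementation; the function name and the `eps` guard against zero-variance groups are assumptions made here for clarity.

```python
# Sketch of group-relative advantage computation (illustrative only;
# `group_advantages` and `eps` are hypothetical names, not Unsloth API).
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-8):
    """Normalize one prompt's group of rewards to zero mean, unit std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps guards against division by zero when all rewards in the group tie
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Note that when every completion in a group receives the same reward, all advantages collapse to zero and that group contributes no gradient signal.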
GRPO is particularly effective for training reasoning capabilities (mathematical problem-solving, code generation) where correctness can be verified programmatically via reward functions.
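A programmatically verifiable reward can be as simple as an exact-match check on a final answer. The sketch below assumes a GSM8K-style `#### <answer>` convention for the final line; the function name and format are illustrative assumptions, not part of Unsloth.

```python
# Illustrative verifiable reward: 1.0 if the completion's final
# "#### <answer>" matches the reference answer, else 0.0.
# The "####" marker is a GSM8K-style convention assumed for this example.
import re

def correctness_reward(completion: str, answer: str) -> float:
    m = re.search(r"####\s*(-?[\d.,]+)", completion)
    if m is None:
        return 0.0  # no parseable final answer
    pred = m.group(1).replace(",", "")
    return 1.0 if pred == answer else 0.0
```

In practice such a binary correctness reward is often combined with softer shaping rewards (e.g. for output format) so early groups are not all zero.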
Usage
Use GRPO when training models for tasks with verifiable outcomes (math, code, logic puzzles). It requires defining reward functions that can score model completions. GRPO training is typically preceded by an SFT warmup phase and requires vLLM-enabled model loading for fast rollout generation.
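A typical setup pairs Unsloth's fast-inference model loading with TRL's GRPO trainer. The sketch below is a hedged starting point under stated assumptions: exact parameter names vary across TRL/Unsloth versions, the checkpoint name is an example, and `my_reward_fn` and `dataset` are hypothetical user-supplied objects.

```python
# Hedged sketch of a GRPO training setup with Unsloth + TRL.
# Parameter names may differ between library versions; verify against
# the installed versions' documentation before use.
from unsloth import FastLanguageModel
from trl import GRPOConfig, GRPOTrainer

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Llama-3.2-3B-Instruct",  # example checkpoint
    max_seq_length=1024,
    load_in_4bit=True,
    fast_inference=True,  # enables the vLLM backend for fast rollouts
)

trainer = GRPOTrainer(
    model=model,
    reward_funcs=[my_reward_fn],   # hypothetical user-defined reward function
    args=GRPOConfig(
        num_generations=8,         # K completions sampled per prompt
        max_completion_length=512,
        learning_rate=5e-6,
    ),
    train_dataset=dataset,         # prompts only; completions are sampled
)
trainer.train()
```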
Theoretical Basis
The GRPO objective for a prompt $q$ with a group of $K$ completions $\{o_1, \dots, o_K\}$ is:

$$
J_{\mathrm{GRPO}}(\theta) = \mathbb{E}\left[ \frac{1}{K} \sum_{i=1}^{K} \min\!\left( \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i \mid q)} \hat{A}_i,\; \mathrm{clip}\!\left(\frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i \mid q)},\, 1-\epsilon,\, 1+\epsilon\right) \hat{A}_i \right) \right] - \beta\, D_{\mathrm{KL}}\!\left(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\right)
$$

where the group-relative advantage is:

$$
\hat{A}_i = \frac{r_i - \mathrm{mean}(\{r_1, \dots, r_K\})}{\mathrm{std}(\{r_1, \dots, r_K\})}
$$
```python
# Abstract GRPO training step (pseudocode: mean, std, clipped_policy_gradient,
# and kl_penalty stand in for real implementations)
completions = model.fast_generate(prompts, n=num_generations)
rewards = [reward_fn(p, c) for p, c in zip(prompts, completions)]

# Group-relative advantage: normalize each reward against its prompt's group
for group in groups:
    mean_r = mean(group.rewards)
    std_r = std(group.rewards)
    group.advantages = [(r - mean_r) / (std_r + 1e-8) for r in group.rewards]

# Clipped policy gradient plus KL penalty against the reference policy
loss = clipped_policy_gradient(model, completions, advantages) + beta * kl_penalty
```
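The clipped term inside that loss can be made concrete for a single completion. This is a minimal numeric sketch under the assumption that per-sequence log-probabilities under the new and old policies are already available; the function name is hypothetical.

```python
# Minimal sketch of the PPO-style clipped surrogate term for one completion,
# given its log-prob under the new and old policy (names are illustrative).
import math

def clipped_term(logp_new, logp_old, advantage, eps=0.2):
    ratio = math.exp(logp_new - logp_old)          # importance ratio
    clipped = max(min(ratio, 1 + eps), 1 - eps)    # clip to [1-eps, 1+eps]
    # The objective keeps the pessimistic (min) of unclipped vs clipped terms
    return min(ratio * advantage, clipped * advantage)
```

The clip keeps any single update from moving the policy too far from the sampling policy, regardless of how large the advantage is.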