# Principle: AllenAI Open Instruct GRPO Loss
| Knowledge Sources | |
|---|---|
| Domains | Reinforcement Learning Optimization |
| Last Updated | 2026-02-07 00:00 GMT |
## Overview
The GRPO loss is a clipped policy gradient objective that uses group-relative advantage normalization to optimize language model policies without requiring a separate value function.
## Description
Group Relative Policy Optimization (GRPO) adapts the PPO clipping mechanism for language model RL training. Unlike standard PPO, which requires a learned value function (critic) to compute advantages, GRPO computes advantages by comparing the rewards of multiple completions sampled from the same prompt. This eliminates the critic entirely, reducing memory and compute requirements.
The loss function has two main variants implemented in Open Instruct:
- DAPO (Decoupled Clip and Dynamic sAmpling Policy Optimization): Uses asymmetric ("decoupled") clipping, where the upper and lower clip bounds can differ. This allows more aggressive exploration (a higher upper clip) while keeping downside updates conservative (lower clip).
- CISPO (Clipped Importance Sampling Policy Optimization): Clips the importance sampling ratio directly and multiplies by the log-probability, providing a REINFORCE-style gradient with bounded updates.
Both variants include an optional KL penalty term that penalizes the policy for diverging too far from a reference model, which helps prevent reward hacking and maintain generation quality.
## Usage
The GRPO loss is computed inside the training step for each mini-batch of packed sequences. It is the core optimization objective that drives policy improvement. The choice between DAPO and CISPO, and the tuning of clipping parameters, affects the stability-exploration tradeoff.
## Theoretical Basis

### DAPO Loss
The DAPO loss follows the PPO clipping framework with group-relative advantages:
```
ratio_t     = exp(log_pi_new(a_t | s_t) - log_pi_old(a_t | s_t))
L_unclipped = -advantage_t * ratio_t
L_clipped   = -advantage_t * clip(ratio_t, 1 - clip_lower, 1 + clip_higher)
L_DAPO      = max(L_unclipped, L_clipped)
```
When clip_lower = clip_higher = epsilon, this reduces to standard PPO-clip. The asymmetric clipping (clip_higher > clip_lower) is a key DAPO innovation that allows the policy to take larger steps in the direction of improvement while limiting regression.
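As a concrete sketch, the per-token DAPO loss above can be written in a few lines of NumPy. Function and parameter names, and the default clip values, are illustrative rather than Open Instruct's actual API:

```python
import numpy as np

def dapo_loss(new_logprobs, old_logprobs, advantages,
              clip_lower=0.2, clip_higher=0.28):
    """Per-token DAPO loss with asymmetric clipping (illustrative sketch).

    clip_higher > clip_lower lets the policy take larger steps in the
    improving direction while still limiting regression.
    """
    ratio = np.exp(new_logprobs - old_logprobs)
    unclipped = -advantages * ratio
    clipped = -advantages * np.clip(ratio, 1.0 - clip_lower, 1.0 + clip_higher)
    # Elementwise max of the two losses is the standard PPO-clip
    # pessimistic bound on the surrogate objective.
    return np.maximum(unclipped, clipped)
```

Setting `clip_lower = clip_higher` recovers standard PPO-clip, as noted above.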
### CISPO Loss
The CISPO variant clips the ratio directly without the max operation:
```
clipped_ratio = clip(ratio_t, max=1 + clip_higher)  # no lower bound
L_CISPO       = -advantage_t * clipped_ratio.detach() * log_pi_new(a_t | s_t)
```
The .detach() on the clipped ratio means gradients only flow through the log-probability, making this a REINFORCE-style estimator with bounded importance weights.
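A NumPy sketch of the CISPO loss value follows (names are illustrative). NumPy has no autodiff, so the `.detach()` is represented by a comment: in a framework like PyTorch, the clipped ratio would be detached so gradients flow only through `new_logprobs`:

```python
import numpy as np

def cispo_loss(new_logprobs, old_logprobs, advantages, clip_higher=0.28):
    """Per-token CISPO loss (illustrative sketch, forward value only)."""
    ratio = np.exp(new_logprobs - old_logprobs)
    clipped_ratio = np.minimum(ratio, 1.0 + clip_higher)  # upper bound only
    # clipped_ratio acts as a fixed, bounded importance weight; in an
    # autodiff framework it would carry .detach(), making this a
    # REINFORCE-style estimator on new_logprobs.
    return -advantages * clipped_ratio * new_logprobs
```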
### KL Penalty
When a reference model is loaded, a KL divergence penalty is added:
```
ref_diff   = clamp(log_pi_new - log_pi_ref, -40, 40)
kl         = estimate_kl(ref_diff, ratio)  # multiple estimators available (0-3)
total_loss = L_policy + beta * kl
```
The KL estimator selection (0, 1, 2, or 3) corresponds to different approximations of the KL divergence, with estimator 2 being the default. The clamping prevents numerical instability from extreme log-probability differences.
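The estimators can be sketched as follows. This follows the well-known k1/k2/k3 family of per-token KL approximations; the mapping to Open Instruct's 0-3 indexing is an assumption, and only three of the four variants are shown:

```python
import numpy as np

def kl_estimate(ref_diff, estimator=2):
    """Per-token estimates of KL(pi_new || pi_ref) (illustrative sketch).

    ref_diff = log_pi_new - log_pi_ref, already clamped upstream.
    The numbering below may not match Open Instruct's exact indices.
    """
    ref_diff = np.asarray(ref_diff, dtype=float)
    if estimator == 0:
        # Naive sample estimate: unbiased but high variance, can be negative.
        return ref_diff
    if estimator == 1:
        # Half squared difference: low variance but biased.
        return 0.5 * ref_diff ** 2
    if estimator == 2:
        # "k3" estimator: exp(-d) - 1 + d; unbiased, low variance,
        # and always non-negative.
        return np.expm1(-ref_diff) + ref_diff
    raise ValueError(f"unknown estimator: {estimator}")
```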
### Group-Relative Advantages
The advantages are computed per prompt group rather than requiring a value function:
```
For each prompt p with K completions:
    scores_p = [reward(completion_k) for k in range(K)]

    # Standard normalization:
    advantage_k = (scores_p[k] - mean(scores_p)) / (std(scores_p) + 1e-8)

    # Centered normalization (no std division):
    advantage_k = scores_p[k] - mean(scores_p)
```
This approach requires no additional model parameters and provides a natural baseline (the group mean) that reduces variance.
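Both normalization modes can be sketched in a few lines of NumPy; the function and flag names are illustrative:

```python
import numpy as np

def group_advantages(scores, normalize_std=True, eps=1e-8):
    """Advantages for one prompt group from per-completion rewards.

    The group mean serves as a parameter-free baseline; dividing by the
    group std (when enabled) additionally rescales the rewards.
    """
    scores = np.asarray(scores, dtype=float)
    centered = scores - scores.mean()
    if normalize_std:
        return centered / (scores.std() + eps)
    return centered
```

By construction the advantages in each group sum to zero, so completions are rewarded only relative to their siblings from the same prompt.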