Principle:Volcengine Verl GRPO Advantage Estimation
| Knowledge Sources | |
|---|---|
| Domains | Reinforcement_Learning, Policy_Optimization, Advantage_Estimation |
| Last Updated | 2026-02-07 14:00 GMT |
Overview
An advantage estimation method that normalizes rewards within groups of sampled responses to the same prompt, eliminating the need for a learned critic or value function.
Description
Group Relative Policy Optimization (GRPO) advantage estimation addresses a key limitation of standard PPO: the requirement for a separate critic (value function) network. Instead of learning a value baseline, GRPO generates multiple responses (a "group") per prompt and computes advantages by normalizing the rewards within each group using the group mean and standard deviation.
This approach has several benefits:
- Eliminates the need to train and maintain a critic model, which removes the critic's parameters, gradients, and optimizer state from GPU memory
- Provides a natural baseline through group statistics, reducing variance without bias from a learned value function
- Scales naturally with the number of samples per prompt (group size), with larger groups providing more stable estimates
The method was introduced alongside the DeepSeekMath model and has become a popular alternative to GAE-based PPO for LLM training.
Usage
Use GRPO advantage estimation when training language models with reinforcement learning and:
- A critic-free setup is preferred (saves GPU memory and compute)
- Rule-based or simple reward functions are available (no need for learned reward models)
- Multiple completions per prompt can be generated efficiently (group size >= 2)
In verl, GRPO is selected by setting algorithm.adv_estimator to grpo (the shipped default is gae). It is preferred over GAE when the reward signal is clear and a critic model is not needed.
Theoretical Basis
The GRPO advantage for token $t$ in response $i$ from group $g$ is computed as:
$$\hat{A}_{i,t} = \frac{r_i - \mu_g}{\sigma_g + \epsilon}$$
Where:
- $r_i$ is the total reward for response $i$
- $\mu_g$ is the mean reward across all responses in group $g$
- $\sigma_g$ is the standard deviation of rewards in group $g$
- $\epsilon$ is a small constant for numerical stability
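As a worked illustration of the definitions above, consider a hypothetical group of four responses with binary rewards. Here $r_i$ denotes the reward of response $i$, $\mu_g$ the group mean, and $\sigma_g$ the group standard deviation. Two assumptions are made for readability: the population form of the standard deviation is used (the text does not say whether the population or sample form applies), and $\epsilon$ is neglected because it is tiny compared with $\sigma_g$.

```latex
\begin{align*}
r &= (1,\ 0,\ 0,\ 1) \\
\mu_g &= \tfrac{1}{4}(1 + 0 + 0 + 1) = 0.5 \\
\sigma_g &= \sqrt{\tfrac{1}{4}\textstyle\sum_{i=1}^{4} (r_i - 0.5)^2} = \sqrt{0.25} = 0.5 \\
\hat{A}_{i,t} &\approx \frac{r_i - 0.5}{0.5}
  \quad\Rightarrow\quad (+1,\ -1,\ -1,\ +1)
\end{align*}
```

Responses that beat the group mean receive a positive advantage and the others a negative one, and the advantages within the group sum to zero.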
Key properties:
- The advantage is outcome-level — each token in a response gets the same advantage value (determined by the final reward)
- Normalization is performed per-group (per-prompt), not across the entire batch
- Standard deviation normalization can be optionally disabled via configuration
Pseudo-code:
# Abstract GRPO advantage computation
for each prompt group g:
    rewards_g = [reward(response_i) for response_i in group_g]
    mean_g = mean(rewards_g)
    std_g = std(rewards_g)
    for each response i in group_g:
        advantage_i = (rewards_g[i] - mean_g) / (std_g + epsilon)
        # Broadcast advantage to all tokens in response
        token_advantages[i, :] = advantage_i * response_mask[i, :]
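The pseudo-code can be made runnable with the Python standard library alone. The following is a minimal sketch, not verl's implementation: the function name `grpo_advantages`, its parameters, and the `normalize_by_std` flag are hypothetical names chosen for illustration, and the population standard deviation is an assumption of this sketch.

```python
from collections import defaultdict
from statistics import mean, pstdev


def grpo_advantages(rewards, group_ids, response_mask, eps=1e-6, normalize_by_std=True):
    """Return per-token advantages, one row per response.

    rewards:       one scalar reward per response
    group_ids:     prompt identifier per response (same id = same group)
    response_mask: per response, a list of 0/1 flags marking real tokens
    """
    # Collect the rewards belonging to each prompt group.
    grouped = defaultdict(list)
    for reward, gid in zip(rewards, group_ids):
        grouped[gid].append(reward)

    # Per-group mean and standard deviation (population std: an assumption).
    stats = {gid: (mean(rs), pstdev(rs)) for gid, rs in grouped.items()}

    token_advantages = []
    for reward, gid, mask in zip(rewards, group_ids, response_mask):
        mu, sigma = stats[gid]
        adv = reward - mu
        if normalize_by_std:
            adv /= sigma + eps
        # Every unmasked token of the response receives the same scalar.
        token_advantages.append([adv * m for m in mask])
    return token_advantages


# Two prompts: "p0" with four sampled responses, "p1" with two.
rewards = [1.0, 0.0, 0.0, 1.0, 2.0, 2.0]
group_ids = ["p0", "p0", "p0", "p0", "p1", "p1"]
response_mask = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [1, 1, 1], [1, 1, 0], [1, 1, 1]]
token_adv = grpo_advantages(rewards, group_ids, response_mask)
```

In this example both responses in group `p1` earn the same reward, so their advantages are zero and the group contributes no learning signal; `eps` only keeps the division well defined in that case. Passing `normalize_by_std=False` leaves the mean-centred rewards unscaled.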