Implementation:Volcengine Verl Compute GRPO Outcome Advantage
| Knowledge Sources | |
|---|---|
| Domains | Reinforcement_Learning, Advantage_Estimation |
| Last Updated | 2026-02-07 14:00 GMT |
Overview
Concrete tool for computing group-relative advantage estimates without a critic network, provided by the verl library.
Description
The compute_grpo_outcome_advantage function computes advantage estimates by normalizing rewards within each group of responses generated from the same prompt. Each response's advantage is its reward minus the group mean, optionally divided by the group standard deviation (controlled by norm_adv_by_std_in_grpo). Because the baseline comes from the group itself, no learned value function (critic) is needed.
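Written out, the computation for response $i$ in group $g$ can be sketched as follows. Two points are assumptions about the implementation rather than statements from the docstring: that the scalar outcome score $R_i$ is the sum of the response's token-level rewards, and that `epsilon` is added to the standard deviation in the denominator. Here $m_{i,t}$ denotes the entry of `response_mask` for token $t$.

```latex
\begin{aligned}
R_i &= \sum_{t} r_{i,t}, \\
\mu_g &= \frac{1}{|g|} \sum_{j \in g} R_j, \qquad
\sigma_g = \operatorname{std}\bigl(\{R_j\}_{j \in g}\bigr), \\
A_{i,t} &=
\begin{cases}
\dfrac{R_i - \mu_g}{\sigma_g + \epsilon}\, m_{i,t}
  & \text{if } \texttt{norm\_adv\_by\_std\_in\_grpo} \text{ is true}, \\[1ex]
(R_i - \mu_g)\, m_{i,t}
  & \text{otherwise}.
\end{cases}
\end{aligned}
```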
Usage
Select GRPO as the advantage estimator with algorithm.adv_estimator=grpo; the compute_advantage orchestrator in the training loop then calls this function automatically. Import it directly only when calling it outside that loop, for example in tests or custom training code.
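One way to picture this kind of name-based dispatch is a registry that maps an estimator name to a function. The sketch below is a toy stand-in, not verl's code: `ToyEstimator`, `toy_register`, `toy_get`, and `toy_grpo` are hypothetical names that only mimic the shape of the `@register_adv_est(AdvantageEstimator.GRPO)` decorator shown in the signature.

```python
from enum import Enum
from typing import Callable


class ToyEstimator(str, Enum):
    """Hypothetical stand-in for an estimator-name enum."""
    GRPO = "grpo"


# Maps an estimator name (e.g. "grpo") to the function that implements it.
_TOY_REGISTRY: dict[str, Callable] = {}


def toy_register(name):
    """Decorator that stores the wrapped function under `name`."""
    key = name.value if isinstance(name, Enum) else name

    def decorator(fn):
        _TOY_REGISTRY[key] = fn
        return fn

    return decorator


def toy_get(name):
    """Look up a registered function by its config string."""
    if name not in _TOY_REGISTRY:
        raise ValueError(f"unknown advantage estimator: {name}")
    return _TOY_REGISTRY[name]


@toy_register(ToyEstimator.GRPO)
def toy_grpo(scores):
    # Placeholder body: center the scores on their mean.
    mean = sum(scores) / len(scores)
    return [s - mean for s in scores]


# A config value such as adv_estimator="grpo" selects the function by name.
estimator = toy_get("grpo")
print(estimator([1.0, 0.0, 1.0, 0.0]))  # prints [0.5, -0.5, 0.5, -0.5]
```

With a registry of this kind, the training loop never needs to import an individual estimator; it only needs the name from the configuration.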
Code Reference
Source Location
- Repository: verl
- File: verl/trainer/ppo/core_algos.py
- Lines: 267-332
Signature
@register_adv_est(AdvantageEstimator.GRPO)
def compute_grpo_outcome_advantage(
    token_level_rewards: torch.Tensor,
    response_mask: torch.Tensor,
    index: np.ndarray,
    epsilon: float = 1e-6,
    norm_adv_by_std_in_grpo: bool = True,
    config: Optional[AlgoConfig] = None,
) -> tuple[torch.Tensor, torch.Tensor]:
    """
    Compute GRPO outcome-level advantages.

    Args:
        token_level_rewards: Per-token rewards (bs, response_length).
        response_mask: Binary mask for response tokens (bs, response_length).
        index: Group indices mapping responses to prompts (bs,).
        epsilon: Small constant for numerical stability.
        norm_adv_by_std_in_grpo: Whether to normalize by group std.
        config: Algorithm configuration.

    Returns:
        Tuple of (advantages, advantages) - same tensor returned twice for API compatibility.
    """
Import
from verl.trainer.ppo.core_algos import compute_grpo_outcome_advantage
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| token_level_rewards | torch.Tensor | Yes | Per-token rewards tensor of shape (batch_size, response_length) |
| response_mask | torch.Tensor | Yes | Binary mask indicating response tokens (batch_size, response_length) |
| index | np.ndarray | Yes | Group indices mapping each response to its prompt (batch_size,) |
| epsilon | float | No | Numerical stability constant (default: 1e-6) |
| norm_adv_by_std_in_grpo | bool | No | Whether to divide by group std (default: True) |
| config | Optional[AlgoConfig] | No | Algorithm configuration |
Outputs
| Name | Type | Description |
|---|---|---|
| advantages | torch.Tensor | Computed advantages (batch_size, response_length) |
| returns | torch.Tensor | Same as advantages (for API compatibility) |
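The output shape follows from spreading one scalar advantage per response over that response's tokens. The pure-Python sketch below illustrates this under the assumption that the scalar is applied wherever `response_mask` is 1 and zeroed elsewhere; `broadcast_to_tokens` is a hypothetical helper name, not part of verl.

```python
def broadcast_to_tokens(per_response_adv, response_mask):
    """Spread one scalar advantage per response over its unmasked tokens.

    per_response_adv: list of floats, length batch_size.
    response_mask:    batch_size rows of 0/1 flags, each of length response_length.
    Returns a (batch_size, response_length) nested list.
    """
    return [
        [adv if flag else 0.0 for flag in mask_row]
        for adv, mask_row in zip(per_response_adv, response_mask)
    ]


# Two responses of length 3; trailing positions are padding (mask = 0).
token_advantages = broadcast_to_tokens(
    [1.5, -0.5],
    [[1, 1, 0], [1, 0, 0]],
)
print(token_advantages)  # prints [[1.5, 1.5, 0.0], [-0.5, 0.0, 0.0]]
```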
Usage Examples
import torch
import numpy as np
from verl.trainer.ppo.core_algos import compute_grpo_outcome_advantage

# Simulated data: 2 prompts, 4 responses each (group size = 4)
batch_size = 8
response_length = 128
token_level_rewards = torch.zeros(batch_size, response_length)

# Outcome reward is placed on the final token of each response
token_level_rewards[0, -1] = 1.0  # Prompt 0, response 0: correct
token_level_rewards[1, -1] = 0.0  # Prompt 0, response 1: incorrect
token_level_rewards[2, -1] = 1.0  # Prompt 0, response 2: correct
token_level_rewards[3, -1] = 0.0  # Prompt 0, response 3: incorrect
token_level_rewards[4, -1] = 0.0  # Prompt 1, response 0: incorrect
token_level_rewards[5, -1] = 0.0  # Prompt 1, response 1: incorrect
token_level_rewards[6, -1] = 0.0  # Prompt 1, response 2: incorrect
token_level_rewards[7, -1] = 1.0  # Prompt 1, response 3: correct

# Every position counts as a response token in this toy batch
response_mask = torch.ones(batch_size, response_length)
index = np.array([0, 0, 0, 0, 1, 1, 1, 1])  # Group assignments

advantages, returns = compute_grpo_outcome_advantage(
    token_level_rewards=token_level_rewards,
    response_mask=response_mask,
    index=index,
    norm_adv_by_std_in_grpo=True,
)
# advantages and returns both have shape (batch_size, response_length)
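To sanity-check the call above without installing verl, the expected per-response values can be worked out by hand. The sketch below is not the library's code. It assumes that the group standard deviation is the sample (Bessel-corrected) one, as `torch.std` computes by default, that `epsilon` is added to the denominator, and that a group with a single response is left unnormalized. `group_relative_advantages` is a hypothetical helper name.

```python
from collections import defaultdict
from statistics import mean, stdev


def group_relative_advantages(scores, index, epsilon=1e-6, norm_by_std=True):
    """Hypothetical pure-Python sketch of group-relative normalization.

    scores: one scalar outcome score per response.
    index:  group id of each response (same length as scores).
    """
    groups = defaultdict(list)
    for score, gid in zip(scores, index):
        groups[gid].append(score)

    stats = {}
    for gid, vals in groups.items():
        if len(vals) == 1:
            # Assumption: a singleton group uses mean 0 and std 1.
            stats[gid] = (0.0, 1.0)
        else:
            stats[gid] = (mean(vals), stdev(vals))  # stdev divides by n - 1

    out = []
    for score, gid in zip(scores, index):
        mu, sigma = stats[gid]
        centered = score - mu
        out.append(centered / (sigma + epsilon) if norm_by_std else centered)
    return out


# Same outcome scores and grouping as the usage example above.
scores = [1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0]
index = [0, 0, 0, 0, 1, 1, 1, 1]
advantages = group_relative_advantages(scores, index)
print([round(a, 3) for a in advantages])
# prints [0.866, -0.866, 0.866, -0.866, -0.5, -0.5, -0.5, 1.5]
```

Under these assumptions, the single correct response in the second group receives the largest advantage (about 1.5) because it stands out most from its group mean of 0.25.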