Implementation:Volcengine Verl Compute GRPO Outcome Advantage

Knowledge Sources	verl, DeepSeekMath
Domains	Reinforcement_Learning, Advantage_Estimation
Last Updated	2026-02-07 14:00 GMT

Overview

Concrete tool for computing group-relative advantage estimates without a critic network, provided by the verl library.

Description

The compute_grpo_outcome_advantage function computes advantage estimates by normalizing rewards within groups of responses generated from the same prompt. For each response, the advantage is its reward minus the group mean, optionally divided by the group standard deviation plus epsilon (controlled by norm_adv_by_std_in_grpo). Because the baseline comes from the group itself, no learned value function (critic) is needed.
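The computation described above can be written compactly as follows. This is a sketch rather than the library's exact definition, and the symbols are introduced here for illustration: $r_{i,t}$ is token_level_rewards, $m_{i,t}$ is response_mask, $g(i)$ is the group given by index, and $\mu$ and $\sigma$ are the mean and standard deviation of the outcome rewards within a group. It assumes the outcome reward $R_i$ is the sum of a response's token-level rewards and that the scalar advantage is applied to every unmasked token.

```latex
R_i = \sum_t r_{i,t}, \qquad
A_{i,t} =
\begin{cases}
\dfrac{R_i - \mu_{g(i)}}{\sigma_{g(i)} + \epsilon}\, m_{i,t}
  & \text{if norm\_adv\_by\_std\_in\_grpo is True} \\[2ex]
\left(R_i - \mu_{g(i)}\right) m_{i,t}
  & \text{otherwise}
\end{cases}
```

Under this form, a group whose rewards are all equal has a zero numerator, so its advantages are 0 and epsilon only guards the division.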

Usage

Use this function when GRPO is the advantage estimator (algorithm.adv_estimator=grpo). The compute_advantage orchestrator calls it automatically in the training loop, so a direct import is typically needed only when calling it outside the trainer.
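The decorator-plus-orchestrator pattern mentioned above can be pictured with a toy registry. This is a hypothetical sketch: ADV_ESTIMATORS, register, toy_grpo, and compute_advantage_toy are names invented here and are not verl's API; it only shows how a configuration string such as "grpo" might select an estimator.

```python
from typing import Callable, Dict, List

# Hypothetical registry for illustration only; not verl's actual data structure.
ADV_ESTIMATORS: Dict[str, Callable[[List[float]], List[float]]] = {}


def register(name: str):
    """Return a decorator that stores the function under `name`."""
    def decorator(fn):
        ADV_ESTIMATORS[name] = fn
        return fn
    return decorator


@register("grpo")
def toy_grpo(rewards: List[float]) -> List[float]:
    """Toy estimator: subtract the mean of a single group."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


def compute_advantage_toy(adv_estimator: str, rewards: List[float]) -> List[float]:
    """Look up the estimator named in the config and call it."""
    if adv_estimator not in ADV_ESTIMATORS:
        raise ValueError(f"unknown advantage estimator: {adv_estimator!r}")
    return ADV_ESTIMATORS[adv_estimator](rewards)


result = compute_advantage_toy("grpo", [1.0, 0.0, 1.0, 0.0])
print(result)  # [0.5, -0.5, 0.5, -0.5]
```

The point of the pattern is that the training loop depends only on the estimator name, so switching estimators is a configuration change rather than a code change.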

Code Reference

Source Location

Repository: verl
File: verl/trainer/ppo/core_algos.py
Lines: 267-332

Signature

@register_adv_est(AdvantageEstimator.GRPO)
def compute_grpo_outcome_advantage(
    token_level_rewards: torch.Tensor,
    response_mask: torch.Tensor,
    index: np.ndarray,
    epsilon: float = 1e-6,
    norm_adv_by_std_in_grpo: bool = True,
    config: Optional[AlgoConfig] = None,
) -> tuple[torch.Tensor, torch.Tensor]:
    """
    Compute GRPO outcome-level advantages.

    Args:
        token_level_rewards: Per-token rewards (bs, response_length).
        response_mask: Binary mask for response tokens (bs, response_length).
        index: Group indices mapping responses to prompts (bs,).
        epsilon: Small constant for numerical stability.
        norm_adv_by_std_in_grpo: Whether to normalize by group std.
        config: Algorithm configuration.

    Returns:
        Tuple of (advantages, advantages) - same tensor returned twice for API compatibility.
    """

Import

from verl.trainer.ppo.core_algos import compute_grpo_outcome_advantage

I/O Contract

Inputs

Name	Type	Required	Description
token_level_rewards	torch.Tensor	Yes	Per-token rewards tensor of shape (batch_size, response_length)
response_mask	torch.Tensor	Yes	Binary mask indicating response tokens (batch_size, response_length)
index	np.ndarray	Yes	Group indices mapping each response to its prompt (batch_size,)
epsilon	float	No	Numerical stability constant (default: 1e-6)
norm_adv_by_std_in_grpo	bool	No	Whether to divide by group std (default: True)
config	Optional[AlgoConfig]	No	Algorithm configuration

Outputs

Name	Type	Description
advantages	torch.Tensor	Computed advantages (batch_size, response_length)
returns	torch.Tensor	Same as advantages (for API compatibility)
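The shape contract above (per-token rewards in, per-token advantages out) can be illustrated with a small sketch. Plain Python lists stand in for tensors so that it runs without torch, and the helper names are hypothetical. It assumes the outcome score is the sum of a response's token-level rewards and that masked (padding) positions receive an advantage of 0.

```python
def outcome_scores_from_tokens(token_level_rewards):
    """Collapse per-token rewards into one outcome score per response."""
    return [sum(row) for row in token_level_rewards]


def broadcast_to_tokens(scalar_advantages, response_mask):
    """Repeat each scalar advantage over its response tokens; padding gets 0.0."""
    return [
        [advantage if keep else 0.0 for keep in mask_row]
        for advantage, mask_row in zip(scalar_advantages, response_mask)
    ]


# Two responses with response_length = 4; the second ends in one padding slot.
token_level_rewards = [[0.0, 0.0, 0.0, 1.0], [0.0, 0.0, 0.0, 0.0]]
response_mask = [[1, 1, 1, 1], [1, 1, 1, 0]]

outcome_scores = outcome_scores_from_tokens(token_level_rewards)
print(outcome_scores)  # [1.0, 0.0]

# Mean-centred within a single group of two: 1.0 - 0.5 and 0.0 - 0.5.
scalar_advantages = [score - 0.5 for score in outcome_scores]
token_advantages = broadcast_to_tokens(scalar_advantages, response_mask)
print(token_advantages)
# [[0.5, 0.5, 0.5, 0.5], [-0.5, -0.5, -0.5, 0.0]]
```

Every unmasked token of a response carries the same value, which is why the output keeps the (batch_size, response_length) shape even though only one number per response is computed.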

Usage Examples

import torch
import numpy as np
from verl.trainer.ppo.core_algos import compute_grpo_outcome_advantage

# Simulated data: 2 prompts, 4 responses each (group size = 4)
batch_size = 8
response_length = 128

token_level_rewards = torch.zeros(batch_size, response_length)
# Set final token rewards
token_level_rewards[0, -1] = 1.0   # Prompt 0, response 0: correct
token_level_rewards[1, -1] = 0.0   # Prompt 0, response 1: incorrect
token_level_rewards[2, -1] = 1.0   # Prompt 0, response 2: correct
token_level_rewards[3, -1] = 0.0   # Prompt 0, response 3: incorrect
token_level_rewards[4, -1] = 0.0   # Prompt 1, response 0: incorrect
token_level_rewards[5, -1] = 0.0   # Prompt 1, response 1: incorrect
token_level_rewards[6, -1] = 0.0   # Prompt 1, response 2: incorrect
token_level_rewards[7, -1] = 1.0   # Prompt 1, response 3: correct

response_mask = torch.ones(batch_size, response_length)
index = np.array([0, 0, 0, 0, 1, 1, 1, 1])  # Group assignments

advantages, returns = compute_grpo_outcome_advantage(
    token_level_rewards=token_level_rewards,
    response_mask=response_mask,
    index=index,
    norm_adv_by_std_in_grpo=True,
)
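To sanity-check the numbers one might expect from the example above, the per-response scalar advantage can be worked out by hand with the standard library. This is a sketch under stated assumptions, not the library's code: group_relative_advantages is a hypothetical helper, the group standard deviation is taken to be the sample (Bessel-corrected) one, and the fallback for a group of one is this sketch's own choice.

```python
from collections import defaultdict
from statistics import mean, stdev


def group_relative_advantages(rewards, groups, epsilon=1e-6, normalize_by_std=True):
    """Scalar advantage per response: (reward - group mean) / (group std + epsilon)."""
    by_group = defaultdict(list)
    for reward, group in zip(rewards, groups):
        by_group[group].append(reward)

    advantages = []
    for reward, group in zip(rewards, groups):
        members = by_group[group]
        centered = reward - mean(members)
        if normalize_by_std:
            # Assumption: sample std. A group of one has no spread, so this
            # sketch falls back to std = 1 instead of raising.
            spread = stdev(members) if len(members) > 1 else 1.0
            centered /= spread + epsilon
        advantages.append(centered)
    return advantages


# Outcome rewards and group ids taken from the example above.
rewards = [1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0]
groups = [0, 0, 0, 0, 1, 1, 1, 1]

scalar_advantages = group_relative_advantages(rewards, groups)
print([round(a, 3) for a in scalar_advantages])
# [0.866, -0.866, 0.866, -0.866, -0.5, -0.5, -0.5, 1.5]
```

In group 1 the single correct response gets a larger advantage (about 1.5) than the correct responses in group 0 (about 0.866), because being correct is rarer within its group.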

Related Pages

Implements Principle

Principle:Volcengine_Verl_GRPO_Advantage_Estimation
