Implementation:Volcengine Verl Compute GRPO Outcome Advantage


Knowledge Sources
Domains: Reinforcement_Learning, Advantage_Estimation
Last Updated: 2026-02-07 14:00 GMT

Overview

Concrete tool for computing group-relative advantage estimates without a critic network, provided by the verl library.

Description

The compute_grpo_outcome_advantage function computes advantage estimates by normalizing rewards within each group of responses generated from the same prompt. For every response, the advantage is its reward minus the group mean, optionally divided by the group standard deviation (controlled by norm_adv_by_std_in_grpo, with epsilon added for numerical stability). Because the baseline comes from the group itself, no learned value function (critic) is needed.
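The normalization described above can be sketched in plain Python. This is a minimal illustration, not verl's actual code: the helper name group_relative_advantages is hypothetical, it works on one scalar reward per response, and it assumes the sample standard deviation and a neutral mean of 0 and std of 1 for groups that contain a single response.

```python
from collections import defaultdict
from statistics import mean, stdev


def group_relative_advantages(scores, index, epsilon=1e-6, normalize_by_std=True):
    """Hypothetical helper: normalize scalar rewards within each prompt group."""
    groups = defaultdict(list)
    for score, group_id in zip(scores, index):
        groups[group_id].append(score)

    stats = {}
    for group_id, members in groups.items():
        if len(members) == 1:
            # Assumption of this sketch: a lone response has no peers to
            # compare against, so its reward is left unshifted and
            # effectively unscaled.
            stats[group_id] = (0.0, 1.0)
        else:
            stats[group_id] = (mean(members), stdev(members))

    advantages = []
    for score, group_id in zip(scores, index):
        group_mean, group_std = stats[group_id]
        centered = score - group_mean
        if normalize_by_std:
            # epsilon keeps the division finite when the group std is zero.
            centered = centered / (group_std + epsilon)
        advantages.append(centered)
    return advantages


scores = [2.0, 4.0, 6.0, 1.0, 1.0]
index = ["a", "a", "a", "b", "b"]
advantages = group_relative_advantages(scores, index)
print(advantages)
```

Group "b" shows why epsilon matters: all of its rewards are identical, so the standard deviation is zero, and epsilon keeps the division well defined so that both advantages come out as 0.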

Usage

This function is used when GRPO is selected as the advantage estimator (algorithm.adv_estimator=grpo). In the training loop it is called automatically by the compute_advantage orchestrator, so import it directly only when you need to call it outside that loop.
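One common way to map an estimator name to a function is a decorator-based registry, which the @register_adv_est decorator on this function suggests. The sketch below illustrates that general pattern only. The names ADV_ESTIMATORS, register_estimator, toy_grpo, and compute_advantage_sketch are hypothetical and do not describe verl's internals.

```python
from typing import Callable, Dict, List

# Hypothetical registry mapping an estimator name to its function.
ADV_ESTIMATORS: Dict[str, Callable[..., List[float]]] = {}


def register_estimator(name: str):
    """Return a decorator that stores the wrapped function under `name`."""
    def decorator(fn):
        ADV_ESTIMATORS[name] = fn
        return fn
    return decorator


@register_estimator("grpo")
def toy_grpo(scores: List[float]) -> List[float]:
    # Toy stand-in: subtract the mean of a single group of rewards.
    baseline = sum(scores) / len(scores)
    return [s - baseline for s in scores]


def compute_advantage_sketch(adv_estimator: str, scores: List[float]) -> List[float]:
    """Look up the estimator named in the config and call it."""
    if adv_estimator not in ADV_ESTIMATORS:
        raise ValueError(f"unknown advantage estimator: {adv_estimator}")
    return ADV_ESTIMATORS[adv_estimator](scores)


result = compute_advantage_sketch("grpo", [1.0, 0.0, 1.0, 0.0])
print(result)
```

With a registry like this, switching estimators is a configuration change rather than a code change, which is consistent with selecting GRPO through a single config key.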

Code Reference

Source Location

  • Repository: verl
  • File: verl/trainer/ppo/core_algos.py
  • Lines: 267-332

Signature

@register_adv_est(AdvantageEstimator.GRPO)
def compute_grpo_outcome_advantage(
    token_level_rewards: torch.Tensor,
    response_mask: torch.Tensor,
    index: np.ndarray,
    epsilon: float = 1e-6,
    norm_adv_by_std_in_grpo: bool = True,
    config: Optional[AlgoConfig] = None,
) -> tuple[torch.Tensor, torch.Tensor]:
    """
    Compute GRPO outcome-level advantages.

    Args:
        token_level_rewards: Per-token rewards (bs, response_length).
        response_mask: Binary mask for response tokens (bs, response_length).
        index: Group indices mapping responses to prompts (bs,).
        epsilon: Small constant for numerical stability.
        norm_adv_by_std_in_grpo: Whether to normalize by group std.
        config: Algorithm configuration.

    Returns:
        Tuple of (advantages, advantages) - same tensor returned twice for API compatibility.
    """

Import

from verl.trainer.ppo.core_algos import compute_grpo_outcome_advantage

I/O Contract

Inputs

Name | Type | Required | Description
token_level_rewards | torch.Tensor | Yes | Per-token rewards tensor of shape (batch_size, response_length)
response_mask | torch.Tensor | Yes | Binary mask indicating response tokens, shape (batch_size, response_length)
index | np.ndarray | Yes | Group indices mapping each response to its prompt, shape (batch_size,)
epsilon | float | No | Numerical stability constant (default: 1e-6)
norm_adv_by_std_in_grpo | bool | No | Whether to divide by the group standard deviation (default: True)
config | Optional[AlgoConfig] | No | Algorithm configuration (default: None)

Outputs

Name | Type | Description
advantages | torch.Tensor | Computed advantages, shape (batch_size, response_length)
returns | torch.Tensor | Same tensor as advantages (returned for API compatibility)
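The shapes in the tables above imply a conversion from per-token rewards to one value per response and back to per-token advantages. The sketch below shows one way this can work, assuming (this is not stated above) that each response's reward is the sum of its token-level rewards and that the resulting scalar advantage is copied to every unmasked token. It uses nested lists in place of tensors, and both helper names are hypothetical.

```python
def outcome_scores(token_level_rewards):
    """Collapse per-token rewards to one scalar per response (assumed: sum)."""
    return [sum(row) for row in token_level_rewards]


def broadcast_to_tokens(per_response_advantages, response_mask):
    """Copy each scalar advantage to every token where the mask is 1."""
    return [
        [adv * m for m in mask_row]
        for adv, mask_row in zip(per_response_advantages, response_mask)
    ]


# Two responses of length 4; the second is padded at its last position.
token_level_rewards = [
    [0.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, 0.0, 0.0],
]
response_mask = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
]

scores = outcome_scores(token_level_rewards)  # [1.0, 0.0]
# Mean-centering only, to keep the focus on the shape handling.
centered = [s - sum(scores) / len(scores) for s in scores]  # [0.5, -0.5]
token_advantages = broadcast_to_tokens(centered, response_mask)
print(token_advantages)
```

Multiplying by the mask zeroes the padded position, so tokens outside the response contribute nothing to a downstream loss.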

Usage Examples

import torch
import numpy as np
from verl.trainer.ppo.core_algos import compute_grpo_outcome_advantage

# Simulated data: 2 prompts, 4 responses each (group size = 4)
batch_size = 8
response_length = 128

token_level_rewards = torch.zeros(batch_size, response_length)
# Outcome rewards are placed on the final token of each response
token_level_rewards[0, -1] = 1.0   # Prompt 0, response 0: correct
token_level_rewards[1, -1] = 0.0   # Prompt 0, response 1: incorrect
token_level_rewards[2, -1] = 1.0   # Prompt 0, response 2: correct
token_level_rewards[3, -1] = 0.0   # Prompt 0, response 3: incorrect
token_level_rewards[4, -1] = 0.0   # Prompt 1, response 0: incorrect
token_level_rewards[5, -1] = 0.0   # Prompt 1, response 1: incorrect
token_level_rewards[6, -1] = 0.0   # Prompt 1, response 2: incorrect
token_level_rewards[7, -1] = 1.0   # Prompt 1, response 3: correct

response_mask = torch.ones(batch_size, response_length)
index = np.array([0, 0, 0, 0, 1, 1, 1, 1])  # Group assignments

advantages, returns = compute_grpo_outcome_advantage(
    token_level_rewards=token_level_rewards,
    response_mask=response_mask,
    index=index,
    norm_adv_by_std_in_grpo=True,
)
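For the rewards in this example, the per-response advantages can be worked out by hand. The arithmetic below assumes the sample standard deviation (Bessel's correction, the default of torch.std), which is an assumption about the implementation, and it ignores the small epsilon.

```latex
\begin{aligned}
\text{Prompt 0: } r &= (1, 0, 1, 0), & \mu &= 0.5, & \sigma &= \sqrt{\tfrac{4 \cdot 0.5^2}{3}} \approx 0.577, & A &\approx (0.866, -0.866, 0.866, -0.866) \\
\text{Prompt 1: } r &= (0, 0, 0, 1), & \mu &= 0.25, & \sigma &= \sqrt{\tfrac{3 \cdot 0.25^2 + 0.75^2}{3}} = 0.5, & A &= (-0.5, -0.5, -0.5, 1.5)
\end{aligned}
```

The single correct response for Prompt 1 receives a larger advantage (1.5) than either correct response for Prompt 0 (about 0.866), because a success is rarer within its group.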

Related Pages

Implements Principle
