
Implementation:Alibaba ROLL Compute Advantage

From Leeroopedia


Knowledge Sources
Domains Reinforcement_Learning, Optimization
Last Updated 2026-02-07 20:00 GMT

Overview

Concrete advantage estimation and KL penalty utility functions provided by the Alibaba ROLL library.

Description

The reward_postprocess, compute_token_reward, and compute_advantage functions in roll/utils/functionals.py implement the full advantage estimation pipeline. reward_postprocess normalizes and clips response-level rewards; compute_token_reward then adds token-level KL divergence penalties; finally, compute_advantage computes per-token advantages using a configurable estimator (GAE, GRPO, Reinforce++, and variants).

Usage

These functions are called sequentially in the RLVR pipeline's training loop after reward computation and before the policy optimization step.

Code Reference

Source Location

  • Repository: Alibaba ROLL
  • File: roll/utils/functionals.py
  • Lines: L596-830

Signature

def compute_token_reward(
    data: DataProto,
    pipeline_config: PPOConfig,
    kl_ctrl: AdaptiveKLController
) -> Tuple[DataProto, Dict[str, float]]:
    """
    Compute token-level rewards with KL divergence penalty.

    Args:
        data: DataProto with response_level_rewards, old_log_probs, ref_log_probs
        pipeline_config: Config with KL penalty settings
        kl_ctrl: Adaptive KL controller

    Returns:
        (Modified DataProto with token_level_rewards, metrics dict)
    """

@torch.no_grad()
def reward_postprocess(
    data: DataProto,
    pipeline_config: RLVRConfig,
    running_ctrl
) -> Tuple[DataProto, Dict[str, float]]:
    """
    Post-process response-level rewards with normalization and clipping.

    Args:
        data: DataProto with response_level_rewards
        pipeline_config: RLVR config with normalization settings
        running_ctrl: Running statistics controller

    Returns:
        (Modified DataProto with normalized rewards, metrics dict)
    """

def compute_advantage(
    data: DataProto,
    gamma: float,
    lambd: float,
    adv_estimator: str,
    advantage_clip: Optional[float] = None,
    whiten_advantages: bool = False,
    whiten_rewards: bool = False,
    response_mask: Optional[torch.Tensor] = None,
) -> DataProto:
    """
    Compute advantages and returns for policy gradient methods.

    Args:
        data: DataProto with token_level_rewards, optionally values
        gamma: Discount factor
        lambd: GAE lambda
        adv_estimator: "gae", "reinforce", "grpo", "gigpo", "step_reinforce"
        advantage_clip: Clip advantages to [-clip, clip]
        whiten_advantages: Apply whitening to advantages
        whiten_rewards: Apply whitening to rewards
        response_mask: Mask for response tokens

    Returns:
        DataProto with advantages, returns, raw_advantages
    """

Import

from roll.utils.functionals import compute_advantage, compute_token_reward, reward_postprocess

I/O Contract

Inputs

  • data (DataProto, required): Batch with response_level_rewards, old_log_probs, ref_log_probs, response_mask
  • gamma (float, required): Discount factor (typically 1.0 for RLVR)
  • lambd (float, required): GAE lambda parameter
  • adv_estimator (str, required): Advantage estimation method ("gae", "grpo", "reinforce")
  • pipeline_config (PPOConfig, required): Configuration with KL penalty and normalization settings
  • kl_ctrl (AdaptiveKLController, required): Adaptive KL coefficient controller

Outputs

  • advantages (torch.Tensor): Per-token advantage estimates
  • returns (torch.Tensor): Per-token return estimates
  • token_level_rewards (torch.Tensor): KL-penalized token-level rewards
  • metrics (Dict[str, float]): KL statistics, clip fractions, reward statistics

Usage Examples

Full Advantage Pipeline

from roll.utils.functionals import compute_advantage, compute_token_reward, reward_postprocess

# Step 1: Post-process response-level rewards
data, reward_metrics = reward_postprocess(data, pipeline_config, running_ctrl)

# Step 2: Add token-level KL penalty
data, kl_metrics = compute_token_reward(data, pipeline_config, kl_ctrl)

# Step 3: Compute advantages
data = compute_advantage(
    data=data,
    gamma=1.0,
    lambd=1.0,
    adv_estimator="grpo",
    advantage_clip=5.0,
    whiten_advantages=True,
)

# Access results
advantages = data.batch["advantages"]
returns = data.batch["returns"]
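The example above uses the "grpo" estimator. When adv_estimator="gae" is chosen instead, data must also carry a values tensor from a critic. A minimal per-token GAE recursion, shown here as a sketch rather than ROLL's exact implementation, looks like this:

```python
import torch

def gae_sketch(token_rewards, values, gamma, lambd):
    # token_rewards, values: (batch, T)
    T = token_rewards.size(-1)
    advantages = torch.zeros_like(token_rewards)
    last_gae = 0.0
    # backward recursion: A_t = delta_t + gamma * lambda * A_{t+1}
    for t in reversed(range(T)):
        next_value = values[..., t + 1] if t < T - 1 else 0.0
        delta = token_rewards[..., t] + gamma * next_value - values[..., t]
        last_gae = delta + gamma * lambd * last_gae
        advantages[..., t] = last_gae
    # returns are advantages re-centered on the value baseline
    returns = advantages + values
    return advantages, returns
```

With gamma=1.0, lambd=1.0, and a zero value baseline, this degenerates to reversed cumulative sums of the token rewards, which matches the RLVR defaults shown in the pipeline example.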
