Implementation:Alibaba ROLL Compute Advantage
| Knowledge Sources | |
|---|---|
| Domains | Reinforcement_Learning, Optimization |
| Last Updated | 2026-02-07 20:00 GMT |
Overview
Concrete advantage estimation and KL penalty utility functions provided by the Alibaba ROLL library.
Description
The compute_advantage, compute_token_reward, and reward_postprocess functions in roll/utils/functionals.py implement the full advantage estimation pipeline. reward_postprocess normalizes and clips response-level rewards. compute_token_reward adds token-level KL divergence penalties. compute_advantage computes the final per-token advantages using configurable estimators (GAE, GRPO, Reinforce++).
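To make the group-relative idea behind the "grpo" estimator concrete, here is a generic sketch of GRPO-style advantage computation (the function name, shapes, and `eps` are illustrative assumptions, not ROLL's actual code): each response's scalar reward is normalized by the mean and standard deviation of the other responses sampled for the same prompt.

```python
import torch

def grpo_advantages(rewards: torch.Tensor, group_size: int, eps: float = 1e-6) -> torch.Tensor:
    """Generic GRPO-style sketch (not ROLL's implementation):
    normalize each response reward by the mean/std of its prompt group.

    rewards: shape (num_prompts * group_size,) of scalar response rewards,
             laid out group-by-group.
    Returns a tensor of the same shape with per-response advantages.
    """
    grouped = rewards.view(-1, group_size)          # (num_prompts, group_size)
    mean = grouped.mean(dim=1, keepdim=True)
    std = grouped.std(dim=1, keepdim=True)          # per-group std (Bessel-corrected)
    adv = (grouped - mean) / (std + eps)
    return adv.view(-1)

# Example: two prompts, three sampled responses each
rewards = torch.tensor([1.0, 0.0, 0.5, 2.0, 2.0, 2.0])
adv = grpo_advantages(rewards, group_size=3)
```

In an actual pipeline the resulting per-response advantage is then broadcast to every token of that response.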
Usage
These functions are called sequentially in the RLVR pipeline's training loop after reward computation and before the policy optimization step.
Code Reference
Source Location
- Repository: Alibaba ROLL
- File: roll/utils/functionals.py
- Lines: L596-830
Signature
```python
def compute_token_reward(
    data: DataProto,
    pipeline_config: PPOConfig,
    kl_ctrl: AdaptiveKLController
) -> Tuple[DataProto, Dict[str, float]]:
    """
    Compute token-level rewards with KL divergence penalty.

    Args:
        data: DataProto with response_level_rewards, old_log_probs, ref_log_probs
        pipeline_config: Config with KL penalty settings
        kl_ctrl: Adaptive KL controller

    Returns:
        (Modified DataProto with token_level_rewards, metrics dict)
    """

@torch.no_grad()
def reward_postprocess(
    data: DataProto,
    pipeline_config: RLVRConfig,
    running_ctrl
) -> Tuple[DataProto, Dict[str, float]]:
    """
    Post-process response-level rewards with normalization and clipping.

    Args:
        data: DataProto with response_level_rewards
        pipeline_config: RLVR config with normalization settings
        running_ctrl: Running statistics controller

    Returns:
        (Modified DataProto with normalized rewards, metrics dict)
    """

def compute_advantage(
    data: DataProto,
    gamma: float,
    lambd: float,
    adv_estimator: str,
    advantage_clip: Optional[float] = None,
    whiten_advantages: bool = False,
    whiten_rewards: bool = False,
    response_mask: Optional[torch.Tensor] = None,
) -> DataProto:
    """
    Compute advantages and returns for policy gradient methods.

    Args:
        data: DataProto with token_level_rewards, optionally values
        gamma: Discount factor
        lambd: GAE lambda
        adv_estimator: "gae", "reinforce", "grpo", "gigpo", "step_reinforce"
        advantage_clip: Clip advantages to [-clip, clip]
        whiten_advantages: Apply whitening to advantages
        whiten_rewards: Apply whitening to rewards
        response_mask: Mask for response tokens

    Returns:
        DataProto with advantages, returns, raw_advantages
    """
```
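When `adv_estimator="gae"`, `compute_advantage` also needs per-token `values` from a critic. The standard GAE recursion it is built on can be sketched as follows (a generic textbook implementation, not ROLL's exact code):

```python
import torch

def gae(rewards: torch.Tensor, values: torch.Tensor,
        gamma: float, lambd: float):
    """Standard GAE over a single trajectory of length T (generic sketch).

    rewards, values: shape (T,); the value after the final token is taken as 0.
    Returns (advantages, returns), each of shape (T,).
    """
    T = rewards.shape[0]
    advantages = torch.zeros(T)
    last_adv = 0.0
    for t in reversed(range(T)):
        next_value = values[t + 1] if t + 1 < T else 0.0
        delta = rewards[t] + gamma * next_value - values[t]   # TD residual
        last_adv = delta + gamma * lambd * last_adv           # GAE recursion
        advantages[t] = last_adv
    returns = advantages + values                             # value-head targets
    return advantages, returns
```

With `gamma=1.0` and `lambd=1.0` (the typical RLVR setting listed below), this reduces to Monte Carlo returns minus the baseline values.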
Import
```python
from roll.utils.functionals import compute_advantage, compute_token_reward, reward_postprocess
```
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| data | DataProto | Yes | Batch with response_level_rewards, old_log_probs, ref_log_probs, response_mask |
| gamma | float | Yes | Discount factor (typically 1.0 for RLVR) |
| lambd | float | Yes | GAE lambda parameter |
| adv_estimator | str | Yes | Advantage estimation method ("gae", "grpo", "reinforce") |
| pipeline_config | PPOConfig | Yes | Configuration with KL penalty and normalization settings |
| kl_ctrl | AdaptiveKLController | Yes | Adaptive KL coefficient controller |
Outputs
| Name | Type | Description |
|---|---|---|
| advantages | torch.Tensor | Per-token advantage estimates |
| returns | torch.Tensor | Per-token return estimates |
| token_level_rewards | torch.Tensor | KL-penalized token-level rewards |
| metrics | Dict[str, float] | KL statistics, clip fractions, reward statistics |
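The `token_level_rewards` output follows the common PPO-style construction: a per-token KL penalty everywhere, plus the scalar response-level reward added at the final response token. A generic sketch (function name and the simple `log pi_old - log pi_ref` KL estimate are assumptions; ROLL's actual KL estimator and coefficient schedule may differ):

```python
import torch

def token_rewards_with_kl(response_reward: torch.Tensor,
                          old_log_probs: torch.Tensor,
                          ref_log_probs: torch.Tensor,
                          response_mask: torch.Tensor,
                          kl_coef: float) -> torch.Tensor:
    """Generic sketch of KL-penalized token rewards (not ROLL's exact code).

    response_reward: (B,) scalar reward per response.
    old_log_probs, ref_log_probs, response_mask: (B, T).
    """
    # Simple per-token KL estimate, masked to response tokens
    kl = (old_log_probs - ref_log_probs) * response_mask
    token_rewards = -kl_coef * kl
    # Locate each sequence's last response token via the largest masked position
    positions = torch.arange(response_mask.shape[1])
    last_idx = (response_mask * positions).argmax(dim=1)
    token_rewards[torch.arange(token_rewards.shape[0]), last_idx] += response_reward
    return token_rewards
```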
Usage Examples
Full Advantage Pipeline
```python
from roll.utils.functionals import compute_advantage, compute_token_reward, reward_postprocess

# Step 1: Post-process response-level rewards (normalization and clipping)
data, reward_metrics = reward_postprocess(data, pipeline_config, running_ctrl)

# Step 2: Add token-level KL penalty
data, kl_metrics = compute_token_reward(data, pipeline_config, kl_ctrl)

# Step 3: Compute advantages
data = compute_advantage(
    data=data,
    gamma=1.0,
    lambd=1.0,
    adv_estimator="grpo",
    advantage_clip=5.0,
    whiten_advantages=True,
)

# Access results
advantages = data.batch["advantages"]
returns = data.batch["returns"]
```
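The `whiten_advantages=True` flag above refers to the usual RLHF whitening step: normalizing advantages to zero mean and unit variance over response tokens only. A minimal masked-whitening sketch (generic, assuming a 0/1 response mask; not ROLL's exact implementation):

```python
import torch

def masked_whiten(x: torch.Tensor, mask: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Whiten x to zero mean / unit variance over masked positions only
    (generic sketch of the standard whitening step)."""
    n = mask.sum()
    mean = (x * mask).sum() / n
    var = ((x - mean) ** 2 * mask).sum() / n
    # Zero out non-response positions in the result
    return (x - mean) * torch.rsqrt(var + eps) * mask
```

Whitening stabilizes the policy gradient scale across batches; `advantage_clip` then bounds any remaining outliers.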
Related Pages
- Implements Principle
- Requires Environment