
Implementation:Alibaba ROLL Agentic Compute Advantage

From Leeroopedia


Knowledge Sources
Domains Reinforcement_Learning, Agentic_AI
Last Updated 2026-02-07 20:00 GMT

Overview

A concrete advantage-estimation function for agentic RL training, with multi-estimator support, provided by the Alibaba ROLL library.

Description

The agentic_compute_advantage function computes per-token advantages for agentic RL training. It supports six estimators (GAE, Reinforce, GRPO, GiGPO, Step-Reinforce, Agentic-Reinforce) and handles advantage whitening, clipping, and segment-aware computation.
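To make the estimator family concrete, here is an illustrative sketch of the GAE recursion, the first estimator listed. This is a plain-Python stand-in for a single trajectory, not ROLL's implementation, which operates on batched, masked tensors:

```python
def gae_advantages(rewards, values, gamma, lambd):
    """Generalized Advantage Estimation over one trajectory.

    rewards: per-step rewards r_t
    values:  value estimates V(s_t), with values[len(rewards)] as the
             bootstrap value for the final state
    """
    advantages = [0.0] * len(rewards)
    gae = 0.0
    for t in reversed(range(len(rewards))):
        # TD residual: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Exponentially weighted sum of future residuals (lambda-weighted)
        gae = delta + gamma * lambd * gae
        advantages[t] = gae
    # Returns used as value-function targets: A_t + V(s_t)
    returns = [a + v for a, v in zip(advantages, values[:-1])]
    return advantages, returns
```

With gamma = lambd = 1.0 (as in the usage example below, where no discounting is applied), the advantage at each step reduces to the undiscounted sum of future rewards minus the current value estimate.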

Usage

Called by the agentic pipeline after reward computation and before the policy optimization step.

Code Reference

Source Location

  • Repository: Alibaba ROLL
  • File: roll/pipeline/agentic/utils.py
  • Lines: L456-509

Signature

@torch.no_grad()
def agentic_compute_advantage(
    data: DataProto,
    gamma: float,
    lambd: float,
    adv_estimator: str,
    advantage_clip: Optional[float] = None,
    whiten_advantages: bool = False,
    whiten_rewards: bool = False,
    response_mask: Optional[torch.Tensor] = None,
) -> DataProto:
    """
    Compute advantages and returns using specified estimator.

    Args:
        data: DataProto with token_level_rewards, optionally values
        gamma: Discount factor
        lambd: Lambda for GAE
        adv_estimator: "gae"/"reinforce"/"grpo"/"gigpo"/"step_reinforce"/"agentic_reinforce"
        advantage_clip: Clip advantages to [-clip, +clip]
        whiten_advantages: Apply advantage whitening
        whiten_rewards: Apply reward whitening
        response_mask: Mask for valid positions

    Returns:
        DataProto with raw_advantages, advantages, returns
    """

Import

from roll.pipeline.agentic.utils import agentic_compute_advantage

I/O Contract

Inputs

  • data (DataProto, required): Batch with token_level_rewards, response_mask, and optionally values
  • gamma (float, required): Discount factor
  • lambd (float, required): GAE lambda parameter
  • adv_estimator (str, required): Advantage estimator identifier
  • advantage_clip (Optional[float], default None): Clip advantages to [-clip, +clip]
  • whiten_advantages (bool, default False): Apply advantage whitening
  • whiten_rewards (bool, default False): Apply reward whitening
  • response_mask (Optional[torch.Tensor], default None): Mask for valid token positions

Outputs

  • advantages (torch.Tensor): Advantages per token, after optional whitening and clipping
  • raw_advantages (torch.Tensor): Advantages before clipping
  • returns (torch.Tensor): Computed returns per token
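The split between raw_advantages and advantages reflects the post-processing the flags control. A minimal sketch of the conventional whiten-then-clip sequence follows; the helper names are illustrative, not ROLL internals:

```python
import math

def whiten(xs, eps=1e-8):
    """Normalize values to zero mean and unit variance (whiten_advantages)."""
    mean = sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / len(xs)
    return [(x - mean) / math.sqrt(var + eps) for x in xs]

def clip(xs, limit):
    """Clamp each value to [-limit, +limit] (advantage_clip)."""
    return [max(-limit, min(limit, x)) for x in xs]

raw_advantages = [4.0, -2.0, 0.5, 10.0]
whitened = whiten(raw_advantages)     # applied when whiten_advantages=True
advantages = clip(whitened, 2.0)      # applied when advantage_clip=2.0
```

Whitening stabilizes gradient scale across batches, while clipping bounds the influence of outlier trajectories; keeping the unclipped values in raw_advantages lets downstream code inspect the pre-clip distribution.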

Usage Examples

from roll.pipeline.agentic.utils import agentic_compute_advantage

data = agentic_compute_advantage(
    data=batch_with_rewards,
    gamma=1.0,
    lambd=1.0,
    adv_estimator="gigpo",
    advantage_clip=5.0,
    whiten_advantages=True,
)

advantages = data.batch["advantages"]
returns = data.batch["returns"]
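Since the example above selects "gigpo" and the function also accepts "grpo", it may help to recall what group-relative estimators compute. The sketch below shows the standard GRPO-style normalization of scalar rewards across rollouts sampled for the same prompt; it is a simplified illustration of the general idea, not ROLL's per-token implementation:

```python
import math

def grpo_advantages(group_rewards, eps=1e-8):
    """Group-relative advantage: normalize each rollout's scalar reward
    against the mean and std of the other rollouts for the same prompt."""
    mean = sum(group_rewards) / len(group_rewards)
    std = math.sqrt(
        sum((r - mean) ** 2 for r in group_rewards) / len(group_rewards)
    )
    return [(r - mean) / (std + eps) for r in group_rewards]
```

Because the baseline comes from sibling rollouts rather than a learned value function, group-relative estimators need no critic, which is why the values field of the input batch is optional.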
