
Implementation:Alibaba ROLL Agentic Compute Advantage

From Leeroopedia


Knowledge Sources
Domains Reinforcement_Learning, Agentic_AI
Last Updated 2026-02-07 20:00 GMT

Overview

A concrete advantage-estimation function for agentic RL training, with multi-estimator support, provided by the Alibaba ROLL library.

Description

The agentic_compute_advantage function computes per-token advantages for agentic RL training. It supports six estimators (GAE, Reinforce, GRPO, GiGPO, Step-Reinforce, Agentic-Reinforce) and handles advantage whitening, clipping, and segment-aware computation.
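To make the estimator family concrete, here is an illustrative sketch of the GAE recursion, the first estimator listed. This is a plain-Python stand-in for a single trajectory, not ROLL's implementation, which operates on batched, masked tensors:

```python
def gae_advantages(rewards, values, gamma, lambd):
    """Generalized Advantage Estimation over one trajectory.

    rewards: per-step rewards r_t
    values:  value estimates V(s_t), with values[len(rewards)] as the
             bootstrap value for the final state
    """
    advantages = [0.0] * len(rewards)
    gae = 0.0
    for t in reversed(range(len(rewards))):
        # TD residual: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Exponentially weighted sum of future residuals (lambda-weighted)
        gae = delta + gamma * lambd * gae
        advantages[t] = gae
    # Returns used as value-function targets: A_t + V(s_t)
    returns = [a + v for a, v in zip(advantages, values[:-1])]
    return advantages, returns
```

With gamma = lambd = 1.0 (as in the usage example below, where no discounting is applied), the advantage at each step reduces to the undiscounted sum of future rewards minus the current value estimate.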

Usage

Called by the agentic pipeline after reward computation and before the policy optimization step.

Code Reference

Source Location

  • Repository: Alibaba ROLL
  • File: roll/pipeline/agentic/utils.py
  • Lines: L456-509

Signature

@torch.no_grad()
def agentic_compute_advantage(
    data: DataProto,
    gamma: float,
    lambd: float,
    adv_estimator: str,
    advantage_clip: Optional[float] = None,
    whiten_advantages: bool = False,
    whiten_rewards: bool = False,
    response_mask: Optional[torch.Tensor] = None,
) -> DataProto:
    """
    Compute advantages and returns using specified estimator.

    Args:
        data: DataProto with token_level_rewards, optionally values
        gamma: Discount factor
        lambd: Lambda for GAE
        adv_estimator: "gae"/"reinforce"/"grpo"/"gigpo"/"step_reinforce"/"agentic_reinforce"
        advantage_clip: Clip advantages to [-clip, +clip]
        whiten_advantages: Apply advantage whitening
        whiten_rewards: Apply reward whitening
        response_mask: Mask for valid positions

    Returns:
        DataProto with raw_advantages, advantages, returns
    """

Import

from roll.pipeline.agentic.utils import agentic_compute_advantage

I/O Contract

Inputs

  • data (DataProto, required): Batch with token_level_rewards, response_mask, and optionally values
  • gamma (float, required): Discount factor
  • lambd (float, required): GAE lambda parameter
  • adv_estimator (str, required): Advantage estimator identifier
  • advantage_clip (Optional[float], default None): Clip advantages to [-clip, +clip]
  • whiten_advantages (bool, default False): Apply advantage whitening
  • whiten_rewards (bool, default False): Apply reward whitening
  • response_mask (Optional[torch.Tensor], default None): Mask for valid token positions

Outputs

  • advantages (torch.Tensor): Advantages per token, after optional whitening and clipping
  • raw_advantages (torch.Tensor): Advantages before clipping
  • returns (torch.Tensor): Computed returns per token
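The split between raw_advantages and advantages reflects the post-processing the flags control. A minimal sketch of the conventional whiten-then-clip sequence follows; the helper names are illustrative, not ROLL internals:

```python
import math

def whiten(xs, eps=1e-8):
    """Normalize values to zero mean and unit variance (whiten_advantages)."""
    mean = sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / len(xs)
    return [(x - mean) / math.sqrt(var + eps) for x in xs]

def clip(xs, limit):
    """Clamp each value to [-limit, +limit] (advantage_clip)."""
    return [max(-limit, min(limit, x)) for x in xs]

raw_advantages = [4.0, -2.0, 0.5, 10.0]
whitened = whiten(raw_advantages)     # applied when whiten_advantages=True
advantages = clip(whitened, 2.0)      # applied when advantage_clip=2.0
```

Whitening stabilizes gradient scale across batches, while clipping bounds the influence of outlier trajectories; keeping the unclipped values in raw_advantages lets downstream code inspect the pre-clip distribution.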

Usage Examples

from roll.pipeline.agentic.utils import agentic_compute_advantage

data = agentic_compute_advantage(
    data=batch_with_rewards,
    gamma=1.0,
    lambd=1.0,
    adv_estimator="gigpo",
    advantage_clip=5.0,
    whiten_advantages=True,
)

advantages = data.batch["advantages"]
returns = data.batch["returns"]
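Since the example above selects "gigpo" and the function also accepts "grpo", it may help to recall what group-relative estimators compute. The sketch below shows the standard GRPO-style normalization of scalar rewards across rollouts sampled for the same prompt; it is a simplified illustration of the general idea, not ROLL's per-token implementation:

```python
import math

def grpo_advantages(group_rewards, eps=1e-8):
    """Group-relative advantage: normalize each rollout's scalar reward
    against the mean and std of the other rollouts for the same prompt."""
    mean = sum(group_rewards) / len(group_rewards)
    std = math.sqrt(
        sum((r - mean) ** 2 for r in group_rewards) / len(group_rewards)
    )
    return [(r - mean) / (std + eps) for r in group_rewards]
```

Because the baseline comes from sibling rollouts rather than a learned value function, group-relative estimators need no critic, which is why the values field of the input batch is optional.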
