Implementation:Alibaba ROLL Agentic Compute Advantage
| Knowledge Sources | |
|---|---|
| Domains | Reinforcement_Learning, Agentic_AI |
| Last Updated | 2026-02-07 20:00 GMT |
Overview
A concrete advantage-estimation function with multi-estimator support, provided by the Alibaba ROLL library for agentic RL training.
Description
The agentic_compute_advantage function computes per-token advantages for agentic RL training. It supports six estimators (GAE, Reinforce, GRPO, GiGPO, Step-Reinforce, Agentic-Reinforce) and handles reward whitening, advantage whitening, advantage clipping, and segment-aware computation.
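As a hedged illustration of one listed estimator: GRPO replaces a learned value baseline with the group mean of trajectory rewards. The sketch below is plain PyTorch, not ROLL's internal code, and the function name grpo_advantages is hypothetical:

```python
import torch

def grpo_advantages(group_rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Group-relative advantage (illustrative sketch, not ROLL code):
    normalize each trajectory's scalar reward by the mean/std of the
    rollout group sampled for the same prompt."""
    mean = group_rewards.mean()
    std = group_rewards.std()
    return (group_rewards - mean) / (std + eps)

# Four rollouts of one prompt; above-average rewards get positive advantage.
rewards = torch.tensor([1.0, 0.0, 0.5, 0.5])
adv = grpo_advantages(rewards)
```

Each trajectory's advantage is then typically broadcast over its response tokens, which is where the per-token shape in the I/O contract below comes from.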
Usage
Called by the agentic pipeline after reward computation and before the policy optimization step.
Code Reference
Source Location
- Repository: Alibaba ROLL
- File: roll/pipeline/agentic/utils.py
- Lines: L456-509
Signature
```python
@torch.no_grad()
def agentic_compute_advantage(
    data: DataProto,
    gamma: float,
    lambd: float,
    adv_estimator: str,
    advantage_clip: Optional[float] = None,
    whiten_advantages: bool = False,
    whiten_rewards: bool = False,
    response_mask: Optional[torch.Tensor] = None,
) -> DataProto:
    """
    Compute advantages and returns using the specified estimator.

    Args:
        data: DataProto with token_level_rewards and, optionally, values
        gamma: Discount factor
        lambd: Lambda for GAE
        adv_estimator: "gae" / "reinforce" / "grpo" / "gigpo" /
            "step_reinforce" / "agentic_reinforce"
        advantage_clip: Clip advantages to [-clip, +clip]
        whiten_advantages: Apply advantage whitening
        whiten_rewards: Apply reward whitening
        response_mask: Mask for valid token positions

    Returns:
        DataProto with raw_advantages, advantages, returns
    """
```
Import
```python
from roll.pipeline.agentic.utils import agentic_compute_advantage
```
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| data | DataProto | Yes | Batch with token_level_rewards, response_mask, optionally values |
| gamma | float | Yes | Discount factor |
| lambd | float | Yes | GAE lambda parameter |
| adv_estimator | str | Yes | Advantage estimator identifier |
| advantage_clip | Optional[float] | No | Symmetric clip bound for advantages |
| whiten_advantages | bool | No | Whether to whiten advantages |
| whiten_rewards | bool | No | Whether to whiten rewards |
| response_mask | Optional[torch.Tensor] | No | Mask for valid token positions |
Outputs
| Name | Type | Description |
|---|---|---|
| advantages | torch.Tensor | Clipped advantages per token |
| raw_advantages | torch.Tensor | Unclipped advantages |
| returns | torch.Tensor | Computed returns per token |
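The relationship between the outputs: raw_advantages is the estimator's direct result, and advantages is that tensor after optional whitening and clipping. A hedged sketch of such post-processing (the ordering of whitening vs. clipping inside ROLL is an assumption, and postprocess is a hypothetical helper):

```python
import torch

def postprocess(raw_adv, mask, clip=5.0, whiten=True, eps=1e-8):
    """Illustrative whiten-then-clip step on per-token advantages (not ROLL code)."""
    adv = raw_adv.clone()
    if whiten:
        # Masked mean/std so padded positions don't skew the statistics.
        n = mask.sum().clamp(min=1)
        mean = (adv * mask).sum() / n
        var = (((adv - mean) * mask) ** 2).sum() / n
        adv = (adv - mean) / (var.sqrt() + eps) * mask
    if clip is not None:
        adv = adv.clamp(-clip, clip)
    return adv

raw = torch.tensor([10.0, -10.0, 0.0, 0.0])
mask = torch.tensor([1.0, 1.0, 1.0, 0.0])  # last position is padding
clipped_only = postprocess(raw, mask, clip=5.0, whiten=False)
# clipped_only -> tensor([ 5., -5.,  0.,  0.])
```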
Usage Examples
```python
from roll.pipeline.agentic.utils import agentic_compute_advantage

data = agentic_compute_advantage(
    data=batch_with_rewards,
    gamma=1.0,
    lambd=1.0,
    adv_estimator="gigpo",
    advantage_clip=5.0,
    whiten_advantages=True,
)
advantages = data.batch["advantages"]
returns = data.batch["returns"]
```