Implementation:Alibaba ROLL Compute Advantage
| Knowledge Sources | |
|---|---|
| Domains | Reinforcement_Learning, Optimization |
| Last Updated | 2026-02-07 20:00 GMT |
Overview
Concrete advantage estimation and KL penalty utility functions provided by the Alibaba ROLL library.
Description
The compute_advantage, compute_token_reward, and reward_postprocess functions in roll/utils/functionals.py implement the full advantage estimation pipeline. reward_postprocess normalizes and clips response-level rewards. compute_token_reward adds token-level KL divergence penalties. compute_advantage computes the final per-token advantages using configurable estimators (GAE, GRPO, Reinforce++).
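To make the group-relative idea behind the "grpo" estimator concrete, here is a generic sketch of GRPO-style advantage computation (the function name, shapes, and `eps` are illustrative assumptions, not ROLL's actual code): each response's scalar reward is normalized by the mean and standard deviation of the other responses sampled for the same prompt.

```python
import torch

def grpo_advantages(rewards: torch.Tensor, group_size: int, eps: float = 1e-6) -> torch.Tensor:
    """Generic GRPO-style sketch (not ROLL's implementation):
    normalize each response reward by the mean/std of its prompt group.

    rewards: shape (num_prompts * group_size,) of scalar response rewards,
             laid out group-by-group.
    Returns a tensor of the same shape with per-response advantages.
    """
    grouped = rewards.view(-1, group_size)          # (num_prompts, group_size)
    mean = grouped.mean(dim=1, keepdim=True)
    std = grouped.std(dim=1, keepdim=True)          # per-group std (Bessel-corrected)
    adv = (grouped - mean) / (std + eps)
    return adv.view(-1)

# Example: two prompts, three sampled responses each
rewards = torch.tensor([1.0, 0.0, 0.5, 2.0, 2.0, 2.0])
adv = grpo_advantages(rewards, group_size=3)
```

In an actual pipeline the resulting per-response advantage is then broadcast to every token of that response.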
Usage
These functions are called sequentially in the RLVR pipeline's training loop after reward computation and before the policy optimization step.
Code Reference
Source Location
- Repository: Alibaba ROLL
- File: roll/utils/functionals.py
- Lines: L596-830
Signature
```python
def compute_token_reward(
    data: DataProto,
    pipeline_config: PPOConfig,
    kl_ctrl: AdaptiveKLController
) -> Tuple[DataProto, Dict[str, float]]:
    """
    Compute token-level rewards with KL divergence penalty.

    Args:
        data: DataProto with response_level_rewards, old_log_probs, ref_log_probs
        pipeline_config: Config with KL penalty settings
        kl_ctrl: Adaptive KL controller

    Returns:
        (Modified DataProto with token_level_rewards, metrics dict)
    """

@torch.no_grad()
def reward_postprocess(
    data: DataProto,
    pipeline_config: RLVRConfig,
    running_ctrl
) -> Tuple[DataProto, Dict[str, float]]:
    """
    Post-process response-level rewards with normalization and clipping.

    Args:
        data: DataProto with response_level_rewards
        pipeline_config: RLVR config with normalization settings
        running_ctrl: Running statistics controller

    Returns:
        (Modified DataProto with normalized rewards, metrics dict)
    """

def compute_advantage(
    data: DataProto,
    gamma: float,
    lambd: float,
    adv_estimator: str,
    advantage_clip: Optional[float] = None,
    whiten_advantages: bool = False,
    whiten_rewards: bool = False,
    response_mask: Optional[torch.Tensor] = None,
) -> DataProto:
    """
    Compute advantages and returns for policy gradient methods.

    Args:
        data: DataProto with token_level_rewards, optionally values
        gamma: Discount factor
        lambd: GAE lambda
        adv_estimator: "gae", "reinforce", "grpo", "gigpo", "step_reinforce"
        advantage_clip: Clip advantages to [-clip, clip]
        whiten_advantages: Apply whitening to advantages
        whiten_rewards: Apply whitening to rewards
        response_mask: Mask for response tokens

    Returns:
        DataProto with advantages, returns, raw_advantages
    """
```
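When `adv_estimator="gae"`, `compute_advantage` also needs per-token `values` from a critic. The standard GAE recursion it is built on can be sketched as follows (a generic textbook implementation, not ROLL's exact code):

```python
import torch

def gae(rewards: torch.Tensor, values: torch.Tensor,
        gamma: float, lambd: float):
    """Standard GAE over a single trajectory of length T (generic sketch).

    rewards, values: shape (T,); the value after the final token is taken as 0.
    Returns (advantages, returns), each of shape (T,).
    """
    T = rewards.shape[0]
    advantages = torch.zeros(T)
    last_adv = 0.0
    for t in reversed(range(T)):
        next_value = values[t + 1] if t + 1 < T else 0.0
        delta = rewards[t] + gamma * next_value - values[t]   # TD residual
        last_adv = delta + gamma * lambd * last_adv           # GAE recursion
        advantages[t] = last_adv
    returns = advantages + values                             # value-head targets
    return advantages, returns
```

With `gamma=1.0` and `lambd=1.0` (the typical RLVR setting listed below), this reduces to Monte Carlo returns minus the baseline values.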
Import
```python
from roll.utils.functionals import compute_advantage, compute_token_reward, reward_postprocess
```
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| data | DataProto | Yes | Batch with response_level_rewards, old_log_probs, ref_log_probs, response_mask |
| gamma | float | Yes | Discount factor (typically 1.0 for RLVR) |
| lambd | float | Yes | GAE lambda parameter |
| adv_estimator | str | Yes | Advantage estimation method ("gae", "grpo", "reinforce") |
| pipeline_config | PPOConfig | Yes | Configuration with KL penalty and normalization settings |
| kl_ctrl | AdaptiveKLController | Yes | Adaptive KL coefficient controller |
Outputs
| Name | Type | Description |
|---|---|---|
| advantages | torch.Tensor | Per-token advantage estimates |
| returns | torch.Tensor | Per-token return estimates |
| token_level_rewards | torch.Tensor | KL-penalized token-level rewards |
| metrics | Dict[str, float] | KL statistics, clip fractions, reward statistics |
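The `token_level_rewards` output follows the common PPO-style construction: a per-token KL penalty everywhere, plus the scalar response-level reward added at the final response token. A generic sketch (function name and the simple `log pi_old - log pi_ref` KL estimate are assumptions; ROLL's actual KL estimator and coefficient schedule may differ):

```python
import torch

def token_rewards_with_kl(response_reward: torch.Tensor,
                          old_log_probs: torch.Tensor,
                          ref_log_probs: torch.Tensor,
                          response_mask: torch.Tensor,
                          kl_coef: float) -> torch.Tensor:
    """Generic sketch of KL-penalized token rewards (not ROLL's exact code).

    response_reward: (B,) scalar reward per response.
    old_log_probs, ref_log_probs, response_mask: (B, T).
    """
    # Simple per-token KL estimate, masked to response tokens
    kl = (old_log_probs - ref_log_probs) * response_mask
    token_rewards = -kl_coef * kl
    # Locate each sequence's last response token via the largest masked position
    positions = torch.arange(response_mask.shape[1])
    last_idx = (response_mask * positions).argmax(dim=1)
    token_rewards[torch.arange(token_rewards.shape[0]), last_idx] += response_reward
    return token_rewards
```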
Usage Examples
Full Advantage Pipeline
```python
from roll.utils.functionals import compute_advantage, compute_token_reward, reward_postprocess

# Step 1: Post-process response-level rewards (normalization and clipping)
data, reward_metrics = reward_postprocess(data, pipeline_config, running_ctrl)

# Step 2: Add token-level KL penalty
data, kl_metrics = compute_token_reward(data, pipeline_config, kl_ctrl)

# Step 3: Compute advantages
data = compute_advantage(
    data=data,
    gamma=1.0,
    lambd=1.0,
    adv_estimator="grpo",
    advantage_clip=5.0,
    whiten_advantages=True,
)

# Access results
advantages = data.batch["advantages"]
returns = data.batch["returns"]
```
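The `whiten_advantages=True` flag above refers to the usual RLHF whitening step: normalizing advantages to zero mean and unit variance over response tokens only. A minimal masked-whitening sketch (generic, assuming a 0/1 response mask; not ROLL's exact implementation):

```python
import torch

def masked_whiten(x: torch.Tensor, mask: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Whiten x to zero mean / unit variance over masked positions only
    (generic sketch of the standard whitening step)."""
    n = mask.sum()
    mean = (x * mask).sum() / n
    var = ((x - mean) ** 2 * mask).sum() / n
    # Zero out non-response positions in the result
    return (x - mean) * torch.rsqrt(var + eps) * mask
```

Whitening stabilizes the policy gradient scale across batches; `advantage_clip` then bounds any remaining outliers.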
Related Pages
- Implements Principle
- Requires Environment