Principle:Alibaba ROLL Segment Masked Policy Optimization
| Knowledge Sources | |
|---|---|
| Domains | Reinforcement_Learning, Agentic_AI |
| Last Updated | 2026-02-07 20:00 GMT |
Overview
A policy optimization principle that supports both token-level and segment-level PPO ratio computation for multi-turn agentic RL training.
Description
Segment Masked Policy Optimization extends standard PPO with support for segment-level policy ratios (GSPO). In multi-turn trajectories, each turn produces a response segment. Rather than computing importance ratios per-token, segment-level computation averages log-probability ratios within each segment, then applies PPO clipping at the segment level. This approach better captures the structure of multi-turn dialogue.
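The token-level versus segment-level contrast described above can be sketched in a few lines of pure Python (function and variable names here are illustrative, not the ROLL API): token-level PPO exponentiates each per-token log-ratio, while the segment-level variant first averages the log-ratios within a segment, then exponentiates.

```python
import math

def token_ratios(logp_new, logp_old):
    # Standard PPO: one importance ratio per token.
    return [math.exp(n - o) for n, o in zip(logp_new, logp_old)]

def segment_ratios(logp_new, logp_old, segments):
    # GSPO-style: average the log-ratio over each segment's tokens,
    # then exponentiate -- the geometric mean of the token ratios.
    ratios = []
    for seg in segments:  # seg is a list of token indices for one turn
        avg = sum(logp_new[t] - logp_old[t] for t in seg) / len(seg)
        ratios.append(math.exp(avg))
    return ratios
```

Clipping is then applied once per segment ratio rather than once per token, so each multi-turn response segment is updated as a unit.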
The loss function also supports:
- Asymmetric PPO clipping: different clip ranges for positive and negative advantages
- Dual clipping: an additional bound on the loss for large negative advantages
- KL penalty: a KL-divergence penalty against a reference model
- Entropy regularization: an entropy bonus that encourages exploration
- Train/infer correction: correction for log-probability discrepancies between training and inference
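The two clipping variants in the list above can be sketched for a single ratio/advantage pair as follows (parameter names and default values are illustrative, not ROLL's actual signature):

```python
def ppo_loss(ratio, adv, clip_low=0.2, clip_high=0.28, dual_clip=3.0):
    """Clipped PPO loss for one ratio/advantage pair (illustrative sketch).

    Asymmetric clipping: the ratio is clipped to [1-clip_low, 1+clip_high],
    so the two sides of the clip range can differ.
    Dual clipping: for negative advantages, the loss is additionally
    bounded by -dual_clip * adv.
    """
    unclipped = ratio * adv
    clipped = max(min(ratio, 1.0 + clip_high), 1.0 - clip_low) * adv
    loss = -min(unclipped, clipped)  # standard clipped PPO objective
    if adv < 0:
        # Dual clip: cap how hard a large negative advantage can push.
        loss = min(loss, -dual_clip * adv)
    return loss
```

With `adv > 0` only the standard clip applies; the dual-clip bound only activates for negative advantages whose unclipped term would otherwise dominate the update.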
Usage
Use this principle during the policy update step of agentic RL training. Select ratio_type="segment" for GSPO-style training or ratio_type="token" for standard PPO.
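A configuration sketch for the update step might look like the following; only `ratio_type` and the feature names come from this page, while the dict layout, key names, and values are illustrative assumptions:

```python
# Hypothetical loss configuration; key names and values are illustrative.
loss_cfg = {
    "ratio_type": "segment",  # "segment" for GSPO-style, "token" for PPO
    "clip_low": 0.2,          # asymmetric clip range, negative side
    "clip_high": 0.28,        # asymmetric clip range, positive side
    "dual_clip": 3.0,         # extra bound for large negative advantages
    "kl_coef": 0.01,          # KL penalty against the reference model
    "entropy_coef": 0.001,    # entropy regularization
}
```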
Theoretical Basis
Segment-Level PPO (GSPO)
$$s_i(\theta) = \exp\!\left(\frac{1}{|S_i|} \sum_{t \in S_i} \log \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}\right)$$
where $S_i$ is the set of tokens in response segment $i$. PPO clipping is then applied to $s_i(\theta)$ rather than to per-token ratios.
Asymmetric Clipping
With separate ranges for the two sides, the clip term becomes $\operatorname{clip}\!\big(s_i(\theta),\, 1-\epsilon_{\mathrm{low}},\, 1+\epsilon_{\mathrm{high}}\big)$, so positive and negative advantages can be clipped with different tolerances.
Related Pages
Implemented By
Related Heuristics
The following heuristics inform this principle: