
Principle:Alibaba ROLL Segment Masked Policy Optimization

From Leeroopedia


Knowledge Sources
Domains Reinforcement_Learning, Agentic_AI
Last Updated 2026-02-07 20:00 GMT

Overview

A policy optimization principle that supports both token-level and segment-level PPO ratio computation for multi-turn agentic RL training.

Description

Segment Masked Policy Optimization extends standard PPO with support for segment-level policy ratios in the style of GSPO (Group Sequence Policy Optimization). In multi-turn trajectories, each turn produces a response segment. Rather than computing importance ratios per token, segment-level computation averages the log-probability ratios within each segment, then applies PPO clipping at the segment level. This better captures the turn structure of multi-turn dialogue.
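The segment-level averaging described above can be sketched as follows; the function name and tensor layout are illustrative, not ROLL's actual API:

```python
import math
import torch

def segment_ratio(logprobs, old_logprobs, segment_ids):
    """Average per-token log-ratios within each segment, then exponentiate.

    segment_ids maps each token position to its segment index (0..K-1).
    Returns one ratio per token, constant within a segment.
    """
    log_ratio = logprobs - old_logprobs
    num_segments = int(segment_ids.max().item()) + 1
    sums = torch.zeros(num_segments).scatter_add_(0, segment_ids, log_ratio)
    counts = torch.zeros(num_segments).scatter_add_(
        0, segment_ids, torch.ones_like(log_ratio))
    seg_log_ratio = sums / counts.clamp(min=1.0)
    # broadcast each segment's averaged ratio back to its token positions
    return torch.exp(seg_log_ratio)[segment_ids]
```

Because the ratio is constant within a segment, clipping it is equivalent to clipping once per segment, which is the GSPO-style behavior.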

The loss function also supports:

  • Asymmetric PPO clipping: Different clip ranges for positive and negative advantages
  • Dual clipping: Additional clipping for large negative advantages
  • KL penalty: Divergence penalty with reference model
  • Entropy regularization: Encouraging exploration
  • Train/infer correction: Correcting for log-probability discrepancies between training and inference
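The two clipping variants in the list above can be sketched as follows; the parameter names and the dual-clip constant are illustrative defaults, not ROLL's exact configuration:

```python
import torch

def clipped_surrogate(ratio, adv, eps_low=0.2, eps_high=0.3, dual_clip=3.0):
    """Asymmetric PPO clipping plus dual clipping for negative advantages."""
    surr_unclipped = ratio * adv
    # asymmetric clip: different lower and upper bounds on the ratio
    surr_clipped = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high) * adv
    surr = torch.min(surr_unclipped, surr_clipped)
    # dual clip: for adv < 0, bound the surrogate from below by dual_clip * adv
    surr = torch.where(adv < 0, torch.max(surr, dual_clip * adv), surr)
    return -surr.mean()
```

Without the dual clip, a very large ratio paired with a large negative advantage would produce an unboundedly large loss term; the extra bound caps that contribution.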

Usage

Use this principle during the policy update step of agentic RL training. Select ratio_type="segment" for GSPO-style training or ratio_type="token" for standard PPO.

Theoretical Basis

Segment-Level PPO (GSPO)

$r_{\mathrm{segment}}(\theta) = \exp\!\left(\frac{1}{|S|}\sum_{t \in S}\left[\log \pi_\theta(a_t \mid s_t) - \log \pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)\right]\right)$

Where S is the set of tokens in a response segment.

Asymmetric Clipping

$L = \min\!\left(r\hat{A},\ \mathrm{clip}\!\left(r,\, 1-\epsilon_{\mathrm{low}},\, 1+\epsilon_{\mathrm{high}}\right)\hat{A}\right)$
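A worked numeric check of the asymmetric clip, using illustrative values (not defaults from the source):

```python
# r = 1.5, eps_low = 0.2, eps_high = 0.3, advantage A = 1.0
r, eps_low, eps_high, A = 1.5, 0.2, 0.3, 1.0
clipped = min(max(r, 1.0 - eps_low), 1.0 + eps_high)  # clip(1.5, 0.8, 1.3) -> 1.3
loss_term = min(r * A, clipped * A)                   # min(1.5, 1.3) -> 1.3
```

With a positive advantage, the upper bound 1 + eps_high is the binding one, so raising eps_high alone loosens the clip only for positively-advantaged updates.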

Related Pages

Implemented By

Related Heuristics
