
Principle:Alibaba ROLL Segment Masked Policy Optimization

From Leeroopedia


Knowledge Sources
Domains Reinforcement_Learning, Agentic_AI
Last Updated 2026-02-07 20:00 GMT

Overview

A policy optimization principle that supports both token-level and segment-level PPO ratio computation for multi-turn agentic RL training.

Description

Segment Masked Policy Optimization extends standard PPO with support for segment-level policy ratios in the style of GSPO (Group Sequence Policy Optimization). In multi-turn trajectories, each turn produces a response segment. Rather than computing importance ratios per token, segment-level computation averages the log-probability ratios within each segment, then applies PPO clipping at the segment level. This better captures the turn structure of multi-turn dialogue.
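The segment-level averaging described above can be sketched as follows; the function name and tensor layout are illustrative, not ROLL's actual API:

```python
import math
import torch

def segment_ratio(logprobs, old_logprobs, segment_ids):
    """Average per-token log-ratios within each segment, then exponentiate.

    segment_ids maps each token position to its segment index (0..K-1).
    Returns one ratio per token, constant within a segment.
    """
    log_ratio = logprobs - old_logprobs
    num_segments = int(segment_ids.max().item()) + 1
    sums = torch.zeros(num_segments).scatter_add_(0, segment_ids, log_ratio)
    counts = torch.zeros(num_segments).scatter_add_(
        0, segment_ids, torch.ones_like(log_ratio))
    seg_log_ratio = sums / counts.clamp(min=1.0)
    # broadcast each segment's averaged ratio back to its token positions
    return torch.exp(seg_log_ratio)[segment_ids]
```

Because the ratio is constant within a segment, clipping it is equivalent to clipping once per segment, which is the GSPO-style behavior.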

The loss function also supports:

  • Asymmetric PPO clipping: Different clip ranges for positive and negative advantages
  • Dual clipping: Additional clipping for large negative advantages
  • KL penalty: Divergence penalty with reference model
  • Entropy regularization: Encouraging exploration
  • Train/infer correction: Correcting for log-probability discrepancies between training and inference
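The two clipping variants in the list above can be sketched as follows; the parameter names and the dual-clip constant are illustrative defaults, not ROLL's exact configuration:

```python
import torch

def clipped_surrogate(ratio, adv, eps_low=0.2, eps_high=0.3, dual_clip=3.0):
    """Asymmetric PPO clipping plus dual clipping for negative advantages."""
    surr_unclipped = ratio * adv
    # asymmetric clip: different lower and upper bounds on the ratio
    surr_clipped = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high) * adv
    surr = torch.min(surr_unclipped, surr_clipped)
    # dual clip: for adv < 0, bound the surrogate from below by dual_clip * adv
    surr = torch.where(adv < 0, torch.max(surr, dual_clip * adv), surr)
    return -surr.mean()
```

Without the dual clip, a very large ratio paired with a large negative advantage would produce an unboundedly large loss term; the extra bound caps that contribution.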

Usage

Use this principle during the policy update step of agentic RL training. Select ratio_type="segment" for GSPO-style training or ratio_type="token" for standard PPO.

Theoretical Basis

Segment-Level PPO (GSPO)

$r_{\mathrm{segment}}(\theta) = \exp\!\left(\frac{1}{|S|}\sum_{t \in S}\left[\log \pi_\theta(a_t \mid s_t) - \log \pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)\right]\right)$

Where S is the set of tokens in a response segment.

Asymmetric Clipping

$L = \min\!\left(r\hat{A},\ \mathrm{clip}\!\left(r,\, 1-\epsilon_{\mathrm{low}},\, 1+\epsilon_{\mathrm{high}}\right)\hat{A}\right)$
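A worked numeric check of the asymmetric clip, using illustrative values (not defaults from the source):

```python
# r = 1.5, eps_low = 0.2, eps_high = 0.3, advantage A = 1.0
r, eps_low, eps_high, A = 1.5, 0.2, 0.3, 1.0
clipped = min(max(r, 1.0 - eps_low), 1.0 + eps_high)  # clip(1.5, 0.8, 1.3) -> 1.3
loss_term = min(r * A, clipped * A)                   # min(1.5, 1.3) -> 1.3
```

With a positive advantage, the upper bound 1 + eps_high is the binding one, so raising eps_high alone loosens the clip only for positively-advantaged updates.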

Related Pages

Implemented By

Related Heuristics
