
Heuristic:Alibaba ROLL PPO Clipping Defaults

From Leeroopedia



Knowledge Sources
Domains Reinforcement_Learning, Optimization
Last Updated 2026-02-07 19:00 GMT

Overview

Standard PPO clipping range of 0.2 for the policy gradient loss, with an optional dual-clip loss to stabilize training on negative advantages, and GAE defaults of lambda=0.95 / gamma=1.0.

Description

ROLL uses the standard PPO clipping defaults established in the original paper. The policy gradient clipping range of 0.2 constrains the importance sampling ratio to [0.8, 1.2], preventing destructively large policy updates. Asymmetric clipping (`pg_clip_low`/`pg_clip_high`) is available for fine-grained control. Additionally, a dual-clip loss variant prevents over-optimization on negative advantages by applying an extra upper bound of `(1 + pg_clip * 2) * advantages` when advantages are negative. The GAE parameters use lambda=0.95 (standard bias-variance tradeoff) and gamma=1.0 (no discounting, appropriate for episodic LLM generation tasks).
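To make the clipping behavior concrete, here is a minimal standalone sketch of the clipped PPO policy-gradient loss with the asymmetric bounds described above. The function name and signature are illustrative, not ROLL's actual API (ROLL's loss lives in its worker code and reads these values from config); the clamping and pessimistic-max structure follow the standard PPO objective.

```python
import torch

def ppo_pg_loss(log_probs, old_log_probs, advantages,
                pg_clip_low=0.2, pg_clip_high=0.2):
    """Clipped PPO policy-gradient loss with asymmetric clip bounds.

    Illustrative sketch only. With the defaults pg_clip_low=pg_clip_high=0.2,
    the importance ratio is constrained to [0.8, 1.2].
    """
    ratio = torch.exp(log_probs - old_log_probs)  # importance sampling ratio
    clipped = torch.clamp(ratio, 1.0 - pg_clip_low, 1.0 + pg_clip_high)
    # Pessimistic objective: take the worse (larger) of the two surrogate losses.
    loss = torch.max(-ratio * advantages, -clipped * advantages)
    return loss.mean()
```

With a positive advantage and a ratio of 1.5, the clipped surrogate caps the per-token objective at 1.2 * advantage, so further increasing the ratio yields no additional gradient signal.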

Usage

Apply these defaults as the starting point for all PPO-based training in ROLL (RLVR, Agentic RL pipelines). Adjust `pg_clip` downward (e.g., 0.1) for more conservative updates on smaller models, or upward (e.g., 0.3) for more aggressive exploration on larger models.

The Insight (Rule of Thumb)

  • Action: Use `pg_clip=0.2` as the default PPO clipping range.
  • Value: `pg_clip=0.2`, `lambd=0.95`, `gamma=1.0`, `max_grad_norm=1.0`.
  • Trade-off: Lower clip values (e.g., 0.1) give more stable but slower learning; higher values (e.g., 0.3) learn faster but risk instability.
  • Enhancement: Enable `dual_clip_loss=True` if observing reward collapse from negative advantage over-optimization.
  • Gradient Norm: Always clip gradients at norm=1.0 to prevent gradient explosions.

Reasoning

The PPO clipping range of 0.2 is the original paper's recommended value and has been validated across thousands of LLM RL experiments. The ratio [0.8, 1.2] allows enough policy change per step to make progress while preventing catastrophic forgetting. Lambda=0.95 in GAE provides a good balance between bias (low lambda) and variance (high lambda). Gamma=1.0 is appropriate because LLM response generation is episodic; there is no temporal discounting needed within a single response.
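The GAE recursion with these defaults can be sketched as follows. This is a generic single-episode implementation, not ROLL's internal advantage code; it assumes the episode terminates, so the bootstrap value after the final step is zero.

```python
import torch

def compute_gae(rewards, values, gamma=1.0, lambd=0.95):
    """Generalized Advantage Estimation over one terminating rollout.

    Illustrative sketch using ROLL's defaults (gamma=1.0, lambd=0.95).
    advantage[t] = sum_k (gamma * lambd)^k * delta[t+k], computed backward.
    """
    T = rewards.shape[0]
    advantages = torch.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        next_value = values[t + 1] if t + 1 < T else 0.0  # terminal bootstrap = 0
        delta = rewards[t] + gamma * next_value - values[t]  # TD residual
        gae = delta + gamma * lambd * gae
        advantages[t] = gae
    return advantages
```

With gamma=1.0, a terminal reward of 1 propagates backward undiscounted except for the lambda factor: a 3-step episode with reward only at the end yields advantages of 0.9025, 0.95, and 1.0 (given zero value estimates), showing how lambda=0.95 smoothly attenuates credit assignment to earlier tokens.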

Code from `roll/configs/base_config.py:425-437`:

ppo_epochs: int = field(default=1)
max_grad_norm: float = field(default=1.0)
lambd: float = field(default=0.95)
gamma: float = field(default=1)
pg_clip: Optional[float] = field(default=0.2)
pg_clip_low: Optional[float] = field(default=0.2)
pg_clip_high: Optional[float] = field(default=0.2)

Dual clip loss from `roll/pipeline/base_worker.py:230-232`:

if self.pipeline_config.dual_clip_loss:
    dual_clip_loss = -torch.max(-pg_loss, (1 + self.pipeline_config.pg_clip * 2) * advantages)
    pg_loss = torch.where(advantages < 0, dual_clip_loss, pg_loss)
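A self-contained sketch mirroring the quoted snippet shows the effect numerically; the function is hypothetical, but the dual-clip arithmetic matches the code above. At the default pg_clip=0.2, the factor (1 + pg_clip * 2) = 1.4 caps how much loss a single negative-advantage token can contribute when its ratio blows up.

```python
import torch

def dual_clip_pg_loss(ratio, advantages, pg_clip=0.2, dual_clip=True):
    """PPO loss with the dual-clip bound applied on negative advantages.

    Illustrative sketch mirroring the quoted ROLL logic.
    """
    clipped = torch.clamp(ratio, 1.0 - pg_clip, 1.0 + pg_clip)
    pg_loss = torch.max(-ratio * advantages, -clipped * advantages)
    if dual_clip:
        # For advantages < 0, bound the loss magnitude by (1 + pg_clip*2)*|A|.
        dual = -torch.max(-pg_loss, (1 + pg_clip * 2) * advantages)
        pg_loss = torch.where(advantages < 0, dual, pg_loss)
    return pg_loss
```

For ratio=5.0 and advantage=-1.0, the plain clipped loss is 5.0 (the clip does not bind in this direction), while the dual-clip variant caps it at 1.4, preventing a single off-policy token with negative advantage from dominating the update.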
