Principle: OpenRLHF PPO Policy Loss
| Knowledge Sources | |
|---|---|
| Domains | Reinforcement_Learning, Loss_Functions |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
A clipped surrogate objective that optimizes a policy by constraining probability ratio changes, preventing destructively large policy updates.
Description
PPO Policy Loss implements the clipped surrogate objective from Proximal Policy Optimization. It computes the probability ratio between the current and old policy, clips it to a narrow range, and takes the minimum with the unclipped objective to ensure conservative policy updates. OpenRLHF extends this with dual-clip PPO (additional lower bound for negative advantages), GSPO (sequence-level ratios), and vLLM importance sampling corrections.
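The GSPO variant mentioned above replaces per-token probability ratios with a single length-normalized, sequence-level importance ratio. A minimal sketch (the function name and plain-float interface are illustrative, not OpenRLHF's actual API):

```python
import math

def gspo_sequence_ratio(log_probs, old_log_probs):
    """Length-normalized sequence-level importance ratio:
    s(theta) = exp(mean_t(log pi_theta(t) - log pi_old(t))),
    used by GSPO in place of per-token ratios."""
    assert len(log_probs) == len(old_log_probs) and log_probs
    mean_log_ratio = sum(lp - olp for lp, olp in zip(log_probs, old_log_probs)) / len(log_probs)
    return math.exp(mean_log_ratio)

# A sequence whose tokens are each twice as likely under the new
# policy yields a sequence-level ratio of ~2, regardless of length.
ratio = gspo_sequence_ratio([math.log(2.0)] * 3, [0.0] * 3)
```

Because the geometric mean is taken over tokens, a single outlier token perturbs the sequence ratio far less than it would a per-token ratio.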
Usage
Used as the actor loss in PPO and Math-GRPO training. Instantiated by the PPO trainer with configurable clip epsilon, dual-clip threshold, and loss type.
Theoretical Basis
Standard PPO-Clip:

$$L^{\text{CLIP}}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\!\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right]$$

where $r_t(\theta) = \dfrac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$ and $\hat{A}_t$ is the advantage.

Dual-clip: Adds a lower bound for negative advantages ($\hat{A}_t < 0$):

$$L^{\text{dual}}(\theta) = \mathbb{E}_t\left[\max\left(\min\left(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\!\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right),\ c\,\hat{A}_t\right)\right], \quad c > 1
$$
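The clipped objective and the dual-clip lower bound can be sketched per token in plain Python (a simplified illustration; the actual implementation operates on batched PyTorch tensors with masking):

```python
import math

def ppo_policy_loss(log_prob, old_log_prob, advantage,
                    clip_eps=0.2, dual_clip=None):
    """Per-token PPO-Clip loss with an optional dual-clip bound.

    Returns the negated surrogate objective (lower is better).
    `dual_clip` is the constant c > 1; it only activates for
    negative advantages."""
    # Probability ratio r_t = pi_theta / pi_old, from log-probs.
    ratio = math.exp(log_prob - old_log_prob)
    surr1 = ratio * advantage
    # Clip the ratio to [1 - eps, 1 + eps] before weighting.
    surr2 = max(min(ratio, 1.0 + clip_eps), 1.0 - clip_eps) * advantage
    loss = -min(surr1, surr2)
    if dual_clip is not None and advantage < 0:
        # Dual-clip bounds the objective below by c * A (with A < 0),
        # equivalently bounding the loss above by -c * A.
        loss = min(loss, -dual_clip * advantage)
    return loss
```

For example, with a ratio of 5 and advantage -1, the standard clipped loss would be 5 (the unclipped term wins the min), but dual-clip with c = 3 caps the loss at 3, preventing a rare negative-advantage token with an exploded ratio from dominating the gradient.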
Related Pages
Implemented By