Principle: Volcengine verl Policy Loss Optimization
| Knowledge Sources | |
|---|---|
| Domains | Reinforcement_Learning, Policy_Optimization, Deep_Learning |
| Last Updated | 2026-02-07 14:00 GMT |
Overview
A clipped surrogate objective that constrains policy updates to a trust region, preventing destructively large steps during reinforcement learning training.
Description
Policy Loss Optimization in verl implements the PPO clipped surrogate objective, which is the central optimization target for both PPO and GRPO training. The clipping mechanism ensures that the policy does not change too dramatically in a single update, maintaining training stability.
The key idea is to compute the ratio of new policy probabilities to old policy probabilities for each token, then clip this ratio to a narrow range (typically [1-epsilon, 1+epsilon]). The final loss is the minimum of the unclipped and clipped objectives, which creates a pessimistic bound that penalizes large deviations from the old policy.
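As a concrete illustration of that pessimistic min, the sketch below computes the per-token objective in NumPy; the function name, epsilon, and values are illustrative, not verl's API:

```python
import numpy as np

def clipped_surrogate(ratio, advantage, eps=0.2):
    """Pessimistic (min) of the unclipped and clipped objectives for one token."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1 - eps, 1 + eps) * advantage
    return np.minimum(unclipped, clipped)

# Positive advantage: the gain from pushing the ratio past 1 + eps is clipped away.
print(clipped_surrogate(1.5, 2.0))   # 1.2 * 2.0 = 2.4, not 3.0
# Negative advantage: the min keeps the worse (more negative) unclipped term.
print(clipped_surrogate(1.5, -2.0))  # -3.0, the unclipped value
```

The min makes the bound one-sided: clipping is ignored only when the unclipped term is worse, which is exactly what penalizes large deviations from the old policy.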
This principle also encompasses optional extensions:
- KL divergence penalty: An additional loss term penalizing divergence from a reference policy
- Dual-clip PPO: An extension that applies a second, lower clip when advantages are negative, bounding the loss so that a very large probability ratio on a bad action cannot produce destabilizingly large gradients
- Entropy bonus: Encouraging exploration by rewarding higher entropy in the policy distribution
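For the entropy bonus, a minimal sketch of per-token entropy computed from vocabulary logits, using a generic numerically-stable log-softmax (not verl's exact implementation):

```python
import numpy as np

def entropy_bonus(logits):
    """Mean per-token entropy of the policy distribution over the vocabulary."""
    # Stable log-softmax: shift by the max before exponentiating.
    z = logits - logits.max(axis=-1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    probs = np.exp(log_probs)
    return -(probs * log_probs).sum(axis=-1).mean()
```

Subtracting `entropy_coef * entropy_bonus(logits)` from the loss rewards flatter output distributions and thus exploration; a uniform distribution over V tokens attains the maximum entropy log(V).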
Usage
Use this loss function whenever performing policy gradient updates in RL-based LLM training. It is the core optimization step in both GRPO and PPO workflows, applied after advantage estimation and before model weight updates.
The policy loss is computed on mini-batches of rollout data and backpropagated through the actor model.
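A sketch of that per-token computation over a mini-batch, assuming a `response_mask` that is 1 on generated tokens and 0 on prompt/padding positions (the names and signature here are illustrative, not verl's exact API):

```python
import numpy as np

def masked_policy_loss(new_log_probs, old_log_probs, advantages,
                       response_mask, cliprange=0.2):
    """Token-level clipped loss, averaged only over valid response tokens."""
    ratio = np.exp(new_log_probs - old_log_probs)   # importance ratio in log-space
    surr1 = ratio * advantages
    surr2 = np.clip(ratio, 1 - cliprange, 1 + cliprange) * advantages
    token_loss = -np.minimum(surr1, surr2)
    # Prompt and padding positions contribute nothing to the mean.
    return (token_loss * response_mask).sum() / response_mask.sum()
```

In an actual training loop this scalar would be backpropagated through the actor model; masking matters because advantages on prompt tokens are undefined.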
Theoretical Basis
The PPO clipped surrogate objective is:

$$L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \mathrm{clip}\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right]$$

Where:
- $r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}$ is the probability ratio (computed in log-space as $r_t(\theta) = \exp(\log \pi_\theta - \log \pi_{\theta_{\mathrm{old}}})$)
- $\hat{A}_t$ is the estimated advantage (from GAE or GRPO)
- $\epsilon$ is the clip range (typically 0.2)
Optional extensions:
Dual-clip PPO (applied when $\hat{A}_t < 0$):

$$L^{\mathrm{dual}}(\theta) = \max\left(\min\left(r_t(\theta)\,\hat{A}_t,\ \mathrm{clip}\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right),\ c\,\hat{A}_t\right)$$

Where $c > 1$ is the dual-clip coefficient (typically 5.0).
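A single-token sketch of the dual-clip rule (function name and signature are ours, not verl's):

```python
import numpy as np

def dual_clip_loss(ratio, advantage, eps=0.2, clip_c=5.0):
    """Dual-clip PPO loss for one token (loss convention: lower is better)."""
    surr1 = ratio * advantage
    surr2 = np.clip(ratio, 1 - eps, 1 + eps) * advantage
    pessimistic = np.minimum(surr1, surr2)          # standard PPO objective
    # For negative advantages, bound the objective from below by c * A so an
    # already-inflated ratio cannot blow up the loss and the gradient.
    dual_clipped = np.where(advantage < 0,
                            np.maximum(pessimistic, clip_c * advantage),
                            pessimistic)
    return -dual_clipped
```

For example, with `ratio = 10` and `advantage = -1` the standard objective would be -10, but the dual clip caps it at `c * A = -5`, so the loss is 5 instead of 10; positive advantages are unaffected.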
KL penalty (added to the loss being minimized):

$$L_{\mathrm{total}}(\theta) = -L^{\mathrm{CLIP}}(\theta) + \beta\, D_{\mathrm{KL}}\left(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\right)$$
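The per-token KL term is typically estimated from sampled log-probabilities rather than computed exactly. The sketch below shows the naive ("k1") and a low-variance, non-negative ("k3") estimator in the style of Schulman's approximating-KL note; the function name and signature are assumptions, not verl's exact API:

```python
import numpy as np

def kl_penalty(log_probs, ref_log_probs, kind="k3"):
    """Per-token KL estimate between the current policy and a reference policy.
    Both inputs are log-probabilities of the sampled tokens."""
    log_ratio = ref_log_probs - log_probs           # log(pi_ref / pi_theta)
    if kind == "k1":
        return -log_ratio                           # naive: log pi - log pi_ref
    # k3: exp(x) - 1 - x with x = log(pi_ref / pi_theta); always >= 0
    return np.exp(log_ratio) - 1.0 - log_ratio
```

The k3 form is popular in RLHF-style training because it is unbiased enough in practice and never negative, which keeps the penalty from accidentally rewarding divergence on individual tokens.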
Pseudo-code:
# Abstract policy loss computation (per-token, then averaged over valid tokens)
ratio = exp(new_log_probs - old_log_probs)      # importance ratio, computed in log-space
surr1 = ratio * advantages                      # unclipped objective
surr2 = clip(ratio, 1 - cliprange, 1 + cliprange) * advantages  # clipped objective
policy_loss = -min(surr1, surr2).mean()         # pessimistic bound, negated for minimization

# Optional: add KL penalty against the reference policy
if use_kl_loss:
    policy_loss += kl_coef * kl_divergence(new_policy, ref_policy)