
Principle:Volcengine Verl Policy Loss Optimization

From Leeroopedia


Knowledge Sources
Domains: Reinforcement_Learning, Policy_Optimization, Deep_Learning
Last Updated: 2026-02-07 14:00 GMT

Overview

A clipped surrogate objective that constrains policy updates to a trust region, preventing destructively large steps during reinforcement learning training.

Description

Policy Loss Optimization in verl implements the PPO clipped surrogate objective, which is the central optimization target for both PPO and GRPO training. The clipping mechanism ensures that the policy does not change too dramatically in a single update, maintaining training stability.

The key idea is to compute the ratio of new policy probabilities to old policy probabilities for each token, then clip this ratio to a narrow range (typically [1-epsilon, 1+epsilon]). The final loss is the minimum of the unclipped and clipped objectives, which creates a pessimistic bound that penalizes large deviations from the old policy.
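The clipping behavior can be illustrated with a small numeric sketch (scalar values and the helper name are hypothetical, for illustration only):

```python
# Illustrative scalar version of the clipped surrogate objective.
def clipped_surrogate(ratio, advantage, eps=0.2):
    unclipped = ratio * advantage
    # Clamp the ratio to [1 - eps, 1 + eps] before weighting by the advantage.
    clipped = max(1 - eps, min(ratio, 1 + eps)) * advantage
    # Taking the min of the two terms gives the pessimistic bound.
    return min(unclipped, clipped)

# Positive advantage: gains from pushing the ratio above 1 + eps are capped.
clipped_surrogate(1.5, 2.0)   # -> 2.4 (1.2 * 2.0), not 3.0
# Negative advantage: the min keeps the full, worse unclipped term.
clipped_surrogate(1.5, -2.0)  # -> -3.0
```

Note how the objective is asymmetric: clipping limits how much credit a large ratio can earn, but does not hide how much a large ratio can hurt.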

This principle also encompasses optional extensions:

  • KL divergence penalty: An additional loss term penalizing divergence from a reference policy
  • Dual-clip PPO: An extension that also bounds the objective from below when advantages are negative, preventing a single very large probability ratio on a disadvantaged action from dominating the update
  • Entropy bonus: Encouraging exploration by rewarding higher entropy in the policy distribution
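As a minimal illustration of the entropy term (the helper below is ours, not verl's API), the per-token entropy of a categorical distribution:

```python
import math

# Shannon entropy of a categorical distribution (illustrative helper).
def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)

# A uniform distribution has maximal entropy; a peaked one has less.
uniform = entropy([0.25, 0.25, 0.25, 0.25])  # log(4) ≈ 1.386
peaked = entropy([0.97, 0.01, 0.01, 0.01])
# An entropy bonus subtracts entropy_coef * entropy from the loss,
# so the flatter (more exploratory) distribution is rewarded.
```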

Usage

Use this loss function whenever performing policy gradient updates in RL-based LLM training. It is the core optimization step in both GRPO and PPO workflows, applied after advantage estimation and before model weight updates.

The policy loss is computed on mini-batches of rollout data and backpropagated through the actor model.

Theoretical Basis

The PPO clipped surrogate objective is:

L^CLIP(θ) = Ê_t[ min( r_t(θ) Â_t , clip(r_t(θ), 1 − ϵ, 1 + ϵ) Â_t ) ]

Where:

  • r_t(θ) = π_θ(a_t|s_t) / π_θold(a_t|s_t) is the probability ratio (computed in log-space as exp(log π_new − log π_old))
  • Â_t is the estimated advantage (from GAE or GRPO)
  • ϵ is the clip range (typically 0.2)
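A minimal sketch of the log-space ratio computation (the function name is illustrative):

```python
import math

# Per-token probability ratio from log-probabilities. Working in
# log-space avoids underflow from very small token probabilities.
def prob_ratio(new_log_prob, old_log_prob):
    return math.exp(new_log_prob - old_log_prob)

# Identical log-probs give a ratio of exactly 1 (no update pressure);
# a higher new log-prob gives a ratio above 1.
```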

Optional extensions:

Dual-clip PPO: L^DUAL(θ) = max( L^CLIP(θ), c Â_t ) when Â_t < 0

Where c is the dual-clip coefficient (typically 5.0).
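A scalar sketch of the dual-clip rule (helper names are ours; verl operates on tensors, not scalars):

```python
# Illustrative scalar dual-clip PPO loss.
def dual_clip_loss(ratio, advantage, eps=0.2, clip_c=5.0):
    # Standard clipped surrogate objective (to be maximized).
    surr = min(ratio * advantage,
               max(1 - eps, min(ratio, 1 + eps)) * advantage)
    if advantage < 0:
        # Lower-bound the surrogate at c * A so a huge ratio on a
        # disadvantaged action cannot dominate the update.
        surr = max(surr, clip_c * advantage)
    return -surr  # loss = negative objective

# ratio = 10 on a bad action: the surrogate is floored at c * A = -5,
# so the loss is 5 rather than 10.
dual_clip_loss(10.0, -1.0)  # -> 5.0
```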

KL penalty: L^KL = β D_KL(π_θ ‖ π_ref)
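The KL term is typically estimated per token from sampled log-probabilities rather than computed exactly. The sketch below uses the low-variance "k3" estimator popularized by Schulman's note on approximating KL; treat it as one plausible formulation, not necessarily verl's default:

```python
import math

# Per-token KL penalty via the "k3" estimator: for r = p_ref / p_theta,
# k3 = (r - 1) - log r, which is nonnegative and low-variance.
def kl_penalty(log_p_theta, log_p_ref, beta=0.001):
    log_ratio = log_p_ref - log_p_theta
    kl_est = math.exp(log_ratio) - log_ratio - 1.0
    return beta * kl_est

# Identical policies incur zero penalty; any divergence is penalized.
```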

Pseudo-code:

# Abstract policy loss computation (per-token, then averaged over the batch)
ratio = exp(new_log_probs - old_log_probs)          # r_t(θ), computed in log-space
surr1 = ratio * advantages                          # unclipped objective
surr2 = clip(ratio, 1 - cliprange, 1 + cliprange) * advantages  # clipped objective
policy_loss = -min(surr1, surr2).mean()             # negate: optimizers minimize
# Optional: add a KL penalty against the reference policy
if use_kl_loss:
    policy_loss += kl_coef * kl_divergence(new_policy, ref_policy)
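In practice the per-token losses are aggregated under a response mask so that padding tokens do not contribute to the mean. A plain-Python sketch of that aggregation (the helper name is ours):

```python
# Masked mean over per-token losses: padded positions (mask == 0)
# are excluded from both the sum and the count.
def masked_mean(values, mask):
    total = sum(v * m for v, m in zip(values, mask))
    count = sum(mask)
    return total / count

# Only the first two tokens are real; the padded 100.0 is ignored.
masked_mean([0.5, 1.5, 100.0], [1, 1, 0])  # -> 1.0
```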

Related Pages

Implemented By

Heuristics Used
