
Principle:Volcengine Verl Policy Loss Optimization

From Leeroopedia


Knowledge Sources
Domains: Reinforcement_Learning, Policy_Optimization, Deep_Learning
Last Updated: 2026-02-07 14:00 GMT

Overview

A clipped surrogate objective that constrains policy updates to a trust region, preventing destructively large steps during reinforcement learning training.

Description

Policy Loss Optimization in verl implements the PPO clipped surrogate objective, which is the central optimization target for both PPO and GRPO training. The clipping mechanism ensures that the policy does not change too dramatically in a single update, maintaining training stability.

The key idea is to compute the ratio of new policy probabilities to old policy probabilities for each token, then clip this ratio to a narrow range (typically [1-epsilon, 1+epsilon]). The final loss is the minimum of the unclipped and clipped objectives, which creates a pessimistic bound that penalizes large deviations from the old policy.
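The clipping behavior can be illustrated with a small numeric sketch (scalar values and the helper name are hypothetical, for illustration only):

```python
# Illustrative scalar version of the clipped surrogate objective.
def clipped_surrogate(ratio, advantage, eps=0.2):
    unclipped = ratio * advantage
    # Clamp the ratio to [1 - eps, 1 + eps] before weighting by the advantage.
    clipped = max(1 - eps, min(ratio, 1 + eps)) * advantage
    # Taking the min of the two terms gives the pessimistic bound.
    return min(unclipped, clipped)

# Positive advantage: gains from pushing the ratio above 1 + eps are capped.
clipped_surrogate(1.5, 2.0)   # -> 2.4 (1.2 * 2.0), not 3.0
# Negative advantage: the min keeps the full, worse unclipped term.
clipped_surrogate(1.5, -2.0)  # -> -3.0
```

Note how the objective is asymmetric: clipping limits how much credit a large ratio can earn, but does not hide how much a large ratio can hurt.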

This principle also encompasses optional extensions:

  • KL divergence penalty: An additional loss term penalizing divergence from a reference policy
  • Dual-clip PPO: An extension that also bounds the objective from below when advantages are negative, preventing a single very large probability ratio on a disadvantaged action from dominating the update
  • Entropy bonus: Encouraging exploration by rewarding higher entropy in the policy distribution
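As a minimal illustration of the entropy term (the helper below is ours, not verl's API), the per-token entropy of a categorical distribution:

```python
import math

# Shannon entropy of a categorical distribution (illustrative helper).
def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)

# A uniform distribution has maximal entropy; a peaked one has less.
uniform = entropy([0.25, 0.25, 0.25, 0.25])  # log(4) ≈ 1.386
peaked = entropy([0.97, 0.01, 0.01, 0.01])
# An entropy bonus subtracts entropy_coef * entropy from the loss,
# so the flatter (more exploratory) distribution is rewarded.
```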

Usage

Use this loss function whenever performing policy gradient updates in RL-based LLM training. It is the core optimization step in both GRPO and PPO workflows, applied after advantage estimation and before model weight updates.

The policy loss is computed on mini-batches of rollout data and backpropagated through the actor model.

Theoretical Basis

The PPO clipped surrogate objective is:

L^CLIP(θ) = Ê_t[ min( r_t(θ) Â_t , clip(r_t(θ), 1 − ϵ, 1 + ϵ) Â_t ) ]

Where:

  • r_t(θ) = π_θ(a_t|s_t) / π_θold(a_t|s_t) is the probability ratio (computed in log-space as exp(log π_new − log π_old))
  • Â_t is the estimated advantage (from GAE or GRPO)
  • ϵ is the clip range (typically 0.2)
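A minimal sketch of the log-space ratio computation (the function name is illustrative):

```python
import math

# Per-token probability ratio from log-probabilities. Working in
# log-space avoids underflow from very small token probabilities.
def prob_ratio(new_log_prob, old_log_prob):
    return math.exp(new_log_prob - old_log_prob)

# Identical log-probs give a ratio of exactly 1 (no update pressure);
# a higher new log-prob gives a ratio above 1.
```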

Optional extensions:

Dual-clip PPO: L^DUAL(θ) = max( L^CLIP(θ), c Â_t ) when Â_t < 0

Where c is the dual-clip coefficient (typically 5.0).
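A scalar sketch of the dual-clip rule (helper names are ours; verl operates on tensors, not scalars):

```python
# Illustrative scalar dual-clip PPO loss.
def dual_clip_loss(ratio, advantage, eps=0.2, clip_c=5.0):
    # Standard clipped surrogate objective (to be maximized).
    surr = min(ratio * advantage,
               max(1 - eps, min(ratio, 1 + eps)) * advantage)
    if advantage < 0:
        # Lower-bound the surrogate at c * A so a huge ratio on a
        # disadvantaged action cannot dominate the update.
        surr = max(surr, clip_c * advantage)
    return -surr  # loss = negative objective

# ratio = 10 on a bad action: the surrogate is floored at c * A = -5,
# so the loss is 5 rather than 10.
dual_clip_loss(10.0, -1.0)  # -> 5.0
```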

KL penalty: L^KL = β D_KL(π_θ ‖ π_ref)
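The KL term is typically estimated per token from sampled log-probabilities rather than computed exactly. The sketch below uses the low-variance "k3" estimator popularized by Schulman's note on approximating KL; treat it as one plausible formulation, not necessarily verl's default:

```python
import math

# Per-token KL penalty via the "k3" estimator: for r = p_ref / p_theta,
# k3 = (r - 1) - log r, which is nonnegative and low-variance.
def kl_penalty(log_p_theta, log_p_ref, beta=0.001):
    log_ratio = log_p_ref - log_p_theta
    kl_est = math.exp(log_ratio) - log_ratio - 1.0
    return beta * kl_est

# Identical policies incur zero penalty; any divergence is penalized.
```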

Pseudo-code:

# Abstract policy loss computation (per-token, then averaged over the batch)
ratio = exp(new_log_probs - old_log_probs)          # r_t(θ), computed in log-space
surr1 = ratio * advantages                          # unclipped objective
surr2 = clip(ratio, 1 - cliprange, 1 + cliprange) * advantages  # clipped objective
policy_loss = -min(surr1, surr2).mean()             # negate: optimizers minimize
# Optional: add a KL penalty against the reference policy
if use_kl_loss:
    policy_loss += kl_coef * kl_divergence(new_policy, ref_policy)
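In practice the per-token losses are aggregated under a response mask so that padding tokens do not contribute to the mean. A plain-Python sketch of that aggregation (the helper name is ours):

```python
# Masked mean over per-token losses: padded positions (mask == 0)
# are excluded from both the sum and the count.
def masked_mean(values, mask):
    total = sum(v * m for v, m in zip(values, mask))
    count = sum(mask)
    return total / count

# Only the first two tokens are real; the padded 100.0 is ignored.
masked_mean([0.5, 1.5, 100.0], [1, 1, 0])  # -> 1.0
```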

Related Pages

Implemented By

Heuristics Used
