Implementation:Volcengine Verl Compute Policy Loss
| Knowledge Sources | |
|---|---|
| Domains | Reinforcement_Learning, Policy_Optimization |
| Last Updated | 2026-02-07 14:00 GMT |
Overview
Concrete tool for computing the PPO clipped surrogate policy loss with support for dual-clip and high-ratio clipping, provided by the verl library.
Description
The compute_policy_loss function computes the clipped policy-gradient objective used in Proximal Policy Optimization (PPO). It calculates the probability ratio between the current and old policies, applies standard PPO clipping with configurable asymmetric clip ranges, and additionally supports dual-clip PPO, which bounds the loss for negative-advantage tokens so that very large ratios cannot dominate the update. The function returns the aggregated policy loss, clip fractions for monitoring, and an approximate KL divergence metric.
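The objective described above can be sketched in a few lines of plain PyTorch. This is a hedged re-derivation for illustration, not verl's source; the function name `dual_clip_pg_loss_sketch` and the fixed token-mean aggregation are assumptions made here:

```python
import torch

def dual_clip_pg_loss_sketch(old_log_prob, log_prob, advantages, response_mask,
                             cliprange_low=0.2, cliprange_high=0.2, clip_ratio_c=3.0):
    """Illustrative dual-clip PPO loss (token-mean aggregation); not verl's code."""
    # Probability ratio r = pi_theta(a|s) / pi_theta_old(a|s)
    ratio = torch.exp(log_prob - old_log_prob)
    # Standard PPO loss: max(-r*A, -clip(r, 1-eps_low, 1+eps_high)*A)
    pg_losses1 = -advantages * ratio
    pg_losses2 = -advantages * torch.clamp(ratio, 1.0 - cliprange_low, 1.0 + cliprange_high)
    clipped = torch.maximum(pg_losses1, pg_losses2)
    # Dual-clip: for negative advantages, cap the loss at -clip_ratio_c * A
    # so an extreme ratio cannot blow up the update
    dual_clipped = torch.minimum(clipped, -clip_ratio_c * advantages)
    pg_losses = torch.where(advantages < 0, dual_clipped, clipped)
    # token-mean: average over valid response tokens
    return (pg_losses * response_mask).sum() / response_mask.sum()
```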
Usage
This function is called during the actor update step of the PPO training loop. It receives log-probabilities from both the old (rollout-time) and current policy, along with advantage estimates from any supported advantage estimator (GAE, GRPO, etc.).
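As a hedged illustration of where this call sits in the training loop, a generic PyTorch actor-update step might look like the following (all names here, such as `actor_update_step`, `policy_model`, and the batch keys, are illustrative stand-ins, not verl's worker API):

```python
import torch

def actor_update_step(policy_model, optimizer, batch, policy_loss_fn, cliprange=0.2):
    """Sketch of one PPO actor update; policy_loss_fn follows compute_policy_loss's signature."""
    logits = policy_model(batch["input_ids"])            # (batch, response_length, vocab)
    log_probs = torch.log_softmax(logits, dim=-1)
    # Gather the current policy's log-prob of each sampled response token
    log_prob = log_probs.gather(-1, batch["responses"].unsqueeze(-1)).squeeze(-1)
    pg_loss, pg_clipfrac, ppo_kl, pg_clipfrac_lower = policy_loss_fn(
        old_log_prob=batch["old_log_probs"],   # cached at rollout time
        log_prob=log_prob,
        advantages=batch["advantages"],        # e.g. from GAE or GRPO
        response_mask=batch["response_mask"],
        cliprange=cliprange,
    )
    optimizer.zero_grad()
    pg_loss.backward()
    optimizer.step()
    return {"pg_loss": pg_loss.item(), "ppo_kl": ppo_kl.item(),
            "pg_clipfrac": pg_clipfrac.item()}
```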
Code Reference
Source Location
- Repository: verl
- File: verl/trainer/ppo/core_algos.py
- Lines: 1084-1156
Signature
```python
@deprecated("verl.trainer.ppo.core_algos.compute_policy_loss_vanilla")
def compute_policy_loss(
    old_log_prob,
    log_prob,
    advantages,
    response_mask,
    cliprange=None,
    cliprange_low=None,
    cliprange_high=None,
    clip_ratio_c=3.0,
    loss_agg_mode: str = "token-mean",
):
    """
    Compute the clipped policy objective and related metrics for PPO.

    Args:
        old_log_prob (torch.Tensor):
            Log-probabilities under the old policy (batch_size, response_length).
        log_prob (torch.Tensor):
            Log-probabilities under the current policy (batch_size, response_length).
        advantages (torch.Tensor):
            Advantage estimates (batch_size, response_length).
        response_mask (torch.Tensor):
            Mask for valid response tokens (batch_size, response_length).
        cliprange (float, optional):
            Clipping parameter epsilon for standard PPO.
        cliprange_low (float, optional):
            Lower clip range for asymmetric clipping. Defaults to cliprange.
        cliprange_high (float, optional):
            Upper clip range for asymmetric clipping. Defaults to cliprange.
        clip_ratio_c (float, optional):
            Lower bound of the ratio for dual-clip PPO. Defaults to 3.0.
        loss_agg_mode (str, optional):
            Aggregation mode for the loss. Defaults to "token-mean".

    Returns:
        pg_loss: Aggregated policy gradient loss (scalar).
        pg_clipfrac: Fraction of tokens where standard clipping was active.
        ppo_kl: Approximate KL divergence between old and current policy.
        pg_clipfrac_lower: Fraction of tokens where the dual-clip lower bound was active.
    """
```
Import
```python
from verl.trainer.ppo.core_algos import compute_policy_loss
```
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| old_log_prob | torch.Tensor | Yes | Log-probabilities of actions under the old (rollout) policy (batch_size, response_length) |
| log_prob | torch.Tensor | Yes | Log-probabilities of actions under the current policy (batch_size, response_length) |
| advantages | torch.Tensor | Yes | Advantage estimates for each token (batch_size, response_length) |
| response_mask | torch.Tensor | Yes | Binary mask for valid response tokens (batch_size, response_length) |
| cliprange | Optional[float] | No | Standard PPO clipping epsilon (must be provided in practice) |
| cliprange_low | Optional[float] | No | Lower clip range for asymmetric clipping (defaults to cliprange) |
| cliprange_high | Optional[float] | No | Upper clip range for asymmetric clipping (defaults to cliprange) |
| clip_ratio_c | float | No | Dual-clip lower bound ratio (default: 3.0, must be > 1.0) |
| loss_agg_mode | str | No | Loss aggregation mode (default: "token-mean") |
Outputs
| Name | Type | Description |
|---|---|---|
| pg_loss | torch.Tensor | Aggregated clipped policy gradient loss (scalar tensor) |
| pg_clipfrac | torch.Tensor | Fraction of tokens where standard PPO clipping was active |
| ppo_kl | torch.Tensor | Approximate KL divergence between old and current policy |
| pg_clipfrac_lower | torch.Tensor | Fraction of tokens where dual-clip lower bound was active |
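The effect of `loss_agg_mode` can be illustrated with a toy loss matrix. The mode names mirror the strings verl accepts, but the arithmetic below is a hand-rolled sketch of the aggregation semantics, not verl's aggregation code:

```python
import torch

# Two sequences: one with 4 valid tokens, one with 1 valid token
loss_mat = torch.tensor([[1.0, 1.0, 1.0, 1.0],
                         [2.0, 0.0, 0.0, 0.0]])
mask = torch.tensor([[1.0, 1.0, 1.0, 1.0],
                     [1.0, 0.0, 0.0, 0.0]])

# "token-mean": every valid token weighs equally across the whole batch
token_mean = (loss_mat * mask).sum() / mask.sum()            # (4*1 + 2) / 5 = 1.2

# "seq-mean-token-mean": average within each sequence first, then across
# sequences, so short sequences weigh as much as long ones
per_seq = (loss_mat * mask).sum(dim=-1) / mask.sum(dim=-1)   # [1.0, 2.0]
seq_mean_token_mean = per_seq.mean()                         # 1.5
```

The two modes diverge exactly when sequence lengths vary, which is why the choice matters for variable-length LLM responses.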
Usage Examples
```python
import torch
from verl.trainer.ppo.core_algos import compute_policy_loss

batch_size = 8
response_length = 128

# Log-probabilities from rollout (old policy) and current policy
old_log_prob = torch.randn(batch_size, response_length) * 0.5 - 2.0
log_prob = old_log_prob + torch.randn(batch_size, response_length) * 0.05

# Advantages from GRPO or GAE
advantages = torch.randn(batch_size, response_length)

# Response mask
response_mask = torch.ones(batch_size, response_length)

pg_loss, pg_clipfrac, ppo_kl, pg_clipfrac_lower = compute_policy_loss(
    old_log_prob=old_log_prob,
    log_prob=log_prob,
    advantages=advantages,
    response_mask=response_mask,
    cliprange=0.2,
    cliprange_low=0.2,
    cliprange_high=0.28,
    clip_ratio_c=3.0,
    loss_agg_mode="token-mean",
)

# pg_loss is the scalar loss to backpropagate
# pg_clipfrac and ppo_kl are monitoring metrics
```
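The `ppo_kl` metric is the simple first-order KL estimator, i.e. the masked mean of `old_log_prob - log_prob`. The standalone sketch below shows the computation for illustration (hand-rolled here, not quoted from verl's source):

```python
import torch

old_log_prob = torch.tensor([[-2.0, -2.0, -2.0]])
log_prob = torch.tensor([[-2.1, -1.9, -2.0]])
response_mask = torch.tensor([[1.0, 1.0, 0.0]])  # last token masked out

# k1 estimator: E[log pi_old - log pi] over valid tokens
ppo_kl = ((old_log_prob - log_prob) * response_mask).sum() / response_mask.sum()
# the two valid tokens' contributions (0.1 and -0.1) cancel, so ppo_kl is 0.0
```

This estimator is cheap and unbiased but can go negative on individual batches; in practice it is watched as a trend to detect the policy drifting too far from the rollout policy.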
Related Pages
Implements Principle
Environment Requirements
- Environment:Volcengine_Verl_CUDA_GPU_Environment
- Environment:Volcengine_Verl_Megatron_Core_Environment