Implementation:Volcengine Verl Compute Policy Loss

From Leeroopedia


Knowledge Sources
Domains: Reinforcement_Learning, Policy_Optimization
Last Updated: 2026-02-07 14:00 GMT

Overview

Concrete tool for computing the PPO clipped surrogate policy loss with support for dual-clip and high-ratio clipping, provided by the verl library.

Description

The compute_policy_loss function computes the clipped policy gradient objective used in Proximal Policy Optimization (PPO). It calculates the probability ratio between the current and old policies, applies standard PPO clipping with configurable asymmetric clip ranges, and additionally supports dual-clip PPO (lower-bounding the ratio for negative advantages). The function returns the aggregated policy loss, clip fractions for monitoring, and an approximate KL divergence metric.
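The clipping logic can be sketched independently of verl. The following is an illustrative per-token reimplementation of the dual-clip objective, not the library's code (verl's version additionally applies the response mask and the configured aggregation mode):

```python
import torch

def dual_clip_ppo_loss(old_log_prob, log_prob, advantages,
                       cliprange_low=0.2, cliprange_high=0.2,
                       clip_ratio_c=3.0):
    """Per-token dual-clip PPO loss (illustrative sketch, not verl's code)."""
    # Probability ratio r = pi_new(a|s) / pi_old(a|s), computed in log space
    ratio = torch.exp(log_prob - old_log_prob)
    # Standard PPO clipped surrogate, negated because we minimize
    loss_unclipped = -advantages * ratio
    loss_clipped = -advantages * torch.clamp(
        ratio, 1.0 - cliprange_low, 1.0 + cliprange_high
    )
    clip_loss = torch.max(loss_unclipped, loss_clipped)
    # Dual-clip (Ye et al., 2020): when the advantage is negative, also cap
    # the loss at -clip_ratio_c * A so a very large ratio cannot dominate
    # the update.
    dual_clip_loss = torch.min(clip_loss, -clip_ratio_c * advantages)
    return torch.where(advantages < 0, dual_clip_loss, clip_loss)
```

With the ratio at 1 the loss reduces to the vanilla policy gradient term; the dual-clip branch only activates for negative advantages combined with a large ratio.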

Usage

This function is called during the actor update step of the PPO training loop. It receives log-probabilities from both the old (rollout-time) policy and the current policy, along with advantage estimates from any supported advantage estimator (GAE, GRPO, etc.). Note that the function carries a @deprecated decorator pointing to compute_policy_loss_vanilla (visible in the signature below), so new code should prefer the replacement.

Code Reference

Source Location

  • Repository: verl
  • File: verl/trainer/ppo/core_algos.py
  • Lines: 1084-1156

Signature

@deprecated("verl.trainer.ppo.core_algos.compute_policy_loss_vanilla")
def compute_policy_loss(
    old_log_prob,
    log_prob,
    advantages,
    response_mask,
    cliprange=None,
    cliprange_low=None,
    cliprange_high=None,
    clip_ratio_c=3.0,
    loss_agg_mode: str = "token-mean",
):
    """
    Compute the clipped policy objective and related metrics for PPO.

    Args:
        old_log_prob (torch.Tensor):
            Log-probabilities under the old policy (batch_size, response_length).
        log_prob (torch.Tensor):
            Log-probabilities under the current policy (batch_size, response_length).
        advantages (torch.Tensor):
            Advantage estimates (batch_size, response_length).
        response_mask (torch.Tensor):
            Mask for valid response tokens (batch_size, response_length).
        cliprange (float, optional):
            Clipping parameter epsilon for standard PPO.
        cliprange_low (float, optional):
            Lower clip range for dual-clip PPO. Defaults to cliprange.
        cliprange_high (float, optional):
            Upper clip range for dual-clip PPO. Defaults to cliprange.
        clip_ratio_c (float, optional):
            Lower bound of ratio for dual-clip PPO. Defaults to 3.0.
        loss_agg_mode (str, optional):
            Aggregation mode for the loss. Defaults to "token-mean".

    Returns:
        pg_loss: Aggregated policy gradient loss (scalar).
        pg_clipfrac: Fraction of tokens where standard clipping was active.
        ppo_kl: Approximate KL divergence between old and current policy.
        pg_clipfrac_lower: Fraction of tokens where dual-clip lower bound was active.
    """

Import

from verl.trainer.ppo.core_algos import compute_policy_loss

I/O Contract

Inputs

| Name | Type | Required | Description |
|------|------|----------|-------------|
| old_log_prob | torch.Tensor | Yes | Log-probabilities of actions under the old (rollout) policy, shape (batch_size, response_length) |
| log_prob | torch.Tensor | Yes | Log-probabilities of actions under the current policy, shape (batch_size, response_length) |
| advantages | torch.Tensor | Yes | Advantage estimates for each token, shape (batch_size, response_length) |
| response_mask | torch.Tensor | Yes | Binary mask for valid response tokens, shape (batch_size, response_length) |
| cliprange | Optional[float] | No | Standard PPO clipping epsilon (must be provided in practice) |
| cliprange_low | Optional[float] | No | Lower clip range for asymmetric clipping (defaults to cliprange) |
| cliprange_high | Optional[float] | No | Upper clip range for asymmetric clipping (defaults to cliprange) |
| clip_ratio_c | float | No | Dual-clip lower bound ratio (default: 3.0, must be > 1.0) |
| loss_agg_mode | str | No | Loss aggregation mode (default: "token-mean") |

Outputs

| Name | Type | Description |
|------|------|-------------|
| pg_loss | torch.Tensor | Aggregated clipped policy gradient loss (scalar tensor) |
| pg_clipfrac | torch.Tensor | Fraction of tokens where standard PPO clipping was active |
| ppo_kl | torch.Tensor | Approximate KL divergence between old and current policy |
| pg_clipfrac_lower | torch.Tensor | Fraction of tokens where the dual-clip lower bound was active |
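The default loss_agg_mode, "token-mean", can be sketched as a single masked average over every valid token in the batch. This is an assumed reading of the mode name, not verl's exact code; the library also supports sequence-level aggregation modes:

```python
import torch

def token_mean(loss_mat, response_mask):
    # "token-mean": one global average over all valid tokens in the batch,
    # so longer responses contribute more tokens than shorter ones.
    return (loss_mat * response_mask).sum() / response_mask.sum()
```

The practical consequence is that token-mean weights every token equally, whereas a sequence-mean mode would weight every response equally regardless of its length.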

Usage Examples

import torch
from verl.trainer.ppo.core_algos import compute_policy_loss

batch_size = 8
response_length = 128

# Log-probabilities from rollout (old policy) and current policy
old_log_prob = torch.randn(batch_size, response_length) * 0.5 - 2.0
log_prob = old_log_prob + torch.randn(batch_size, response_length) * 0.05

# Advantages from GRPO or GAE
advantages = torch.randn(batch_size, response_length)

# Response mask
response_mask = torch.ones(batch_size, response_length)

pg_loss, pg_clipfrac, ppo_kl, pg_clipfrac_lower = compute_policy_loss(
    old_log_prob=old_log_prob,
    log_prob=log_prob,
    advantages=advantages,
    response_mask=response_mask,
    cliprange=0.2,
    cliprange_low=0.2,
    cliprange_high=0.28,
    clip_ratio_c=3.0,
    loss_agg_mode="token-mean",
)

# pg_loss is the scalar loss to backpropagate
# pg_clipfrac and ppo_kl are monitoring metrics

Related Pages

Implements Principle

Environment Requirements

Heuristics Used
