Implementation:Volcengine Verl Compute Policy Loss
| Knowledge Sources | |
|---|---|
| Domains | Reinforcement_Learning, Policy_Optimization |
| Last Updated | 2026-02-07 14:00 GMT |
Overview
Concrete tool for computing the PPO clipped surrogate policy loss with support for dual-clip and high-ratio clipping, provided by the verl library.
Description
The compute_policy_loss function computes the clipped policy-gradient objective used in Proximal Policy Optimization (PPO). It calculates the probability ratio between the current and old policies, applies standard PPO clipping with configurable asymmetric clip ranges, and additionally supports dual-clip PPO, which bounds the loss for negative-advantage tokens so that very large ratios cannot dominate the update. The function returns the aggregated policy loss, clip fractions for monitoring, and an approximate KL divergence metric.
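The objective described above can be sketched in a few lines of plain PyTorch. This is a hedged re-derivation for illustration, not verl's source; the function name `dual_clip_pg_loss_sketch` and the fixed token-mean aggregation are assumptions made here:

```python
import torch

def dual_clip_pg_loss_sketch(old_log_prob, log_prob, advantages, response_mask,
                             cliprange_low=0.2, cliprange_high=0.2, clip_ratio_c=3.0):
    """Illustrative dual-clip PPO loss (token-mean aggregation); not verl's code."""
    # Probability ratio r = pi_theta(a|s) / pi_theta_old(a|s)
    ratio = torch.exp(log_prob - old_log_prob)
    # Standard PPO loss: max(-r*A, -clip(r, 1-eps_low, 1+eps_high)*A)
    pg_losses1 = -advantages * ratio
    pg_losses2 = -advantages * torch.clamp(ratio, 1.0 - cliprange_low, 1.0 + cliprange_high)
    clipped = torch.maximum(pg_losses1, pg_losses2)
    # Dual-clip: for negative advantages, cap the loss at -clip_ratio_c * A
    # so an extreme ratio cannot blow up the update
    dual_clipped = torch.minimum(clipped, -clip_ratio_c * advantages)
    pg_losses = torch.where(advantages < 0, dual_clipped, clipped)
    # token-mean: average over valid response tokens
    return (pg_losses * response_mask).sum() / response_mask.sum()
```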
Usage
This function is called during the actor update step of the PPO training loop. It receives log-probabilities from both the old (rollout-time) and current policy, along with advantage estimates from any supported advantage estimator (GAE, GRPO, etc.).
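As a hedged illustration of where this call sits in the training loop, a generic PyTorch actor-update step might look like the following (all names here, such as `actor_update_step`, `policy_model`, and the batch keys, are illustrative stand-ins, not verl's worker API):

```python
import torch

def actor_update_step(policy_model, optimizer, batch, policy_loss_fn, cliprange=0.2):
    """Sketch of one PPO actor update; policy_loss_fn follows compute_policy_loss's signature."""
    logits = policy_model(batch["input_ids"])            # (batch, response_length, vocab)
    log_probs = torch.log_softmax(logits, dim=-1)
    # Gather the current policy's log-prob of each sampled response token
    log_prob = log_probs.gather(-1, batch["responses"].unsqueeze(-1)).squeeze(-1)
    pg_loss, pg_clipfrac, ppo_kl, pg_clipfrac_lower = policy_loss_fn(
        old_log_prob=batch["old_log_probs"],   # cached at rollout time
        log_prob=log_prob,
        advantages=batch["advantages"],        # e.g. from GAE or GRPO
        response_mask=batch["response_mask"],
        cliprange=cliprange,
    )
    optimizer.zero_grad()
    pg_loss.backward()
    optimizer.step()
    return {"pg_loss": pg_loss.item(), "ppo_kl": ppo_kl.item(),
            "pg_clipfrac": pg_clipfrac.item()}
```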
Code Reference
Source Location
- Repository: verl
- File: verl/trainer/ppo/core_algos.py
- Lines: 1084-1156
Signature
```python
@deprecated("verl.trainer.ppo.core_algos.compute_policy_loss_vanilla")
def compute_policy_loss(
    old_log_prob,
    log_prob,
    advantages,
    response_mask,
    cliprange=None,
    cliprange_low=None,
    cliprange_high=None,
    clip_ratio_c=3.0,
    loss_agg_mode: str = "token-mean",
):
    """
    Compute the clipped policy objective and related metrics for PPO.

    Args:
        old_log_prob (torch.Tensor):
            Log-probabilities under the old policy (batch_size, response_length).
        log_prob (torch.Tensor):
            Log-probabilities under the current policy (batch_size, response_length).
        advantages (torch.Tensor):
            Advantage estimates (batch_size, response_length).
        response_mask (torch.Tensor):
            Mask for valid response tokens (batch_size, response_length).
        cliprange (float, optional):
            Clipping parameter epsilon for standard PPO.
        cliprange_low (float, optional):
            Lower clip range for asymmetric clipping. Defaults to cliprange.
        cliprange_high (float, optional):
            Upper clip range for asymmetric clipping. Defaults to cliprange.
        clip_ratio_c (float, optional):
            Lower bound of the ratio for dual-clip PPO. Defaults to 3.0.
        loss_agg_mode (str, optional):
            Aggregation mode for the loss. Defaults to "token-mean".

    Returns:
        pg_loss: Aggregated policy gradient loss (scalar).
        pg_clipfrac: Fraction of tokens where standard clipping was active.
        ppo_kl: Approximate KL divergence between old and current policy.
        pg_clipfrac_lower: Fraction of tokens where the dual-clip lower bound was active.
    """
```
Import
```python
from verl.trainer.ppo.core_algos import compute_policy_loss
```
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| old_log_prob | torch.Tensor | Yes | Log-probabilities of actions under the old (rollout) policy (batch_size, response_length) |
| log_prob | torch.Tensor | Yes | Log-probabilities of actions under the current policy (batch_size, response_length) |
| advantages | torch.Tensor | Yes | Advantage estimates for each token (batch_size, response_length) |
| response_mask | torch.Tensor | Yes | Binary mask for valid response tokens (batch_size, response_length) |
| cliprange | Optional[float] | No | Standard PPO clipping epsilon (must be provided in practice) |
| cliprange_low | Optional[float] | No | Lower clip range for asymmetric clipping (defaults to cliprange) |
| cliprange_high | Optional[float] | No | Upper clip range for asymmetric clipping (defaults to cliprange) |
| clip_ratio_c | float | No | Dual-clip lower bound ratio (default: 3.0, must be > 1.0) |
| loss_agg_mode | str | No | Loss aggregation mode (default: "token-mean") |
Outputs
| Name | Type | Description |
|---|---|---|
| pg_loss | torch.Tensor | Aggregated clipped policy gradient loss (scalar tensor) |
| pg_clipfrac | torch.Tensor | Fraction of tokens where standard PPO clipping was active |
| ppo_kl | torch.Tensor | Approximate KL divergence between old and current policy |
| pg_clipfrac_lower | torch.Tensor | Fraction of tokens where dual-clip lower bound was active |
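The effect of `loss_agg_mode` can be illustrated with a toy loss matrix. The mode names mirror the strings verl accepts, but the arithmetic below is a hand-rolled sketch of the aggregation semantics, not verl's aggregation code:

```python
import torch

# Two sequences: one with 4 valid tokens, one with 1 valid token
loss_mat = torch.tensor([[1.0, 1.0, 1.0, 1.0],
                         [2.0, 0.0, 0.0, 0.0]])
mask = torch.tensor([[1.0, 1.0, 1.0, 1.0],
                     [1.0, 0.0, 0.0, 0.0]])

# "token-mean": every valid token weighs equally across the whole batch
token_mean = (loss_mat * mask).sum() / mask.sum()            # (4*1 + 2) / 5 = 1.2

# "seq-mean-token-mean": average within each sequence first, then across
# sequences, so short sequences weigh as much as long ones
per_seq = (loss_mat * mask).sum(dim=-1) / mask.sum(dim=-1)   # [1.0, 2.0]
seq_mean_token_mean = per_seq.mean()                         # 1.5
```

The two modes diverge exactly when sequence lengths vary, which is why the choice matters for variable-length LLM responses.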
Usage Examples
```python
import torch
from verl.trainer.ppo.core_algos import compute_policy_loss

batch_size = 8
response_length = 128

# Log-probabilities from rollout (old policy) and current policy
old_log_prob = torch.randn(batch_size, response_length) * 0.5 - 2.0
log_prob = old_log_prob + torch.randn(batch_size, response_length) * 0.05

# Advantages from GRPO or GAE
advantages = torch.randn(batch_size, response_length)

# Response mask
response_mask = torch.ones(batch_size, response_length)

pg_loss, pg_clipfrac, ppo_kl, pg_clipfrac_lower = compute_policy_loss(
    old_log_prob=old_log_prob,
    log_prob=log_prob,
    advantages=advantages,
    response_mask=response_mask,
    cliprange=0.2,
    cliprange_low=0.2,
    cliprange_high=0.28,
    clip_ratio_c=3.0,
    loss_agg_mode="token-mean",
)

# pg_loss is the scalar loss to backpropagate
# pg_clipfrac and ppo_kl are monitoring metrics
```
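The `ppo_kl` metric is the simple first-order KL estimator, i.e. the masked mean of `old_log_prob - log_prob`. The standalone sketch below shows the computation for illustration (hand-rolled here, not quoted from verl's source):

```python
import torch

old_log_prob = torch.tensor([[-2.0, -2.0, -2.0]])
log_prob = torch.tensor([[-2.1, -1.9, -2.0]])
response_mask = torch.tensor([[1.0, 1.0, 0.0]])  # last token masked out

# k1 estimator: E[log pi_old - log pi] over valid tokens
ppo_kl = ((old_log_prob - log_prob) * response_mask).sum() / response_mask.sum()
# the two valid tokens' contributions (0.1 and -0.1) cancel, so ppo_kl is 0.0
```

This estimator is cheap and unbiased but can go negative on individual batches; in practice it is watched as a trend to detect the policy drifting too far from the rollout policy.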
Related Pages
Implements Principle
Environment Requirements
- Environment:Volcengine_Verl_CUDA_GPU_Environment
- Environment:Volcengine_Verl_Megatron_Core_Environment