Implementation:Volcengine Verl Compute Value Loss
| Knowledge Sources | |
|---|---|
| Domains | Reinforcement_Learning, Value_Function_Training |
| Last Updated | 2026-02-07 14:00 GMT |
Overview
Concrete tool for computing the clipped value-function loss used to train the critic network in PPO, provided by the verl library.
Description
The compute_value_loss function computes the clipped value-function loss for PPO training. It clips the current value predictions to within ±cliprange_value of the old baseline values, then takes the elementwise maximum of the clipped and unclipped squared errors against the target returns. The final loss is 0.5 times that maximum, aggregated over the tokens selected by response_mask according to loss_agg_mode. Clipping limits how far the value function can move from its previous estimate in a single update, which stabilizes training.
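The computation described above can be sketched in plain PyTorch. This is an illustrative re-implementation, not the verl source: the function name clipped_value_loss_sketch is hypothetical, and the sketch assumes that "token-mean" aggregation is a masked mean over all valid tokens in the batch.

```python
import torch


def clipped_value_loss_sketch(vpreds, returns, values, response_mask, cliprange_value):
    """Illustrative sketch of a clipped value loss (hypothetical helper, not verl code)."""
    # Keep the new prediction within +/- cliprange_value of the old value.
    upper = values + cliprange_value
    lower = values - cliprange_value
    vpreds_clipped = torch.max(torch.min(vpreds, upper), lower)

    # Squared errors against the target returns, with and without clipping.
    loss_unclipped = (vpreds - returns) ** 2
    loss_clipped = (vpreds_clipped - returns) ** 2

    # Pessimistic choice: keep the larger of the two errors per token.
    per_token = torch.max(loss_unclipped, loss_clipped)

    # Assumed "token-mean": average over masked-in tokens only.
    mask = response_mask.to(per_token.dtype)
    n_valid = mask.sum().clamp(min=1)
    vf_loss = 0.5 * (per_token * mask).sum() / n_valid

    # Share of valid tokens where the clipped error is the larger one.
    clipped_wins = (loss_clipped > loss_unclipped).to(per_token.dtype)
    vf_clipfrac = (clipped_wins * mask).sum() / n_valid
    return vf_loss, vf_clipfrac


# Two tokens with the same prediction but different targets.
values = torch.tensor([[0.0, 0.0]])
vpreds = torch.tensor([[0.5, 0.5]])
returns = torch.tensor([[1.0, 0.0]])
mask = torch.ones(1, 2)

vf_loss, vf_clipfrac = clipped_value_loss_sketch(vpreds, returns, values, mask, 0.2)
print(vf_loss.item(), vf_clipfrac.item())
```

With a clip range of 0.2, both predictions are clipped to 0.2. For the first token the clipped error (0.64) exceeds the unclipped error (0.25), so the clipped term is used; for the second token the unclipped error (0.25) is the larger one.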
Usage
This function is called during the critic update step of the PPO training loop, when a learned value function (critic) is used. It applies only when algorithm.adv_estimator=gae; critic-free estimators such as GRPO do not need it.
Code Reference
Source Location
- Repository: verl
- File: verl/trainer/ppo/core_algos.py
- Lines: 1799-1838
Signature
def compute_value_loss(
    vpreds: torch.Tensor,
    returns: torch.Tensor,
    values: torch.Tensor,
    response_mask: torch.Tensor,
    cliprange_value: float,
    loss_agg_mode: str = "token-mean",
):
    """
    Compute the clipped value-function loss for PPO.

    Args:
        vpreds (torch.FloatTensor):
            Predicted values from the value head (batch_size, response_length).
        returns (torch.FloatTensor):
            Ground-truth returns (batch_size, response_length).
        values (torch.FloatTensor):
            Old (baseline) values from the value head (batch_size, response_length).
        response_mask (torch.Tensor):
            Mask indicating which tokens to include (batch_size, response_length).
        cliprange_value (float):
            Clip range for value prediction updates.
        loss_agg_mode (str, optional):
            Aggregation mode for the loss. Defaults to "token-mean".

    Returns:
        vf_loss (torch.FloatTensor):
            Aggregated value-function loss (scalar).
        vf_clipfrac (float):
            Fraction of elements where the clipped loss was used.
    """
Import
from verl.trainer.ppo.core_algos import compute_value_loss
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| vpreds | torch.Tensor | Yes | Current value predictions from the critic (batch_size, response_length) |
| returns | torch.Tensor | Yes | Target returns computed from GAE (batch_size, response_length) |
| values | torch.Tensor | Yes | Old baseline value predictions from the previous iteration (batch_size, response_length) |
| response_mask | torch.Tensor | Yes | Binary mask for valid response tokens (batch_size, response_length) |
| cliprange_value | float | Yes | Clip range controlling how much value predictions can change per update |
| loss_agg_mode | str | No | Loss aggregation mode (default: "token-mean") |
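The role of response_mask in aggregation can be illustrated with a small example. This is a minimal sketch that assumes "token-mean" means averaging over all masked-in tokens in the batch; the helper token_mean is hypothetical and is not part of the verl API.

```python
import torch


def token_mean(per_token_loss, response_mask):
    """Masked mean over all valid tokens (hypothetical helper for illustration)."""
    mask = response_mask.to(per_token_loss.dtype)
    # Guard against an all-zero mask to avoid division by zero.
    return (per_token_loss * mask).sum() / mask.sum().clamp(min=1)


# Padding positions carry a large dummy loss to make their effect visible.
per_token_loss = torch.tensor([[1.0, 3.0, 100.0],
                               [2.0, 100.0, 100.0]])
response_mask = torch.tensor([[1, 1, 0],
                              [1, 0, 0]])

masked = token_mean(per_token_loss, response_mask)  # averages 1, 3 and 2 only
naive = per_token_loss.mean()                       # also averages the padding
print(masked.item(), naive.item())
```

The masked mean ignores the padded positions entirely, whereas a plain mean would be dominated by them.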
Outputs
| Name | Type | Description |
|---|---|---|
| vf_loss | torch.Tensor | Aggregated clipped value-function loss (scalar tensor) |
| vf_clipfrac | float | Fraction of valid response tokens where the clipped loss exceeded the unclipped loss |
Usage Examples
import torch
from verl.trainer.ppo.core_algos import compute_value_loss

batch_size = 8
response_length = 128

# Current critic predictions
vpreds = torch.randn(batch_size, response_length) * 0.1

# Target returns from GAE computation
returns = torch.randn(batch_size, response_length) * 0.1

# Old baseline values (from previous iteration)
values = torch.randn(batch_size, response_length) * 0.1

# Response mask
response_mask = torch.ones(batch_size, response_length)

vf_loss, vf_clipfrac = compute_value_loss(
    vpreds=vpreds,
    returns=returns,
    values=values,
    response_mask=response_mask,
    cliprange_value=0.2,
    loss_agg_mode="token-mean",
)

# vf_loss is the scalar loss to backpropagate through the critic
# vf_clipfrac is a monitoring metric indicating update magnitude
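Why the clip fraction is worth monitoring can be seen from the gradients. The sketch below does not call verl. It rebuilds the clipped loss described in this page with plain PyTorch, on the assumption that the library follows the same formulation, and checks the gradient for each token. Where the clipped term is selected and the prediction lies outside the clip range, the clipped value is a constant with respect to vpreds, so that token contributes no gradient.

```python
import torch

values = torch.tensor([0.0, 0.0])    # old baseline values
returns = torch.tensor([1.0, 0.0])   # target returns
vpreds = torch.tensor([0.5, 0.5], requires_grad=True)
cliprange_value = 0.2

# Clip the predictions to [values - 0.2, values + 0.2]; both become 0.2 here.
vpreds_clipped = torch.max(
    torch.min(vpreds, values + cliprange_value),
    values - cliprange_value,
)

# Elementwise maximum of the unclipped and clipped squared errors.
per_token = torch.max((vpreds - returns) ** 2, (vpreds_clipped - returns) ** 2)
loss = 0.5 * per_token.mean()
loss.backward()

# Token 0: clipped term selected, so no gradient reaches vpreds.
# Token 1: unclipped term selected, so the usual squared-error gradient applies.
print(vpreds.grad)
```

Under these assumptions, a high clip fraction suggests that many tokens contribute no gradient to the critic in that step.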