Implementation:Volcengine Verl Compute Value Loss

Knowledge Sources	verl PPO
Domains	Reinforcement_Learning, Value_Function_Training
Last Updated	2026-02-07 14:00 GMT

Overview

Concrete tool for computing the clipped value-function loss used to train the critic network in PPO, provided by the verl library.

Description

The compute_value_loss function computes the clipped value-function loss for PPO training. It clips the current value predictions (vpreds) to within ±cliprange_value of the old baseline values (values), computes the squared error against the target returns for both the clipped and the unclipped predictions, and takes the element-wise maximum of the two. That maximum is aggregated over the tokens selected by response_mask according to loss_agg_mode, and the result is multiplied by 0.5 to give the final loss. This clipping mechanism limits how far the value function can move in a single update, which stabilizes training.
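Written out, the description above corresponds to the following, assuming that "token-mean" is a masked mean over response tokens. The symbols are notation introduced here, not taken from the verl source: V_θ is vpreds, V_old is values, R is returns, ε is cliprange_value, and m is response_mask.

```latex
\begin{aligned}
V^{\text{clip}}_t &= \operatorname{clip}\bigl(V_\theta(s_t),\; V_{\text{old}}(s_t) - \epsilon,\; V_{\text{old}}(s_t) + \epsilon\bigr), \\
\ell_t &= \max\Bigl(\bigl(V_\theta(s_t) - R_t\bigr)^2,\; \bigl(V^{\text{clip}}_t - R_t\bigr)^2\Bigr), \\
L^{VF} &= \frac{1}{2} \cdot \frac{\sum_t m_t\, \ell_t}{\sum_t m_t}.
\end{aligned}
```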

Usage

This function is called during the critic update step of the PPO training loop when a learned value function (critic) is used. It is relevant only when algorithm.adv_estimator=gae (not needed for GRPO, which is critic-free).

Code Reference

Source Location

Repository: verl
File: verl/trainer/ppo/core_algos.py
Lines: 1799-1838

Signature

def compute_value_loss(
    vpreds: torch.Tensor,
    returns: torch.Tensor,
    values: torch.Tensor,
    response_mask: torch.Tensor,
    cliprange_value: float,
    loss_agg_mode: str = "token-mean",
):
    """
    Compute the clipped value-function loss for PPO.

    Args:
        vpreds (torch.FloatTensor):
            Predicted values from the value head (batch_size, response_length).
        returns (torch.FloatTensor):
            Ground-truth returns (batch_size, response_length).
        values (torch.FloatTensor):
            Old (baseline) values from the value head (batch_size, response_length).
        response_mask (torch.Tensor):
            Mask indicating which tokens to include (batch_size, response_length).
        cliprange_value (float):
            Clip range for value prediction updates.
        loss_agg_mode (str, optional):
            Aggregation mode for the loss. Defaults to "token-mean".

    Returns:
        vf_loss (torch.FloatTensor):
            Aggregated value-function loss (scalar).
        vf_clipfrac (float):
            Fraction of elements where the clipped loss was used.
    """

Import

from verl.trainer.ppo.core_algos import compute_value_loss

I/O Contract

Inputs

Name	Type	Required	Description
vpreds	torch.Tensor	Yes	Current value predictions from the critic (batch_size, response_length)
returns	torch.Tensor	Yes	Target returns computed from GAE (batch_size, response_length)
values	torch.Tensor	Yes	Old value predictions recorded before the current update, used as the clipping baseline (batch_size, response_length)
response_mask	torch.Tensor	Yes	Binary mask for valid response tokens (batch_size, response_length)
cliprange_value	float	Yes	Clip range controlling how much value predictions can change per update
loss_agg_mode	str	No	Loss aggregation mode (default: "token-mean")
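To illustrate what response_mask and a token-level mean do to a per-token loss matrix, the sketch below contrasts two ways of aggregating. It uses plain Python lists so the arithmetic can be checked by hand. Both helper names are hypothetical and are not verl functions. It assumes that "token-mean" averages over all unmasked tokens in the batch. The per-sequence variant is included only as a contrast and is not claimed to match any particular verl mode.

```python
# Hypothetical helpers for illustration; these are not verl functions.
def token_mean(loss_rows, mask_rows):
    """Average over every unmasked token in the batch."""
    total = sum(l * m for lr, mr in zip(loss_rows, mask_rows) for l, m in zip(lr, mr))
    count = sum(m for mr in mask_rows for m in mr)
    return total / count


def per_sequence_mean(loss_rows, mask_rows):
    """Average within each sequence first, then across sequences."""
    seq_means = [
        sum(l * m for l, m in zip(lr, mr)) / sum(mr)
        for lr, mr in zip(loss_rows, mask_rows)
    ]
    return sum(seq_means) / len(seq_means)


# Two responses padded to length 4: the first has 4 valid tokens, the second only 1.
# The 9.0 entries sit under mask 0 and must not affect either result.
losses = [[1.0, 1.0, 1.0, 1.0], [4.0, 9.0, 9.0, 9.0]]
mask = [[1, 1, 1, 1], [1, 0, 0, 0]]

tm = token_mean(losses, mask)         # (1 + 1 + 1 + 1 + 4) / 5 valid tokens
sm = per_sequence_mean(losses, mask)  # (1.0 + 4.0) / 2 sequences
print(tm, sm)
```

With a token-level mean, long responses contribute more terms and therefore carry more weight. Averaging per sequence first weights every response equally regardless of length.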

Outputs

Name	Type	Description
vf_loss	torch.Tensor	Aggregated clipped value-function loss (scalar tensor)
vf_clipfrac	float	Fraction of tokens selected by response_mask where the clipped loss exceeded the unclipped loss
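The two outputs can be traced by hand on a four-token example. The sketch below mirrors the described computation element by element. It uses plain Python floats instead of tensors, and it assumes a masked token-level mean for aggregation. The function name clipped_value_loss_sketch is hypothetical, and this is not the verl implementation.

```python
def clipped_value_loss_sketch(vpreds, returns, values, mask, cliprange_value):
    """Hypothetical plain-Python mirror of the clipped value loss for a flat token list."""
    per_token = []
    clipped_flags = []
    for v_new, ret, v_old in zip(vpreds, returns, values):
        # Keep the new prediction within +/- cliprange_value of the old value.
        v_clipped = min(max(v_new, v_old - cliprange_value), v_old + cliprange_value)
        loss_unclipped = (v_new - ret) ** 2
        loss_clipped = (v_clipped - ret) ** 2
        # Pessimistic choice: use whichever squared error is larger.
        per_token.append(max(loss_unclipped, loss_clipped))
        clipped_flags.append(1.0 if loss_clipped > loss_unclipped else 0.0)
    n_valid = sum(mask)
    vf_loss = 0.5 * sum(l * m for l, m in zip(per_token, mask)) / n_valid
    vf_clipfrac = sum(f * m for f, m in zip(clipped_flags, mask)) / n_valid
    return vf_loss, vf_clipfrac


values = [0.0, 0.0, 0.0, 0.0]     # old predictions
vpreds = [1.0, 0.25, 1.0, 100.0]  # new predictions
returns = [2.0, 1.0, 0.0, 0.0]    # targets
mask = [1, 1, 1, 0]               # last token is padding

vf_loss, vf_clipfrac = clipped_value_loss_sketch(vpreds, returns, values, mask, 0.5)
# Token 1: moved toward the target past the clip range -> clipped error 2.25 is used.
# Token 2: stayed inside the clip range -> both errors equal 0.5625.
# Token 3: moved away from the target -> unclipped error 1.0 is used.
# Token 4: masked out, so its large error is ignored.
print(vf_loss, vf_clipfrac)
```

Only token 1 counts toward vf_clipfrac, giving 1/3. The clipped prediction for that token is a constant with respect to the critic's parameters, so the token contributes no gradient even though it adds to the loss value.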

Usage Examples

import torch
from verl.trainer.ppo.core_algos import compute_value_loss

batch_size = 8
response_length = 128

# Current critic predictions
vpreds = torch.randn(batch_size, response_length) * 0.1

# Target returns from GAE computation
returns = torch.randn(batch_size, response_length) * 0.1

# Old baseline values (critic predictions recorded before this update)
values = torch.randn(batch_size, response_length) * 0.1

# Response mask
response_mask = torch.ones(batch_size, response_length)

vf_loss, vf_clipfrac = compute_value_loss(
    vpreds=vpreds,
    returns=returns,
    values=values,
    response_mask=response_mask,
    cliprange_value=0.2,
    loss_agg_mode="token-mean",
)

# vf_loss is the scalar loss to backpropagate through the critic
# vf_clipfrac is a monitoring metric: the fraction of response tokens where
# the clipped loss exceeded the unclipped loss

Related Pages

Implements Principle

Principle:Volcengine_Verl_Value_Loss_Optimization
