Implementation:Volcengine Verl Compute Value Loss

From Leeroopedia


Knowledge Sources
Domains: Reinforcement_Learning, Value_Function_Training
Last Updated: 2026-02-07 14:00 GMT

Overview

Concrete tool for computing the clipped value-function loss used to train the critic network in PPO, provided by the verl library.

Description

The compute_value_loss function computes the clipped value-function loss for PPO training. It first clips the current value predictions (vpreds) to within ±cliprange_value of the old baseline values (values), then takes the element-wise maximum of the clipped and unclipped squared errors against the target returns. The final loss is 0.5 times this maximum, aggregated over the tokens selected by response_mask according to loss_agg_mode. This clipping mechanism limits how far the value function can move from its old predictions in a single update, which stabilizes training.
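
The description above can be written compactly as follows. This is a reconstruction from the prose, not a formula quoted from the verl source. The symbols are introduced here for illustration: V_θ stands for vpreds, V_old for values, R for returns, ε for cliprange_value, and agg for the aggregation over tokens where response_mask is set, as chosen by loss_agg_mode.

```latex
V^{\text{clip}}_t = \operatorname{clip}\!\left(V_\theta(s_t),\; V_{\text{old}}(s_t) - \epsilon,\; V_{\text{old}}(s_t) + \epsilon\right)

L^{V} = \frac{1}{2}\,\operatorname{agg}_t\!\left[\max\!\left(\left(V_\theta(s_t) - R_t\right)^2,\; \left(V^{\text{clip}}_t - R_t\right)^2\right)\right]
```

Assuming "token-mean" denotes a masked mean over all valid tokens in the batch, agg reduces to the sum of the per-token maxima over masked positions divided by the number of masked positions.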

Usage

This function is called during the critic update step of the PPO training loop when a learned value function (critic) is used. It is relevant only when algorithm.adv_estimator=gae (not needed for GRPO, which is critic-free).

Code Reference

Source Location

  • Repository: verl
  • File: verl/trainer/ppo/core_algos.py
  • Lines: 1799-1838

Signature

def compute_value_loss(
    vpreds: torch.Tensor,
    returns: torch.Tensor,
    values: torch.Tensor,
    response_mask: torch.Tensor,
    cliprange_value: float,
    loss_agg_mode: str = "token-mean",
):
    """
    Compute the clipped value-function loss for PPO.

    Args:
        vpreds (torch.FloatTensor):
            Predicted values from the value head (batch_size, response_length).
        returns (torch.FloatTensor):
            Ground-truth returns (batch_size, response_length).
        values (torch.FloatTensor):
            Old (baseline) values from the value head (batch_size, response_length).
        response_mask (torch.Tensor):
            Mask indicating which tokens to include (batch_size, response_length).
        cliprange_value (float):
            Clip range for value prediction updates.
        loss_agg_mode (str, optional):
            Aggregation mode for the loss. Defaults to "token-mean".

    Returns:
        vf_loss (torch.FloatTensor):
            Aggregated value-function loss (scalar).
        vf_clipfrac (float):
            Fraction of elements where the clipped loss was used.
    """

Import

from verl.trainer.ppo.core_algos import compute_value_loss

I/O Contract

Inputs

Name | Type | Required | Description
vpreds | torch.Tensor | Yes | Current value predictions from the critic (batch_size, response_length)
returns | torch.Tensor | Yes | Target returns computed from GAE (batch_size, response_length)
values | torch.Tensor | Yes | Old baseline value predictions from the previous iteration (batch_size, response_length)
response_mask | torch.Tensor | Yes | Binary mask for valid response tokens (batch_size, response_length)
cliprange_value | float | Yes | Clip range controlling how much value predictions can change per update
loss_agg_mode | str | No | Loss aggregation mode (default: "token-mean")

Outputs

Name | Type | Description
vf_loss | torch.Tensor | Aggregated clipped value-function loss (scalar tensor)
vf_clipfrac | float | Fraction of elements where the clipped loss exceeded the unclipped loss
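
To make the contract concrete, here is a dependency-free sketch of the same arithmetic on nested Python lists. It is not the verl implementation: the name clipped_value_loss_sketch is hypothetical, only the "token-mean" case is covered, "token-mean" is assumed to be a masked mean over all valid tokens, and returning zeros for an all-masked batch is a choice made here for illustration.

```python
def clipped_value_loss_sketch(vpreds, returns, values, response_mask, cliprange_value):
    """Illustrative token-mean clipped value loss on nested lists (not verl's code)."""
    total_loss = 0.0      # sum of per-token max(unclipped, clipped) squared errors
    clipped_tokens = 0    # tokens where the clipped error is strictly larger
    valid_tokens = 0      # tokens selected by the mask

    rows = zip(vpreds, returns, values, response_mask)
    for vp_row, r_row, v_row, m_row in rows:
        for vp, r, v, m in zip(vp_row, r_row, v_row, m_row):
            if not m:
                continue  # masked-out tokens contribute nothing
            # Keep the new prediction within +/- cliprange_value of the old value.
            vp_clipped = min(max(vp, v - cliprange_value), v + cliprange_value)
            loss_unclipped = (vp - r) ** 2
            loss_clipped = (vp_clipped - r) ** 2
            total_loss += max(loss_unclipped, loss_clipped)
            clipped_tokens += loss_clipped > loss_unclipped
            valid_tokens += 1

    if valid_tokens == 0:
        return 0.0, 0.0   # assumption: define the empty case as zero
    vf_loss = 0.5 * total_loss / valid_tokens
    vf_clipfrac = clipped_tokens / valid_tokens
    return vf_loss, vf_clipfrac


# One sequence of three tokens; the last token is masked out.
vf_loss, vf_clipfrac = clipped_value_loss_sketch(
    vpreds=[[0.5, 0.1, 0.5]],
    returns=[[1.0, 0.0, 0.0]],
    values=[[0.0, 0.0, 0.0]],
    response_mask=[[1, 1, 0]],
    cliprange_value=0.2,
)
print(vf_loss, vf_clipfrac)
```

In this toy input the first token moves 0.5 away from its old value, so it is clipped to 0.2 and the clipped error (about 0.64) is larger than the unclipped one (0.25). The second token stays inside the clip range, so both errors coincide and it does not count toward vf_clipfrac.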

Usage Examples

import torch
from verl.trainer.ppo.core_algos import compute_value_loss

batch_size = 8
response_length = 128

# Current critic predictions
vpreds = torch.randn(batch_size, response_length) * 0.1

# Target returns from GAE computation
returns = torch.randn(batch_size, response_length) * 0.1

# Old baseline values (from previous iteration)
values = torch.randn(batch_size, response_length) * 0.1

# Response mask
response_mask = torch.ones(batch_size, response_length)

vf_loss, vf_clipfrac = compute_value_loss(
    vpreds=vpreds,
    returns=returns,
    values=values,
    response_mask=response_mask,
    cliprange_value=0.2,
    loss_agg_mode="token-mean",
)

# vf_loss is the scalar loss to backpropagate through the critic
# vf_clipfrac is a monitoring metric: the fraction of masked tokens where the
# clipped squared error exceeded the unclipped one

Related Pages

Implements Principle
