Principle:Volcengine Verl Value Loss Optimization

Knowledge Sources	Proximal Policy Optimization Algorithms; High-Dimensional Continuous Control Using Generalized Advantage Estimation
Domains	Reinforcement_Learning, Value_Function, Deep_Learning
Last Updated	2026-02-07 14:00 GMT

Overview

A clipped value function loss that trains a critic network to predict per-token returns while constraining updates to prevent catastrophic value function changes.

Description

Value Loss Optimization trains the critic (value function) in PPO actor-critic architectures. The critic learns to predict the expected return at each token position, providing baselines for GAE advantage estimation.

Like the PPO policy loss, the value loss uses a clipping mechanism to limit the size of each update. The new value predictions are clipped to stay within a fixed range of the old (rollout-time) predictions, and the loss is the element-wise maximum of the clipped and unclipped squared errors. Taking the maximum makes the objective pessimistic: once a prediction has moved outside the clip range, moving it further does not reduce the loss, which keeps the value function from making aggressive changes that could destabilize training.
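To make the clipping behaviour concrete, the following minimal sketch evaluates the loss for a single token in plain Python. The helper name `clipped_value_loss` and all numbers are illustrative assumptions; they are not taken from the verl codebase.

```python
def clipped_value_loss(vpred, old_value, ret, clip_range):
    """Per-token clipped value loss (illustrative helper, not a verl API)."""
    # Keep the new prediction within clip_range of the old prediction.
    delta = max(-clip_range, min(clip_range, vpred - old_value))
    vpred_clipped = old_value + delta
    loss_unclipped = (vpred - ret) ** 2
    loss_clipped = (vpred_clipped - ret) ** 2
    # Taking the larger error is the conservative (pessimistic) choice.
    return max(loss_unclipped, loss_clipped)


# Old prediction 1.0, return 2.0, clip range 0.5 (made-up numbers).
small_step = clipped_value_loss(vpred=1.25, old_value=1.0, ret=2.0, clip_range=0.5)
large_step = clipped_value_loss(vpred=1.75, old_value=1.0, ret=2.0, clip_range=0.5)
print(small_step, large_step)  # 0.5625 0.25
```

In the second call the prediction has moved 0.75 away from the old value, beyond the 0.5 clip range, so the loss is taken from the clipped prediction 1.5. That term no longer depends on `vpred`, so pushing the prediction further toward the return yields no additional reduction in the loss.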

This principle applies only to PPO workflows (with algorithm.adv_estimator=gae). GRPO workflows do not require a critic and therefore do not use a value loss.

Usage

Use value loss optimization when training with a full actor-critic PPO architecture that includes a learned value function. The critic loss is computed alongside the policy loss during the training step and is used to update the critic model parameters.

Theoretical Basis

The clipped value loss at token position $t$ is:

$L^{VF} = \max\left( (V_{\theta}(s_{t}) - G_{t})^{2},\ (\mathrm{clip}(V_{\theta}(s_{t}),\ V_{\mathrm{old}}(s_{t}) - \epsilon_{v},\ V_{\mathrm{old}}(s_{t}) + \epsilon_{v}) - G_{t})^{2} \right)$

Where:

$V_{\theta}(s_{t})$ is the current value prediction
$V_{\mathrm{old}}(s_{t})$ is the old value prediction (from rollout)
$G_{t}$ is the computed return (from GAE: $A_{t} + V_{\mathrm{old}}(s_{t})$)
$\epsilon_{v}$ is the value clip range
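The return target $G_{t}$ can be illustrated with a small dependency-free sketch of the GAE recursion followed by the addition of the old values. The function name `gae_returns`, the default `gamma` and `lam` values, the zero bootstrap value after the last token, and the sample numbers are all assumptions made for illustration; they are not taken from the verl implementation.

```python
def gae_returns(rewards, values, gamma=1.0, lam=1.0):
    """Return (advantages, returns) for one trajectory (illustrative helper).

    values[t] plays the role of V_old(s_t). The value after the final
    step is assumed to be 0, i.e. the episode ends with the last token.
    """
    advantages = [0.0] * len(rewards)
    next_value = 0.0
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD error: r_t + gamma * V(s_{t+1}) - V(s_t)
        td_error = rewards[t] + gamma * next_value - values[t]
        # GAE recursion: A_t = td_error_t + gamma * lam * A_{t+1}
        running = td_error + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    # G_t = A_t + V_old(s_t)
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns


rewards = [0.0, 0.0, 1.0]        # made-up sparse reward on the last token
old_values = [0.5, 0.25, 0.75]   # made-up rollout-time value predictions
advantages, returns = gae_returns(rewards, old_values)
print(advantages)  # [0.5, 0.75, 0.25]
print(returns)     # [1.0, 1.0, 1.0]
```

With `gamma = lam = 1` the returns reduce to the undiscounted sum of future rewards, which is why every position in this example has a target of 1.0 regardless of the old value predictions.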

Pseudo-code (the per-token loss is averaged over positions and scaled by 0.5):

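# Abstract value loss computation (all operations are element-wise per token)
# Clip the new prediction to within cliprange_v of the old prediction
vpred_clipped = old_values + clip(vpred - old_values, -cliprange_v, cliprange_v)
vf_loss1 = (vpred - returns) ** 2          # unclipped squared error
vf_loss2 = (vpred_clipped - returns) ** 2  # clipped squared error
# Pessimistic element-wise maximum, averaged over tokens and scaled by 0.5
vf_loss = 0.5 * max(vf_loss1, vf_loss2).mean()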
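The pseudo-code above takes a plain mean. In padded batches it is common to average only over real tokens, so the sketch below adds a mask to the same computation. The `mask` argument, the function name `masked_value_loss`, and the sample values are assumptions for illustration and do not describe the actual verl function signature.

```python
def masked_value_loss(vpreds, old_values, returns, mask, cliprange_v):
    """Clipped value loss averaged over unmasked positions (illustrative)."""
    total, count = 0.0, 0
    for vpred, old, ret, keep in zip(vpreds, old_values, returns, mask):
        if not keep:
            continue  # skip padding positions
        # Clip the new prediction to within cliprange_v of the old one.
        vpred_clipped = old + max(-cliprange_v, min(cliprange_v, vpred - old))
        vf_loss1 = (vpred - ret) ** 2          # unclipped squared error
        vf_loss2 = (vpred_clipped - ret) ** 2  # clipped squared error
        total += max(vf_loss1, vf_loss2)
        count += 1
    # Guard against an all-padding sequence to avoid division by zero.
    return 0.5 * total / max(count, 1)


loss = masked_value_loss(
    vpreds=[1.25, 1.75, 9.0],
    old_values=[1.0, 1.0, 1.0],
    returns=[2.0, 2.0, 2.0],
    mask=[1, 1, 0],      # last position is padding and is ignored
    cliprange_v=0.5,
)
print(loss)  # 0.203125
```

The outlier prediction at the padded position has no effect on the result, which is the point of masking: only positions that belong to the generated response contribute to the critic update.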

Related Pages

Implemented By

Implementation:Volcengine_Verl_Compute_Value_Loss
