Principle:Volcengine Verl Value Loss Optimization

From Leeroopedia


Knowledge Sources
Domains Reinforcement_Learning, Value_Function, Deep_Learning
Last Updated 2026-02-07 14:00 GMT

Overview

A clipped value function loss that trains a critic network to predict per-token returns while constraining updates to prevent catastrophic value function changes.

Description

Value Loss Optimization trains the critic (value function) in PPO actor-critic architectures. The critic learns to predict the expected return at each token position, providing baselines for GAE advantage estimation.

Similar to the policy loss, the value loss uses a clipping mechanism to prevent large updates. The value function predictions are clipped relative to the old predictions, and the loss is the maximum of the clipped and unclipped squared errors. This conservative update strategy prevents the value function from making overly aggressive changes that could destabilize training.

This principle is only used in PPO workflows (with algorithm.adv_estimator=gae). GRPO workflows do not require a critic and thus do not use value loss.
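As a rough illustration of the rule above, the decision can be sketched as a small helper. The function name `needs_value_loss` is hypothetical and not part of verl. Only the `gae` value of `algorithm.adv_estimator` comes from the text; the `"grpo"` string in the usage lines is an assumption.

```python
def needs_value_loss(adv_estimator: str) -> bool:
    """Hypothetical helper: True when the workflow trains a critic.

    Per the text, only the GAE-based PPO workflow uses the value loss.
    """
    return adv_estimator.strip().lower() == "gae"


# Usage: decide whether to build a critic and compute the value loss.
print(needs_value_loss("gae"))   # PPO with GAE
print(needs_value_loss("grpo"))  # assumed name for a critic-free estimator
```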

Usage

Use value loss optimization when training with a full actor-critic PPO architecture that includes a learned value function. The critic loss is computed alongside the policy loss during the training step and is used to update the critic model parameters.

Theoretical Basis

The clipped value loss is:

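$$L^{VF} = \max\Big( \big(V_\theta(s_t) - G_t\big)^2,\ \big(\operatorname{clip}\big(V_\theta(s_t),\ V_{old}(s_t) - \epsilon_v,\ V_{old}(s_t) + \epsilon_v\big) - G_t\big)^2 \Big)$$

Where:

  • $V_\theta(s_t)$ is the current value prediction
  • $V_{old}(s_t)$ is the old value prediction (from rollout)
  • $G_t$ is the computed return (from GAE: $A_t + V_{old}(s_t)$)
  • $\epsilon_v$ is the value clip range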
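The return target G_t = A_t + V_old(s_t) from the list above can be sketched as follows. This is a minimal, framework-free illustration and not the library's implementation: the names `gae_returns`, `gamma`, and `lam` are assumptions, rewards are taken to be per-token, and the value after the last token is assumed to be 0.

```python
def gae_returns(rewards, old_values, gamma=1.0, lam=1.0):
    """Return (advantages, returns) for one sequence using GAE.

    rewards and old_values are equal-length lists of floats.
    The value after the final token is assumed to be 0.
    """
    T = len(rewards)
    advantages = [0.0] * T
    last_gae = 0.0
    for t in reversed(range(T)):
        next_value = old_values[t + 1] if t + 1 < T else 0.0
        # One-step TD error at position t.
        delta = rewards[t] + gamma * next_value - old_values[t]
        # Exponentially weighted sum of future TD errors.
        last_gae = delta + gamma * lam * last_gae
        advantages[t] = last_gae
    # G_t = A_t + V_old(s_t)
    returns = [a + v for a, v in zip(advantages, old_values)]
    return advantages, returns


# Usage: a single reward on the last token, constant old value estimates.
adv, ret = gae_returns([0.0, 0.0, 1.0], [0.5, 0.5, 0.5], gamma=1.0, lam=1.0)
print(adv, ret)
```

With `gamma = lam = 1`, the return at every position reduces to the sum of future rewards, which is a convenient sanity check for the recursion.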

Pseudo-code:

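# Abstract value loss computation (all operations are element-wise over token positions)
# Keep the new prediction within cliprange_v of the old prediction
vpred_clipped = old_values + clip(vpred - old_values, -cliprange_v, cliprange_v)
vf_loss1 = (vpred - returns) ** 2          # unclipped squared error
vf_loss2 = (vpred_clipped - returns) ** 2  # clipped squared error
# Element-wise max (the pessimistic term), averaged over tokens and scaled by a constant 0.5
vf_loss = 0.5 * max(vf_loss1, vf_loss2).mean()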
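The pseudo-code above can be turned into a small runnable sketch in plain Python. The function name `clipped_value_loss` is hypothetical, and the unmasked mean over all positions is a simplifying assumption; a real trainer would typically average only over valid response tokens.

```python
def clipped_value_loss(vpreds, old_values, returns, cliprange_v):
    """Mean over tokens of 0.5 * max(unclipped, clipped) squared error."""
    per_token = []
    for v, v_old, g in zip(vpreds, old_values, returns):
        # Keep the new prediction within cliprange_v of the old one.
        v_clipped = min(max(v, v_old - cliprange_v), v_old + cliprange_v)
        loss_unclipped = (v - g) ** 2
        loss_clipped = (v_clipped - g) ** 2
        # Pessimistic choice: take the larger of the two errors.
        per_token.append(max(loss_unclipped, loss_clipped))
    return 0.5 * sum(per_token) / len(per_token)


# Usage: old prediction 0.0, return 1.0, clip range 0.5.
inside = clipped_value_loss([0.25], [0.0], [1.0], 0.5)   # within the clip range
beyond_a = clipped_value_loss([0.7], [0.0], [1.0], 0.5)  # past the range, toward the return
beyond_b = clipped_value_loss([0.9], [0.0], [1.0], 0.5)  # further past the range
away = clipped_value_loss([-0.9], [0.0], [1.0], 0.5)     # past the range, away from the return
print(inside, beyond_a, beyond_b, away)
```

In this example, `beyond_a` and `beyond_b` are equal: once the prediction has moved more than the clip range toward the return, the clipped term dominates and no longer depends on the prediction, so further movement in that direction earns no extra reduction in loss. A prediction that moves away from the return (`away`) is still penalized by the unclipped term.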

Related Pages

Implemented By
