Principle:Volcengine Verl Value Loss Optimization
| Knowledge Sources | |
|---|---|
| Domains | Reinforcement_Learning, Value_Function, Deep_Learning |
| Last Updated | 2026-02-07 14:00 GMT |
Overview
A clipped value function loss that trains a critic network to predict per-token returns while constraining updates to prevent catastrophic value function changes.
Description
Value Loss Optimization trains the critic (value function) in PPO actor-critic architectures. The critic learns to predict the expected return at each token position, providing baselines for GAE advantage estimation.
Similar to the policy loss, the value loss uses a clipping mechanism to prevent large updates. The value function predictions are clipped relative to the old predictions, and the loss is the maximum of the clipped and unclipped squared errors. This conservative update strategy prevents the value function from making overly aggressive changes that could destabilize training.
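To make the effect of the clipping concrete, here is a minimal single-token sketch in plain Python. The helper name clipped_value_loss and all numbers are illustrative assumptions and are not taken from the verl code base; a real implementation would operate on tensors.

```python
def clipped_value_loss(vpred, old_value, ret, cliprange_v):
    """Clipped squared-error loss for one token (hypothetical helper)."""
    # Keep the new prediction within +/- cliprange_v of the old prediction.
    delta = max(-cliprange_v, min(cliprange_v, vpred - old_value))
    vpred_clipped = old_value + delta
    loss_unclipped = (vpred - ret) ** 2
    loss_clipped = (vpred_clipped - ret) ** 2
    # Taking the larger of the two errors is the pessimistic choice.
    return 0.5 * max(loss_unclipped, loss_clipped)


# The prediction jumped well past the clip range, all the way to the return:
# the unclipped error is 0, but the clipped error still counts.
far = clipped_value_loss(vpred=2.0, old_value=0.0, ret=2.0, cliprange_v=0.5)

# The prediction stayed inside the clip range: both terms coincide.
near = clipped_value_loss(vpred=0.25, old_value=0.0, ret=2.0, cliprange_v=0.5)
```

In the first call the clipped term dominates and no longer depends on vpred, so moving the prediction further beyond the clip range earns no additional loss reduction within this update.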
This principle applies only to PPO workflows (with algorithm.adv_estimator=gae). GRPO workflows estimate advantages without a critic and therefore do not use a value loss.
Usage
Use value loss optimization when training with a full actor-critic PPO architecture that includes a learned value function. The critic loss is computed alongside the policy loss during the training step and is used to update the critic model parameters.
Theoretical Basis
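The clipped value loss is:

$$L^{VF} = \frac{1}{2}\,\mathbb{E}_t\left[\max\left(\left(V_\theta(s_t) - R_t\right)^2,\ \left(V_{\text{old}}(s_t) + \text{clip}\left(V_\theta(s_t) - V_{\text{old}}(s_t),\, -\epsilon_v,\, \epsilon_v\right) - R_t\right)^2\right)\right]$$

Where:
- $V_\theta(s_t)$ is the current value prediction
- $V_{\text{old}}(s_t)$ is the old value prediction (from rollout)
- $R_t$ is the computed return (from GAE: $R_t = \hat{A}_t + V_{\text{old}}(s_t)$)
- $\epsilon_v$ is the value clip range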
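The return that serves as the critic's regression target comes from GAE. The sketch below assumes the common convention that the return is the GAE advantage plus the rollout-time value estimate, and that the value after the final token is 0. The helper name gae_returns, its defaults, and the sample numbers are hypothetical.

```python
def gae_returns(rewards, values, gamma=1.0, lam=1.0):
    """Compute per-token GAE advantages and returns (hypothetical helper).

    rewards, values: per-token lists of equal length for one response.
    The value after the last token is assumed to be 0.
    """
    advantages = [0.0] * len(rewards)
    last_adv = 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        td_error = rewards[t] + gamma * next_value - values[t]
        last_adv = td_error + gamma * lam * last_adv
        advantages[t] = last_adv
    # Critic target: advantage plus the old value estimate.
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns


# A single reward on the final token, with a flat old value estimate.
adv, ret = gae_returns(rewards=[0.0, 0.0, 1.0], values=[0.5, 0.5, 0.5])
```

With gamma and lam both set to 1, the returns reduce to the plain reward-to-go, which is why every position in this example gets a target of 1.0.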
Pseudo-code:
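# Abstract value loss computation (all quantities are per-token arrays)
# Limit the new prediction to within cliprange_v of the old prediction
vpred_clipped = old_values + clip(vpred - old_values, -cliprange_v, cliprange_v)
vf_loss1 = (vpred - returns) ** 2          # unclipped squared error
vf_loss2 = (vpred_clipped - returns) ** 2  # clipped squared error
# Element-wise maximum, then mean over tokens
vf_loss = 0.5 * max(vf_loss1, vf_loss2).mean()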
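The pseudo-code averages over every element. Batches of variable-length responses are typically padded, so an implementation usually restricts the mean to real tokens. The following plain-Python sketch shows one way to do that; the mask argument, the token-mean aggregation, and the function name masked_value_loss are assumptions made for illustration, not the verl API.

```python
def masked_value_loss(vpred, old_values, returns, mask, cliprange_v):
    """Clipped value loss averaged over positions where mask is 1."""
    total, count = 0.0, 0
    for v, v_old, r, m in zip(vpred, old_values, returns, mask):
        if not m:
            continue  # skip padding positions
        v_clipped = v_old + max(-cliprange_v, min(cliprange_v, v - v_old))
        total += max((v - r) ** 2, (v_clipped - r) ** 2)
        count += 1
    # Guard against an all-padding input to avoid division by zero.
    return 0.5 * total / max(count, 1)


loss = masked_value_loss(
    vpred=[1.0, 0.0, 9.0],
    old_values=[0.0, 0.0, 0.0],
    returns=[1.0, 1.0, 0.0],
    mask=[1, 1, 0],  # the third position is padding and is ignored
    cliprange_v=0.5,
)
```

The large error at the padded position does not affect the result, which is the point of masking before the mean.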