Implementation:OpenRLHF OpenRLHF ValueLoss

Knowledge Sources	OpenRLHF
Domains	Reinforcement_Learning, Loss_Functions
Last Updated	2026-02-07 00:00 GMT

Overview

ValueLoss is the OpenRLHF module that computes the clipped value-function loss used to train the PPO critic.

Description

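The ValueLoss class computes the squared error between predicted values and returns. When clip_eps is set, it also computes the squared error of a clipped prediction, obtained by restricting the new values to within clip_eps of the old values, and takes the element-wise maximum of the clipped and unclipped errors. When clip_eps is None, only the unclipped squared error is used. The result is averaged and multiplied by 0.5.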
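The computation described above can be sketched in plain PyTorch. This is a minimal illustration, not the OpenRLHF source: the function name clipped_value_loss is hypothetical, and the plain mean over all elements (with no masking) is an assumption made to keep the example short.

```python
from typing import Optional

import torch


def clipped_value_loss(
    values: torch.Tensor,
    old_values: torch.Tensor,
    returns: torch.Tensor,
    clip_eps: Optional[float] = None,
) -> torch.Tensor:
    """Hypothetical stand-in for the computation described above (not OpenRLHF code)."""
    unclipped = (values - returns) ** 2
    if clip_eps is None:
        per_token = unclipped
    else:
        # Keep the new prediction within clip_eps of the old prediction.
        values_clipped = old_values + (values - old_values).clamp(-clip_eps, clip_eps)
        clipped = (values_clipped - returns) ** 2
        # Element-wise maximum: keep the larger of the two squared errors.
        per_token = torch.max(clipped, unclipped)
    # Assumption: plain mean over all elements; masking is omitted here.
    return 0.5 * per_token.mean()


values = torch.tensor([[1.0, 0.0]])
old_values = torch.tensor([[0.0, 0.0]])
returns = torch.tensor([[1.0, 1.0]])

loss_clipped = clipped_value_loss(values, old_values, returns, clip_eps=0.2)
loss_plain = clipped_value_loss(values, old_values, returns)
```

In this toy input the first prediction already equals its return, so its unclipped error is zero, but it moved further than clip_eps from the old value. The clipped term therefore dominates for that element, which is why loss_clipped is larger than loss_plain.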

Usage

The PPO critic trainer instantiates ValueLoss and calls it at each training step with the current value predictions, the old value predictions recorded at rollout, and the computed returns.
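To show where the three inputs come from in a training step, here is a hedged, self-contained sketch. The toy critic (a single nn.Linear), the random data, and the inlined loss are all assumptions for illustration. The real trainer, model, and return computation in OpenRLHF are not reproduced here.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical toy critic: maps a 4-dim feature per token to one scalar value.
critic = nn.Linear(4, 1)
optimizer = torch.optim.SGD(critic.parameters(), lr=0.1)

features = torch.randn(2, 5, 4)  # (batch, tokens, features), made-up data
returns = torch.randn(2, 5)      # stand-in for returns computed from GAE

# Rollout phase: record value predictions without tracking gradients.
with torch.no_grad():
    old_values = critic(features).squeeze(-1)

clip_eps = 0.2
losses = []
for _ in range(3):  # assumed: a few optimisation steps reuse the same rollout
    values = critic(features).squeeze(-1)  # current predictions, with gradients
    values_clipped = old_values + (values - old_values).clamp(-clip_eps, clip_eps)
    per_token = torch.max((values_clipped - returns) ** 2, (values - returns) ** 2)
    loss = 0.5 * per_token.mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    losses.append(loss.item())
```

Only values carries gradients. old_values and returns act as fixed targets for the update.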

Code Reference

Source Location

Repository: OpenRLHF
File: openrlhf/models/loss.py
Lines: L185-215

Signature

class ValueLoss(nn.Module):
    def __init__(
        self,
        clip_eps: float = None,        # Value clip range (None = no clipping)
        token_level_loss: bool = True,  # Token vs sequence level
    ) -> None:

    def forward(
        self,
        values: torch.Tensor,          # Current value predictions
        old_values: torch.Tensor,      # Previous value predictions
        returns: torch.Tensor,         # Computed returns (from GAE)
        action_mask: Optional[torch.Tensor] = None,
    ) -> torch.Tensor:
        """Returns scalar value loss (0.5 * MSE)."""

Import

from openrlhf.models import ValueLoss

I/O Contract

Inputs

Name	Type	Required	Description
values	Tensor	Yes	Current critic value predictions
old_values	Tensor	Yes	Previous value predictions (from rollout)
returns	Tensor	Yes	Computed returns from GAE
action_mask	Tensor	No	Binary mask for action tokens
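The effect of action_mask, and of averaging at token level versus sequence level, can be illustrated with a small sketch. The helper masked_mean below is written for this example only, and both aggregation schemes are assumptions about what "token vs sequence level" means. The source does not spell out the exact reduction.

```python
import torch


def masked_mean(x: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Hypothetical helper: mean of x over positions where mask == 1."""
    return (x * mask).sum() / mask.sum()


# Made-up per-token losses; the 100.0 entries sit at masked (non-action) positions.
per_token_loss = torch.tensor([[1.0, 3.0, 100.0],
                               [5.0, 100.0, 100.0]])
action_mask = torch.tensor([[1.0, 1.0, 0.0],
                            [1.0, 0.0, 0.0]])

# Token-level: every unmasked token in the batch has equal weight.
token_level = masked_mean(per_token_loss, action_mask)

# Sequence-level: average within each sequence first, then across sequences.
per_sequence = (per_token_loss * action_mask).sum(dim=-1) / action_mask.sum(dim=-1)
sequence_level = per_sequence.mean()
```

With these numbers the token-level average is 3.0 and the sequence-level average is 3.5. The two differ because the shorter sequence counts for more under sequence-level averaging.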

Outputs

Name	Type	Description
loss	Tensor	Scalar value loss (0.5 * clipped MSE)
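For reference, the scalar output can be written as a formula. The symbols are introduced here for illustration: V_t is values, V_t^old is old_values, R_t is returns, epsilon is clip_eps, and m_t is action_mask. Normalising by the number of unmasked tokens is an assumption corresponding to a token-level average.

```latex
\[
V_t^{\mathrm{clip}} = V_t^{\mathrm{old}}
  + \operatorname{clip}\!\left(V_t - V_t^{\mathrm{old}},\, -\epsilon,\, \epsilon\right)
\]
\[
L_V = \frac{1}{2}\,
  \frac{\sum_t m_t \,\max\!\left( \left(V_t^{\mathrm{clip}} - R_t\right)^2,\;
                                  \left(V_t - R_t\right)^2 \right)}
       {\sum_t m_t}
\]
```

When clip_eps is None, the maximum reduces to the unclipped term (V_t - R_t)^2.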

Usage Examples

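import torch
from openrlhf.models import ValueLoss

# Illustrative random inputs, each of shape (batch, num_action_tokens)
values, old_values, returns = torch.randn(3, 2, 8)
action_mask = torch.ones(2, 8)
value_loss_fn = ValueLoss(clip_eps=0.2)
v_loss = value_loss_fn(values, old_values, returns, action_mask)  # scalar tensor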

Related Pages

Implements Principle

Principle:OpenRLHF_OpenRLHF_PPO_Value_Loss
