Implementation: OpenRLHF PolicyLoss
| Knowledge Sources | |
|---|---|
| Domains | Reinforcement_Learning, Loss_Functions |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
A concrete tool provided by OpenRLHF for computing PPO/GSPO policy-gradient losses with clipping.
Description
The PolicyLoss class implements PPO's clipped surrogate objective with extensions for dual-clip PPO, GSPO (sequence-level ratios), and vLLM importance sampling corrections (TIS, ICEPop, seq-mask-TIS). It returns the loss, clip ratio (fraction of clipped updates), and KL divergence estimates.
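The core of the class is PPO's clipped surrogate objective. A minimal sketch of that computation (a standalone function for illustration, not OpenRLHF's exact implementation, which also handles masking, dual-clip, and IS corrections):

```python
import torch

def clipped_surrogate(log_probs, old_log_probs, advantages,
                      clip_eps_low=0.2, clip_eps_high=0.2):
    # Probability ratio r_t = pi_theta(a|s) / pi_old(a|s), computed in log space.
    ratio = (log_probs - old_log_probs).exp()
    surr1 = ratio * advantages
    # Clamp the ratio to [1 - eps_low, 1 + eps_high] before weighting advantages.
    surr2 = ratio.clamp(1 - clip_eps_low, 1 + clip_eps_high) * advantages
    # PPO maximizes min(surr1, surr2); as a loss we negate it.
    return -torch.min(surr1, surr2).mean()
```

With ratios of 1.5 and 0.5 (both outside the default 0.2 trust region), the clamped branch wins for the over-ratio token and the unclamped branch for the under-ratio token, which is exactly the pessimistic `min` behavior that makes clipping conservative.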
Usage
Instantiated by the PPO trainer. Called each training step with current and old log-probabilities, advantages, and action masks.
Code Reference
Source Location
- Repository: OpenRLHF
- File: openrlhf/models/loss.py
- Lines: L75-182
Signature
```python
class PolicyLoss(nn.Module):
    def __init__(
        self,
        clip_eps_low: float = 0.2,        # Lower clip epsilon
        clip_eps_high: float = 0.2,       # Upper clip epsilon
        dual_clip: float = None,          # Dual-clip threshold (None = disabled)
        token_level_loss: bool = True,    # Token- vs sequence-level loss
        policy_loss_type: str = "ppo",    # "ppo" or "gspo"
        enable_vllm_is_correction: bool = False,
        vllm_is_truncated_threshold: list = None,
        vllm_is_correction_type: str = "tis",  # "tis", "icepop", "seq-mask-tis"
    ) -> None: ...

    def forward(
        self,
        log_probs: torch.Tensor,          # Current policy log-probs
        old_log_probs: torch.Tensor,      # Old policy log-probs
        advantages: torch.Tensor,         # GAE advantages
        action_mask: Optional[torch.Tensor] = None,
        rollout_log_probs: Optional[torch.Tensor] = None,
    ) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor]:
        """Returns (loss, clip_ratio, ppo_kl, vllm_kl)."""
```
Import
```python
from openrlhf.models import PolicyLoss
```
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| log_probs | Tensor | Yes | Current policy action log-probs |
| old_log_probs | Tensor | Yes | Old (rollout) policy log-probs |
| advantages | Tensor | Yes | GAE advantage estimates |
| action_mask | Tensor | No | Binary mask for action tokens |
| rollout_log_probs | Tensor | No | Rollout (vLLM) policy log-probs for IS correction |
Outputs
| Name | Type | Description |
|---|---|---|
| loss | Tensor | Scalar policy loss |
| clip_ratio | Tensor | Fraction of clipped updates |
| ppo_kl | Tensor | Approximate KL from old policy |
| vllm_kl | Tensor or None | KL from vLLM rollout policy |
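The `clip_ratio` and `ppo_kl` diagnostics can be computed as in the following sketch (a hypothetical helper, not the library's code; `ppo_kl` here uses the simple k1 estimator `E[old_log_probs - log_probs]`):

```python
import torch

def clip_diagnostics(log_probs, old_log_probs,
                     clip_eps_low=0.2, clip_eps_high=0.2, action_mask=None):
    ratio = (log_probs - old_log_probs).exp()
    # Fraction of tokens whose ratio left the trust region: the "clip ratio".
    clipped = (ratio < 1 - clip_eps_low) | (ratio > 1 + clip_eps_high)
    # k1 estimator of KL(old || current): mean of (old_log_probs - log_probs).
    kl = old_log_probs - log_probs
    if action_mask is not None:
        mask = action_mask.float()
        return ((clipped.float() * mask).sum() / mask.sum(),
                (kl * mask).sum() / mask.sum())
    return clipped.float().mean(), kl.mean()
```

A rising clip ratio or KL during training is the usual sign that the policy is drifting too far from the rollout policy per update.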
Usage Examples
```python
from openrlhf.models import PolicyLoss

policy_loss_fn = PolicyLoss(
    clip_eps_low=0.2,
    clip_eps_high=0.2,
    dual_clip=None,
)
loss, clip_ratio, ppo_kl, vllm_kl = policy_loss_fn(
    log_probs=current_log_probs,
    old_log_probs=old_log_probs,
    advantages=advantages,
    action_mask=action_mask,
)
```