
Implementation: OpenRLHF PolicyLoss

From Leeroopedia


Knowledge Sources
Domains Reinforcement_Learning, Loss_Functions
Last Updated 2026-02-07 00:00 GMT

Overview

A concrete tool provided by OpenRLHF for computing PPO/GSPO policy-gradient losses with clipping.

Description

The PolicyLoss class implements PPO's clipped surrogate objective with extensions for dual-clip PPO, GSPO (sequence-level ratios), and vLLM importance sampling corrections (TIS, ICEPop, seq-mask-TIS). It returns the loss, clip ratio (fraction of clipped updates), and KL divergence estimates.
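The clipped surrogate objective with the dual-clip extension can be sketched as follows. This is an illustrative re-derivation of the standard PPO and dual-clip PPO formulas, not the OpenRLHF implementation; the function name and defaults mirror the signature below but are otherwise assumptions.

```python
import torch

def clipped_policy_loss(log_probs, old_log_probs, advantages,
                        clip_eps_low=0.2, clip_eps_high=0.2, dual_clip=None):
    # Probability ratio between the current and old policies
    ratio = (log_probs - old_log_probs).exp()
    # PPO clipped surrogate: maximize min(r*A, clip(r)*A), so negate for a loss
    surr1 = ratio * advantages
    surr2 = ratio.clamp(1 - clip_eps_low, 1 + clip_eps_high) * advantages
    loss = -torch.min(surr1, surr2)
    if dual_clip is not None:
        # Dual-clip PPO: for negative advantages, additionally bound the loss
        # at -dual_clip * A so a single bad sample cannot dominate the update
        loss = torch.where(advantages < 0,
                           torch.min(loss, -dual_clip * advantages),
                           loss)
    return loss.mean()
```

With `dual_clip=3.0` and a negative advantage, the per-token loss is capped at `3 * |A|` even when the probability ratio has drifted far from 1.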

Usage

Instantiated by the PPO trainer and called at each training step with the current and old log-probabilities, advantages, and action masks.

Code Reference

Source Location

  • Repository: OpenRLHF
  • File: openrlhf/models/loss.py
  • Lines: L75-182

Signature

class PolicyLoss(nn.Module):
    def __init__(
        self,
        clip_eps_low: float = 0.2,            # Lower clip epsilon
        clip_eps_high: float = 0.2,           # Upper clip epsilon
        dual_clip: float = None,              # Dual-clip threshold (None = disabled)
        token_level_loss: bool = True,        # Token vs sequence level
        policy_loss_type: str = "ppo",        # "ppo" or "gspo"
        enable_vllm_is_correction: bool = False,
        vllm_is_truncated_threshold: list = None,
        vllm_is_correction_type: str = "tis", # "tis", "icepop", "seq-mask-tis"
    ) -> None:

    def forward(
        self,
        log_probs: torch.Tensor,              # Current policy log-probs
        old_log_probs: torch.Tensor,          # Old policy log-probs
        advantages: torch.Tensor,             # GAE advantages
        action_mask: Optional[torch.Tensor] = None,
        rollout_log_probs: Optional[torch.Tensor] = None,
    ) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor]:
        """Returns (loss, clip_ratio, ppo_kl, vllm_kl)"""

Import

from openrlhf.models import PolicyLoss

I/O Contract

Inputs

Name Type Required Description
log_probs Tensor Yes Current policy action log-probs
old_log_probs Tensor Yes Old (rollout) policy log-probs
advantages Tensor Yes GAE advantage estimates
action_mask Tensor No Binary mask for action tokens
rollout_log_probs Tensor No vLLM rollout log-probs for importance sampling correction

Outputs

Name Type Description
loss Tensor Scalar policy loss
clip_ratio Tensor Fraction of clipped updates
ppo_kl Tensor Approximate KL from old policy
vllm_kl Tensor or None KL from vLLM rollout policy
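The `clip_ratio` and `ppo_kl` diagnostics are typically computed as below. This is a sketch of the conventional definitions (clipped-branch fraction, and the simple k1 KL estimator), assumed rather than taken from the OpenRLHF source:

```python
import torch

def clip_stats(log_probs, old_log_probs, advantages,
               clip_eps_low=0.2, clip_eps_high=0.2):
    ratio = (log_probs - old_log_probs).exp()
    surr1 = ratio * advantages
    surr2 = ratio.clamp(1 - clip_eps_low, 1 + clip_eps_high) * advantages
    # clip_ratio: fraction of tokens where the clipped surrogate is binding
    clip_ratio = (surr2 < surr1).float().mean()
    # ppo_kl: the simple k1 estimator, mean of (old - new) log-probs
    ppo_kl = (old_log_probs - log_probs).mean()
    return clip_ratio, ppo_kl
```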

Usage Examples

from openrlhf.models import PolicyLoss

policy_loss_fn = PolicyLoss(
    clip_eps_low=0.2,
    clip_eps_high=0.2,
    dual_clip=None,
)

loss, clip_ratio, ppo_kl, vllm_kl = policy_loss_fn(
    log_probs=current_log_probs,
    old_log_probs=old_log_probs,
    advantages=advantages,
    action_mask=action_mask,
)
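When `enable_vllm_is_correction` is set, the loss reweights samples for the mismatch between the trainer's old policy and the vLLM rollout policy. A sketch of the "tis" (truncated importance sampling) variant; the function name and the threshold value are illustrative assumptions, not OpenRLHF defaults:

```python
import torch

def tis_weight(old_log_probs, rollout_log_probs, threshold=2.0):
    # Truncated importance sampling (TIS): ratio between the trainer's old
    # policy and the vLLM rollout policy, truncated above to bound variance
    ratio = (old_log_probs - rollout_log_probs).exp()
    # detach: the correction weight should not contribute gradients
    return ratio.clamp(max=threshold).detach()
```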

Related Pages

Implements Principle

Uses Heuristic
