Implementation: OpenRLHF PolicyLoss
| Knowledge Sources | |
|---|---|
| Domains | Reinforcement_Learning, Loss_Functions |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
A concrete tool provided by OpenRLHF for computing PPO/GSPO policy-gradient losses with clipping.
Description
The PolicyLoss class implements PPO's clipped surrogate objective with extensions for dual-clip PPO, GSPO (sequence-level ratios), and vLLM importance sampling corrections (TIS, ICEPop, seq-mask-TIS). It returns the loss, clip ratio (fraction of clipped updates), and KL divergence estimates.
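The core of the class is PPO's clipped surrogate objective. A minimal sketch of that computation (a standalone function for illustration, not OpenRLHF's exact implementation, which also handles masking, dual-clip, and IS corrections):

```python
import torch

def clipped_surrogate(log_probs, old_log_probs, advantages,
                      clip_eps_low=0.2, clip_eps_high=0.2):
    # Probability ratio r_t = pi_theta(a|s) / pi_old(a|s), computed in log space.
    ratio = (log_probs - old_log_probs).exp()
    surr1 = ratio * advantages
    # Clamp the ratio to [1 - eps_low, 1 + eps_high] before weighting advantages.
    surr2 = ratio.clamp(1 - clip_eps_low, 1 + clip_eps_high) * advantages
    # PPO maximizes min(surr1, surr2); as a loss we negate it.
    return -torch.min(surr1, surr2).mean()
```

With ratios of 1.5 and 0.5 (both outside the default 0.2 trust region), the clamped branch wins for the over-ratio token and the unclamped branch for the under-ratio token, which is exactly the pessimistic `min` behavior that makes clipping conservative.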
Usage
Instantiated by the PPO trainer. Called each training step with current and old log-probabilities, advantages, and action masks.
Code Reference
Source Location
- Repository: OpenRLHF
- File: openrlhf/models/loss.py
- Lines: L75-182
Signature
```python
class PolicyLoss(nn.Module):
    def __init__(
        self,
        clip_eps_low: float = 0.2,        # Lower clip epsilon
        clip_eps_high: float = 0.2,       # Upper clip epsilon
        dual_clip: float = None,          # Dual-clip threshold (None = disabled)
        token_level_loss: bool = True,    # Token- vs sequence-level loss
        policy_loss_type: str = "ppo",    # "ppo" or "gspo"
        enable_vllm_is_correction: bool = False,
        vllm_is_truncated_threshold: list = None,
        vllm_is_correction_type: str = "tis",  # "tis", "icepop", "seq-mask-tis"
    ) -> None: ...

    def forward(
        self,
        log_probs: torch.Tensor,          # Current policy log-probs
        old_log_probs: torch.Tensor,      # Old policy log-probs
        advantages: torch.Tensor,         # GAE advantages
        action_mask: Optional[torch.Tensor] = None,
        rollout_log_probs: Optional[torch.Tensor] = None,
    ) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor]:
        """Returns (loss, clip_ratio, ppo_kl, vllm_kl)."""
```
Import
```python
from openrlhf.models import PolicyLoss
```
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| log_probs | Tensor | Yes | Current policy action log-probs |
| old_log_probs | Tensor | Yes | Old (rollout) policy log-probs |
| advantages | Tensor | Yes | GAE advantage estimates |
| action_mask | Tensor | No | Binary mask for action tokens |
| rollout_log_probs | Tensor | No | Rollout (vLLM) policy log-probs for IS correction |
Outputs
| Name | Type | Description |
|---|---|---|
| loss | Tensor | Scalar policy loss |
| clip_ratio | Tensor | Fraction of clipped updates |
| ppo_kl | Tensor | Approximate KL from old policy |
| vllm_kl | Tensor or None | KL from vLLM rollout policy |
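The `clip_ratio` and `ppo_kl` diagnostics can be computed as in the following sketch (a hypothetical helper, not the library's code; `ppo_kl` here uses the simple k1 estimator `E[old_log_probs - log_probs]`):

```python
import torch

def clip_diagnostics(log_probs, old_log_probs,
                     clip_eps_low=0.2, clip_eps_high=0.2, action_mask=None):
    ratio = (log_probs - old_log_probs).exp()
    # Fraction of tokens whose ratio left the trust region: the "clip ratio".
    clipped = (ratio < 1 - clip_eps_low) | (ratio > 1 + clip_eps_high)
    # k1 estimator of KL(old || current): mean of (old_log_probs - log_probs).
    kl = old_log_probs - log_probs
    if action_mask is not None:
        mask = action_mask.float()
        return ((clipped.float() * mask).sum() / mask.sum(),
                (kl * mask).sum() / mask.sum())
    return clipped.float().mean(), kl.mean()
```

A rising clip ratio or KL during training is the usual sign that the policy is drifting too far from the rollout policy per update.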
Usage Examples
```python
from openrlhf.models import PolicyLoss

policy_loss_fn = PolicyLoss(
    clip_eps_low=0.2,
    clip_eps_high=0.2,
    dual_clip=None,
)
loss, clip_ratio, ppo_kl, vllm_kl = policy_loss_fn(
    log_probs=current_log_probs,
    old_log_probs=old_log_probs,
    advantages=advantages,
    action_mask=action_mask,
)
```