Implementation:OpenRLHF OpenRLHF DPOLoss
| Knowledge Sources | |
|---|---|
| Domains | Alignment, Loss_Functions |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
DPOLoss is the OpenRLHF module that computes the DPO, cDPO, and IPO preference loss functions.
Description
The DPOLoss class implements three preference-loss variants: standard DPO, conservative DPO (cDPO, which adds label smoothing), and IPO. It takes policy and reference log-probabilities for the chosen and rejected responses, computes the log-ratio margin (the policy's chosen-minus-rejected log-probability difference minus the reference's), and applies the selected loss function. It also returns the implicit chosen and rejected rewards for monitoring.
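
For reference, the three variants can be written out as they are usually stated in the DPO, cDPO, and IPO literature. This is a sketch under assumptions: the symbols below are illustrative, mapping ε to label_smoothing is an assumption, and whether DPOLoss matches these expressions term for term should be checked against openrlhf/models/loss.py.

```latex
% h: log-ratio margin for a prompt x with chosen response y_w and rejected response y_l
h = \bigl(\log \pi_\theta(y_w \mid x) - \log \pi_\theta(y_l \mid x)\bigr)
  - \bigl(\log \pi_{\mathrm{ref}}(y_w \mid x) - \log \pi_{\mathrm{ref}}(y_l \mid x)\bigr)

% Standard DPO (label_smoothing = 0, ipo = False)
\mathcal{L}_{\mathrm{DPO}} = -\log \sigma(\beta h)

% Conservative DPO with label smoothing \varepsilon
\mathcal{L}_{\mathrm{cDPO}} = -(1 - \varepsilon)\,\log \sigma(\beta h) - \varepsilon\,\log \sigma(-\beta h)

% IPO squared loss
\mathcal{L}_{\mathrm{IPO}} = \Bigl(h - \frac{1}{2\beta}\Bigr)^{2}

% Implicit rewards returned for monitoring
r_w = \beta\,\bigl(\log \pi_\theta(y_w \mid x) - \log \pi_{\mathrm{ref}}(y_w \mid x)\bigr), \qquad
r_l = \beta\,\bigl(\log \pi_\theta(y_l \mid x) - \log \pi_{\mathrm{ref}}(y_l \mid x)\bigr)
```

Setting ε = 0 in the cDPO expression recovers the standard DPO loss, which is consistent with the signature comment that label_smoothing = 0 means standard DPO.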
Usage
DPOTrainer instantiates DPOLoss with the beta, label_smoothing, and ipo parameters. User code does not typically call it directly.
Code Reference
Source Location
- Repository: OpenRLHF
- File: openrlhf/models/loss.py
- Lines: L246-281
Signature
class DPOLoss(nn.Module):
    def __init__(
        self,
        beta: float,                    # DPO regularization coefficient
        label_smoothing: float = 0.0,   # cDPO label smoothing (0 = standard DPO)
        ipo: bool = False,              # Use IPO squared loss
    ) -> None:
        ...

    def forward(
        self,
        policy_chosen_logps: torch.Tensor,       # Policy log-probs for chosen
        policy_rejected_logps: torch.Tensor,     # Policy log-probs for rejected
        reference_chosen_logps: torch.Tensor,    # Reference log-probs for chosen
        reference_rejected_logps: torch.Tensor,  # Reference log-probs for rejected
    ) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
        """Returns (loss, chosen_rewards, rejected_rewards)"""
Import
from openrlhf.models import DPOLoss
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| policy_chosen_logps | Tensor | Yes | Sum of log-probs for chosen responses under policy |
| policy_rejected_logps | Tensor | Yes | Sum of log-probs for rejected responses under policy |
| reference_chosen_logps | Tensor | Yes | Sum of log-probs for chosen responses under reference |
| reference_rejected_logps | Tensor | Yes | Sum of log-probs for rejected responses under reference |
Outputs
| Name | Type | Description |
|---|---|---|
| loss | Tensor | Scalar loss for the selected variant (DPO, cDPO, or IPO) |
| chosen_rewards | Tensor | Implicit rewards for chosen (beta * log ratio) |
| rejected_rewards | Tensor | Implicit rewards for rejected (beta * log ratio) |
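
The contract above can be illustrated for a single preference pair without PyTorch. The sketch below is not OpenRLHF code: `dpo_pair_loss` and `log_sigmoid` are hypothetical helper names, plain floats stand in for tensors, and the formulas assume the usual DPO, cDPO, and IPO formulations rather than being copied from openrlhf/models/loss.py.

```python
import math
from typing import Tuple


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_pair_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    reference_chosen_logp: float,
    reference_rejected_logp: float,
    beta: float,
    label_smoothing: float = 0.0,
    ipo: bool = False,
) -> Tuple[float, float, float]:
    """Hypothetical scalar analogue of the (loss, chosen, rejected) contract."""
    # Log-ratio margin: policy preference minus reference preference.
    margin = (policy_chosen_logp - policy_rejected_logp) - (
        reference_chosen_logp - reference_rejected_logp
    )
    if ipo:
        # IPO: squared distance between the margin and 1 / (2 * beta).
        loss = (margin - 1.0 / (2.0 * beta)) ** 2
    else:
        # cDPO; label_smoothing = 0 reduces to standard DPO.
        loss = (
            -(1.0 - label_smoothing) * log_sigmoid(beta * margin)
            - label_smoothing * log_sigmoid(-beta * margin)
        )
    # Implicit rewards: beta times the policy/reference log ratio.
    chosen_reward = beta * (policy_chosen_logp - reference_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - reference_rejected_logp)
    return loss, chosen_reward, rejected_reward


if __name__ == "__main__":
    # Placeholder log-probabilities, chosen only for illustration.
    print(dpo_pair_loss(-10.0, -12.0, -11.0, -11.0, beta=0.1))
    print(dpo_pair_loss(-10.0, -12.0, -11.0, -11.0, beta=0.1, ipo=True))
```

When the policy and reference agree exactly, the margin is zero and the standard DPO branch evaluates to log 2, which is a convenient sanity check at the start of training.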
Usage Examples
from openrlhf.models import DPOLoss

loss_fn = DPOLoss(beta=0.1, label_smoothing=0.0, ipo=False)
loss, chosen_rewards, rejected_rewards = loss_fn(
    policy_chosen_logps,
    policy_rejected_logps,
    reference_chosen_logps,
    reference_rejected_logps,
)
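
Since the returned rewards are intended for monitoring, a common way to summarize them is the fraction of pairs in which the chosen reward exceeds the rejected reward, together with the mean reward margin. The sketch below is an assumption about how one might do this, not a description of what DPOTrainer logs; `preference_metrics` is a hypothetical helper that works on plain Python sequences.

```python
from typing import Sequence, Tuple


def preference_metrics(
    chosen_rewards: Sequence[float],
    rejected_rewards: Sequence[float],
) -> Tuple[float, float]:
    """Return (reward accuracy, mean reward margin) for a batch of pairs."""
    if len(chosen_rewards) != len(rejected_rewards) or len(chosen_rewards) == 0:
        raise ValueError("reward sequences must be non-empty and equal in length")
    # Per-pair margin between the chosen and rejected implicit rewards.
    margins = [c - r for c, r in zip(chosen_rewards, rejected_rewards)]
    # Fraction of pairs where the chosen response is ranked above the rejected one.
    accuracy = sum(1 for m in margins if m > 0) / len(margins)
    mean_margin = sum(margins) / len(margins)
    return accuracy, mean_margin


if __name__ == "__main__":
    # Placeholder reward values, chosen only for illustration.
    acc, margin = preference_metrics([0.1, 0.3, -0.2], [-0.1, 0.4, -0.5])
    print(f"reward accuracy={acc:.3f}, mean margin={margin:.3f}")
```

With tensors from the usage example, the inputs could be obtained via `chosen_rewards.tolist()` and `rejected_rewards.tolist()`.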
Related Pages
Implements Principle