
Implementation:OpenRLHF OpenRLHF DPOLoss

From Leeroopedia


Knowledge Sources
Domains: Alignment, Loss_Functions
Last Updated: 2026-02-07 00:00 GMT

Overview

Concrete tool, provided by OpenRLHF, for computing the DPO, cDPO, and IPO preference losses.

Description

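The DPOLoss class implements three preference-loss variants: standard DPO, conservative DPO (cDPO, which adds label smoothing), and IPO. It takes the policy and reference log-probabilities of the chosen and rejected responses, computes the log-ratio margin (the policy's chosen-minus-rejected log-probability difference minus the same difference under the reference model), and applies the selected loss to that margin. It also returns the implicit rewards of the chosen and rejected responses, which are useful for monitoring training.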
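For reference, the three variants are usually written as below. This is a sketch of the published formulations, not a transcription of loss.py. The symbols are assumptions introduced here: y_w and y_l denote the chosen and rejected responses for prompt x, h is the log-ratio margin, β corresponds to beta, and ε corresponds to label_smoothing.

```latex
\begin{align*}
h &= \bigl[\log \pi_\theta(y_w \mid x) - \log \pi_\theta(y_l \mid x)\bigr]
   - \bigl[\log \pi_{\mathrm{ref}}(y_w \mid x) - \log \pi_{\mathrm{ref}}(y_l \mid x)\bigr] \\
\mathcal{L}_{\mathrm{DPO}} &= -\log \sigma(\beta h) \\
\mathcal{L}_{\mathrm{cDPO}} &= -(1 - \varepsilon)\,\log \sigma(\beta h) - \varepsilon\,\log \sigma(-\beta h) \\
\mathcal{L}_{\mathrm{IPO}} &= \Bigl(h - \tfrac{1}{2\beta}\Bigr)^{2}
\end{align*}
```

Setting ε = 0 in the cDPO expression recovers standard DPO, which matches the "0 = standard DPO" default of label_smoothing.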

Usage

Instantiated by DPOTrainer with the beta, label_smoothing, and ipo parameters. It is not typically used directly.

Code Reference

Source Location

  • Repository: OpenRLHF
  • File: openrlhf/models/loss.py
  • Lines: L246-281

Signature

from typing import Tuple

import torch
import torch.nn as nn


class DPOLoss(nn.Module):
    def __init__(
        self,
        beta: float,                   # DPO regularization coefficient
        label_smoothing: float = 0.0,  # cDPO label smoothing (0 = standard DPO)
        ipo: bool = False,             # Use the IPO squared loss
    ) -> None:
        ...  # body omitted; see Source Location

    def forward(
        self,
        policy_chosen_logps: torch.Tensor,       # Policy log-probs for chosen
        policy_rejected_logps: torch.Tensor,     # Policy log-probs for rejected
        reference_chosen_logps: torch.Tensor,    # Reference log-probs for chosen
        reference_rejected_logps: torch.Tensor,  # Reference log-probs for rejected
    ) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
        """Returns (loss, chosen_rewards, rejected_rewards)."""
        ...  # body omitted; see Source Location

Import

from openrlhf.models import DPOLoss

I/O Contract

Inputs

| Name | Type | Required | Description |
|---|---|---|---|
| policy_chosen_logps | Tensor | Yes | Sum of log-probs for chosen responses under the policy |
| policy_rejected_logps | Tensor | Yes | Sum of log-probs for rejected responses under the policy |
| reference_chosen_logps | Tensor | Yes | Sum of log-probs for chosen responses under the reference |
| reference_rejected_logps | Tensor | Yes | Sum of log-probs for rejected responses under the reference |
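Each input is a per-sequence sum of token log-probabilities. A minimal sketch of that reduction follows, assuming the prompt tokens are excluded with a 0/1 mask. The helper name sequence_logp and the mask convention are hypothetical, plain Python lists stand in for tensors, and the way OpenRLHF's trainer actually builds these sums may differ.

```python
from typing import Sequence


def sequence_logp(token_logps: Sequence[float], response_mask: Sequence[int]) -> float:
    """Sum per-token log-probs over response positions only (hypothetical helper)."""
    if len(token_logps) != len(response_mask):
        raise ValueError("token_logps and response_mask must have the same length")
    # Keep a token's log-prob only where the mask marks it as part of the response.
    return sum(lp for lp, keep in zip(token_logps, response_mask) if keep)


# Illustrative values: two prompt tokens (masked out), then three response tokens.
token_logps = [-0.5, -1.0, -2.0, -0.25, -0.75]
response_mask = [0, 0, 1, 1, 1]
chosen_logp = sequence_logp(token_logps, response_mask)  # -2.0 - 0.25 - 0.75 = -3.0
```

Running the same reduction for the chosen and rejected responses under both the policy and the reference model yields the four values the loss expects.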

Outputs

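| Name | Type | Description |
|---|---|---|
| loss | Tensor | Scalar loss (DPO, cDPO, or IPO, depending on the constructor arguments) |
| chosen_rewards | Tensor | Implicit rewards for chosen responses (beta * log ratio of policy to reference) |
| rejected_rewards | Tensor | Implicit rewards for rejected responses (beta * log ratio of policy to reference) |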
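The implicit rewards lend themselves to simple training diagnostics. The sketch below is one way to use them, assuming the "beta * log ratio" definition from the table above. The helpers implicit_rewards and reward_stats are hypothetical, operate on plain Python lists rather than tensors, and are not part of OpenRLHF.

```python
from typing import List, Sequence, Tuple


def implicit_rewards(
    policy_logps: Sequence[float], reference_logps: Sequence[float], beta: float
) -> List[float]:
    """beta * (policy log-prob - reference log-prob) per sequence (hypothetical helper)."""
    return [beta * (p - r) for p, r in zip(policy_logps, reference_logps)]


def reward_stats(
    chosen_rewards: Sequence[float], rejected_rewards: Sequence[float]
) -> Tuple[float, float]:
    """Return (accuracy, mean margin) over a batch of preference pairs."""
    pairs = list(zip(chosen_rewards, rejected_rewards))
    # Accuracy: fraction of pairs where the chosen response gets the higher reward.
    accuracy = sum(c > r for c, r in pairs) / len(pairs)
    # Mean margin: average reward gap between chosen and rejected responses.
    mean_margin = sum(c - r for c, r in pairs) / len(pairs)
    return accuracy, mean_margin


# Illustrative values only.
beta = 0.5
chosen = implicit_rewards([-10.0, -8.0], [-12.0, -7.0], beta)   # [1.0, -0.5]
rejected = implicit_rewards([-9.0, -6.0], [-8.0, -7.0], beta)   # [-0.5, 0.5]
accuracy, mean_margin = reward_stats(chosen, rejected)          # (0.5, 0.25)
```

An accuracy that rises above 0.5 and a growing mean margin suggest the policy is separating chosen from rejected responses more than the reference does.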

Usage Examples

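import torch
from openrlhf.models import DPOLoss

# Illustrative summed log-probs for a batch of two preference pairs.
policy_chosen_logps = torch.tensor([-12.0, -9.5])
policy_rejected_logps = torch.tensor([-15.0, -9.0])
reference_chosen_logps = torch.tensor([-13.0, -10.0])
reference_rejected_logps = torch.tensor([-14.0, -9.5])

loss_fn = DPOLoss(beta=0.1, label_smoothing=0.0, ipo=False)
loss, chosen_rewards, rejected_rewards = loss_fn(
    policy_chosen_logps,
    policy_rejected_logps,
    reference_chosen_logps,
    reference_rejected_logps,
)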
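To see how the three variants differ without installing OpenRLHF or PyTorch, the loss for a single preference pair can be sketched in plain Python. This is a minimal sketch based on the published DPO, cDPO, and IPO formulas, not a copy of the OpenRLHF source. The function names log_sigmoid and preference_loss are hypothetical, and the numbers are illustrative.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x)) for a single float."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def preference_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    reference_chosen_logp: float,
    reference_rejected_logp: float,
    beta: float,
    label_smoothing: float = 0.0,
    ipo: bool = False,
) -> float:
    """Loss for one preference pair (hypothetical helper, not OpenRLHF code)."""
    # Margin: how much more the policy prefers "chosen" than the reference does.
    margin = (policy_chosen_logp - policy_rejected_logp) - (
        reference_chosen_logp - reference_rejected_logp
    )
    if ipo:
        # IPO: squared distance of the margin from the target 1 / (2 * beta).
        return (margin - 1.0 / (2.0 * beta)) ** 2
    # DPO when label_smoothing == 0, cDPO otherwise.
    return (
        -(1.0 - label_smoothing) * log_sigmoid(beta * margin)
        - label_smoothing * log_sigmoid(-beta * margin)
    )


# Illustrative pair: margin = (-12 - -15) - (-13 - -14) = 3.0 - 1.0 = 2.0
pair = (-12.0, -15.0, -13.0, -14.0)
dpo = preference_loss(*pair, beta=0.1)
cdpo = preference_loss(*pair, beta=0.1, label_smoothing=0.1)
ipo = preference_loss(*pair, beta=0.1, ipo=True)  # (2.0 - 5.0) ** 2 = 9.0
```

With a positive margin, label smoothing raises the loss slightly (here by label_smoothing * beta * margin = 0.02), while IPO penalizes the margin for being far from 1 / (2 * beta) in either direction.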

Related Pages

Implements Principle
