
Implementation:Allenai Open instruct DPO Loss Function

From Leeroopedia


Component Type: Function
Source: open_instruct/dpo_utils.py (Lines 608-649)
Repository: Open Instruct
Dependencies: torch, torch.nn.functional
Last Updated: 2026-02-07 00:00 GMT

Overview

Concrete tool for computing the standard Direct Preference Optimization loss from policy and reference model log-probabilities, provided by the Open Instruct library.

Description

dpo_loss() implements the core DPO loss function as described in "Direct Preference Optimization: Your Language Model is Secretly a Reward Model" (Rafailov et al., 2023). Given log-probabilities from both the policy model and reference model for chosen and rejected responses, it computes:

  1. Log-ratios: The difference between policy chosen and rejected log-probabilities (pi_logratios), and similarly for the reference model (ref_logratios).
  2. Logits: The difference pi_logratios - ref_logratios, representing the relative preference of the policy over the reference.
  3. Loss: Applies the sigmoid loss with optional label smoothing: -logsigmoid(beta * logits) * (1 - label_smoothing) - logsigmoid(-beta * logits) * label_smoothing.
  4. Implicit rewards: Computes detached reward metrics for monitoring: beta * (policy_logps - reference_logps) for both chosen and rejected.

The reference_free option sets the reference log-ratios to zero, effectively using a uniform reference policy.
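For intuition, the four steps above can be sketched in plain PyTorch. This is a minimal reimplementation for illustration under the description above, not the library's exact code; the actual implementation lives in open_instruct/dpo_utils.py.

```python
import torch
import torch.nn.functional as F


def dpo_loss_sketch(
    policy_chosen_logps: torch.Tensor,
    policy_rejected_logps: torch.Tensor,
    reference_chosen_logps: torch.Tensor,
    reference_rejected_logps: torch.Tensor,
    beta: float,
    reference_free: bool = False,
    label_smoothing: float = 0.0,
) -> tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
    # 1. Log-ratios for policy and reference.
    pi_logratios = policy_chosen_logps - policy_rejected_logps
    ref_logratios = reference_chosen_logps - reference_rejected_logps

    # reference_free: act as if the reference policy were uniform.
    if reference_free:
        ref_logratios = torch.zeros_like(ref_logratios)

    # 2. Logits: relative preference of policy over reference.
    logits = pi_logratios - ref_logratios

    # 3. Sigmoid loss with optional label smoothing.
    losses = (
        -F.logsigmoid(beta * logits) * (1 - label_smoothing)
        - F.logsigmoid(-beta * logits) * label_smoothing
    )

    # 4. Detached implicit rewards for monitoring.
    chosen_rewards = beta * (policy_chosen_logps - reference_chosen_logps).detach()
    rejected_rewards = beta * (policy_rejected_logps - reference_rejected_logps).detach()
    return losses, chosen_rewards, rejected_rewards
```

A quick sanity check: when policy and reference agree exactly, the logits are zero and each per-example loss reduces to -logsigmoid(0) = log 2 ≈ 0.693.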

Usage

Import and call dpo_loss() when computing the standard DPO or DPO-norm loss within a training loop. For SimPO or WPO, use the dedicated simpo_loss() or wpo_loss() functions instead, or use the higher-level compute_loss() dispatcher.

Code Reference

Source Location

  • Repository: Open Instruct
  • File: open_instruct/dpo_utils.py (Lines 608-649)

Signature

def dpo_loss(
    policy_chosen_logps: torch.Tensor,
    policy_rejected_logps: torch.Tensor,
    reference_chosen_logps: torch.Tensor,
    reference_rejected_logps: torch.Tensor,
    beta: float,
    reference_free: bool = False,
    label_smoothing: float = 0.0,
) -> tuple[torch.Tensor, torch.Tensor, torch.Tensor]:

Import

from open_instruct.dpo_utils import dpo_loss

I/O Contract

Inputs

  • policy_chosen_logps (torch.Tensor): Log-probabilities of the policy model for chosen responses. Shape: (batch_size,).
  • policy_rejected_logps (torch.Tensor): Log-probabilities of the policy model for rejected responses. Shape: (batch_size,).
  • reference_chosen_logps (torch.Tensor): Log-probabilities of the reference model for chosen responses. Shape: (batch_size,).
  • reference_rejected_logps (torch.Tensor): Log-probabilities of the reference model for rejected responses. Shape: (batch_size,).
  • beta (float): Temperature parameter, typically in the range 0.1 to 0.5. Higher values make the loss more sensitive to preference differences.
  • reference_free (bool): If True, ignores the reference model and uses a uniform reference (sets the reference log-ratios to 0). Default: False.
  • label_smoothing (float): Label smoothing parameter in [0, 1). Default: 0.0 (no smoothing).
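To see how beta and label_smoothing shape the per-example loss, here is a small numeric illustration. The logit value is made up; it stands in for pi_logratios - ref_logratios.

```python
import torch
import torch.nn.functional as F

# Hypothetical logit: policy prefers chosen over rejected more than the reference does.
logit = torch.tensor(2.0)

results = {}
for beta in (0.1, 0.5):
    for eps in (0.0, 0.1):
        # Sigmoid loss with label smoothing, as in the Description section.
        loss = (
            -F.logsigmoid(beta * logit) * (1 - eps)
            - F.logsigmoid(-beta * logit) * eps
        )
        results[(beta, eps)] = loss.item()
        print(f"beta={beta}, label_smoothing={eps}: loss={loss.item():.4f}")
```

For a positive logit (policy already agrees with the preference), a larger beta lowers the loss, while label smoothing raises it, pulling updates toward a more conservative objective.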

Outputs

  • losses (torch.Tensor): Per-example DPO losses. Shape: (batch_size,).
  • chosen_rewards (torch.Tensor): Implicit rewards for chosen responses (detached). Shape: (batch_size,).
  • rejected_rewards (torch.Tensor): Implicit rewards for rejected responses (detached). Shape: (batch_size,).
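The detached reward outputs are typically used for training-time monitoring. A common sketch computes reward accuracy (how often the chosen response receives the higher implicit reward) and reward margin; the tensor values below are hypothetical stand-ins for real dpo_loss outputs.

```python
import torch

# Hypothetical chosen/rejected rewards from a dpo_loss call (batch_size=3).
chosen_rewards = torch.tensor([0.02, -0.01, 0.05])
rejected_rewards = torch.tensor([-0.03, 0.01, 0.02])

# Fraction of examples where the chosen response is implicitly preferred.
reward_accuracy = (chosen_rewards > rejected_rewards).float().mean()
# Average gap between chosen and rejected implicit rewards.
reward_margin = (chosen_rewards - rejected_rewards).mean()
```

Both quantities are safe to log every step: the rewards are detached, so no extra graph is retained.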

Usage Examples

import torch
from open_instruct.dpo_utils import dpo_loss

# Example with batch_size=4; policy logps require grad so the loss can backpropagate
policy_chosen = torch.tensor([-1.2, -0.8, -1.5, -0.9], requires_grad=True)
policy_rejected = torch.tensor([-2.1, -1.5, -1.8, -2.0], requires_grad=True)
ref_chosen = torch.tensor([-1.4, -1.0, -1.6, -1.1])
ref_rejected = torch.tensor([-1.9, -1.3, -1.7, -1.8])

losses, chosen_rewards, rejected_rewards = dpo_loss(
    policy_chosen_logps=policy_chosen,
    policy_rejected_logps=policy_rejected,
    reference_chosen_logps=ref_chosen,
    reference_rejected_logps=ref_rejected,
    beta=0.1,
    label_smoothing=0.0,
)

# losses: per-example DPO losses (batch_size,)
# chosen_rewards: implicit rewards for chosen (batch_size,)
# rejected_rewards: implicit rewards for rejected (batch_size,)
mean_loss = losses.mean()
mean_loss.backward()

Related Pages

Implements Principle
