
Implementation:Eric Mitchell Direct Preference Optimization Preference Loss

From Leeroopedia


Knowledge Sources
Domains Preference_Optimization, Loss_Functions, NLP
Last Updated 2026-02-08 02:00 GMT

Overview

A concrete tool from the direct-preference-optimization repository for computing the DPO/cDPO/IPO preference loss.

Description

The preference_loss function is the core loss computation of the DPO algorithm. Given log probabilities from a policy model and a reference model on chosen and rejected responses, it computes the per-example loss along with implicit reward estimates. It supports three loss variants: standard DPO (Eq. 7 of the DPO paper), conservative DPO (cDPO) with label smoothing, and IPO (Identity Preference Optimization).
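The three variants differ only in how they penalize a single scalar logit: the policy log-ratio minus the reference log-ratio. A minimal pure-Python sketch of the per-example loss (the function name dpo_style_loss is illustrative, not from the repository):

```python
import math

def log_sigmoid(x: float) -> float:
    """log(sigmoid(x)) for a scalar."""
    return -math.log1p(math.exp(-x))

def dpo_style_loss(logit: float, beta: float,
                   label_smoothing: float = 0.0, ipo: bool = False) -> float:
    """Per-example loss on logit = pi_logratio - ref_logratio (sketch)."""
    if ipo:
        # IPO: squared regression of the logit toward 1 / (2 * beta)
        return (logit - 1.0 / (2.0 * beta)) ** 2
    # cDPO mixes the stated preference (weight 1 - eps) with the flipped
    # preference (weight eps); label_smoothing = 0 recovers standard DPO
    return (-(1.0 - label_smoothing) * log_sigmoid(beta * logit)
            - label_smoothing * log_sigmoid(-beta * logit))
```

At logit = 0 the DPO loss equals log 2, and the IPO loss vanishes exactly when the logit equals 1/(2·beta), which is the point the squared regression pulls toward.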

Usage

Import this function when computing the preference-based training loss during DPO or IPO training. It is called within get_batch_metrics after concatenated_forward has produced log probabilities for both the policy and reference models.

Code Reference

Source Location

Signature

def preference_loss(
    policy_chosen_logps: torch.FloatTensor,
    policy_rejected_logps: torch.FloatTensor,
    reference_chosen_logps: torch.FloatTensor,
    reference_rejected_logps: torch.FloatTensor,
    beta: float,
    label_smoothing: float = 0.0,
    ipo: bool = False,
    reference_free: bool = False,
) -> Tuple[torch.FloatTensor, torch.FloatTensor, torch.FloatTensor]:
    """Compute the DPO loss for a batch of policy and reference model log probabilities.

    Args:
        policy_chosen_logps: Log probabilities of the policy model for the chosen responses. Shape: (batch_size,)
        policy_rejected_logps: Log probabilities of the policy model for the rejected responses. Shape: (batch_size,)
        reference_chosen_logps: Log probabilities of the reference model for the chosen responses. Shape: (batch_size,)
        reference_rejected_logps: Log probabilities of the reference model for the rejected responses. Shape: (batch_size,)
        beta: Temperature parameter for the DPO loss, typically in range 0.1 to 0.5.
        label_smoothing: Conservative DPO noise parameter; fraction of preferences assumed flipped.
        ipo: If True, use IPO loss instead of DPO loss.
        reference_free: If True, ignore reference model (use uniform reference).

    Returns:
        A tuple of (losses, chosen_rewards, rejected_rewards).
    """

Import

from trainers import preference_loss

I/O Contract

Inputs

| Name | Type | Required | Description |
|------|------|----------|-------------|
| policy_chosen_logps | torch.FloatTensor | Yes | Log probs of policy model on chosen responses. Shape: (batch_size,) |
| policy_rejected_logps | torch.FloatTensor | Yes | Log probs of policy model on rejected responses. Shape: (batch_size,) |
| reference_chosen_logps | torch.FloatTensor | Yes | Log probs of reference model on chosen responses. Shape: (batch_size,) |
| reference_rejected_logps | torch.FloatTensor | Yes | Log probs of reference model on rejected responses. Shape: (batch_size,) |
| beta | float | Yes | DPO temperature parameter (typically 0.1 to 0.5) |
| label_smoothing | float | No | Conservative DPO noise (default 0.0; range 0 to 0.5) |
| ipo | bool | No | Use IPO loss variant (default False) |
| reference_free | bool | No | Ignore reference model, use uniform reference (default False) |

Outputs

| Name | Type | Description |
|------|------|-------------|
| losses | torch.FloatTensor | Per-example DPO loss. Shape: (batch_size,) |
| chosen_rewards | torch.FloatTensor | Implicit reward for chosen responses: beta * (policy_logp - ref_logp). Shape: (batch_size,) |
| rejected_rewards | torch.FloatTensor | Implicit reward for rejected responses: beta * (policy_logp - ref_logp). Shape: (batch_size,) |
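The reward outputs can be checked by hand. With made-up scalar log probabilities (all numbers illustrative), a policy that has moved probability toward the chosen response and away from the rejected one earns a positive reward margin:

```python
beta = 0.1

# illustrative log probabilities: policy improved on chosen, worsened on rejected
policy_chosen_logp, reference_chosen_logp = -10.0, -12.0
policy_rejected_logp, reference_rejected_logp = -15.0, -13.0

chosen_reward = beta * (policy_chosen_logp - reference_chosen_logp)        # 0.1 *  2.0 =  0.2
rejected_reward = beta * (policy_rejected_logp - reference_rejected_logp)  # 0.1 * -2.0 = -0.2
reward_margin = chosen_reward - rejected_reward                            # 0.4
```

A positive margin means the implicit reward model prefers the chosen response, which is what the reward-accuracy metric in the usage examples below counts.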

Usage Examples

Standard DPO Loss

from trainers import preference_loss

# After obtaining log probabilities from concatenated_forward
losses, chosen_rewards, rejected_rewards = preference_loss(
    policy_chosen_logps,
    policy_rejected_logps,
    reference_chosen_logps,
    reference_rejected_logps,
    beta=0.1,
)

# Compute reward accuracy
reward_accuracies = (chosen_rewards > rejected_rewards).float()
loss = losses.mean()
loss.backward()

Conservative DPO (cDPO)

# With label smoothing for noisy preferences
losses, chosen_rewards, rejected_rewards = preference_loss(
    policy_chosen_logps,
    policy_rejected_logps,
    reference_chosen_logps,
    reference_rejected_logps,
    beta=0.1,
    label_smoothing=0.1,  # assume 10% of preferences are flipped
)

IPO Variant

# Identity Preference Optimization
losses, chosen_rewards, rejected_rewards = preference_loss(
    policy_chosen_logps,
    policy_rejected_logps,
    reference_chosen_logps,
    reference_rejected_logps,
    beta=0.1,
    ipo=True,
)

Related Pages

Implements Principle

Requires Environment

Uses Heuristic
