Implementation: Preference Loss (Eric Mitchell's direct-preference-optimization)
| Knowledge Sources | |
|---|---|
| Domains | Preference_Optimization, Loss_Functions, NLP |
| Last Updated | 2026-02-08 02:00 GMT |
Overview
A concrete function for computing the DPO/cDPO/IPO preference loss, provided by the direct-preference-optimization repository.
Description
The preference_loss function is the core loss computation of the DPO algorithm. It takes log probabilities from both a policy model and a reference model on chosen and rejected responses, and computes the DPO loss along with implicit reward estimates. It supports three loss variants: standard DPO (Eq. 7 of the paper), conservative DPO (cDPO) with label smoothing, and IPO (Identity Preference Optimization).
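All three variants are built from the same quantity: the policy's chosen-versus-rejected log-ratio minus the reference model's log-ratio, together with beta-scaled implicit rewards. A minimal sketch of that shared computation on toy tensors (an illustration of the quantities involved, not the repository's verbatim code):

import torch

# Toy per-example log probabilities, standing in for concatenated_forward outputs
policy_chosen_logps = torch.tensor([-12.0, -15.0])
policy_rejected_logps = torch.tensor([-14.0, -13.5])
reference_chosen_logps = torch.tensor([-13.0, -15.5])
reference_rejected_logps = torch.tensor([-13.5, -14.0])
beta = 0.1

# How much more the policy prefers chosen over rejected, relative to the reference
pi_logratios = policy_chosen_logps - policy_rejected_logps
ref_logratios = reference_chosen_logps - reference_rejected_logps
logits = pi_logratios - ref_logratios  # shape: (batch_size,)

# Implicit rewards, matching the Outputs table below
chosen_rewards = beta * (policy_chosen_logps - reference_chosen_logps)
rejected_rewards = beta * (policy_rejected_logps - reference_rejected_logps)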
Usage
Import this function when computing the preference-based training loss during DPO or IPO training. It is called within get_batch_metrics after obtaining log probabilities from concatenated_forward for both the policy and reference models, as sketched below.
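A sketch of that call site follows. The trainer's actual get_batch_metrics also handles metric logging and loss configuration, and the assumption that concatenated_forward returns a (chosen_logps, rejected_logps) pair is illustrative rather than a guaranteed interface:

import torch
from trainers import preference_loss

def batch_loss_sketch(policy, reference_model, batch, concatenated_forward, beta=0.1):
    # Policy forward pass over the chosen and rejected responses
    policy_chosen_logps, policy_rejected_logps = concatenated_forward(policy, batch)

    # The reference model is frozen, so no gradients are needed here
    with torch.no_grad():
        reference_chosen_logps, reference_rejected_logps = concatenated_forward(reference_model, batch)

    losses, chosen_rewards, rejected_rewards = preference_loss(
        policy_chosen_logps,
        policy_rejected_logps,
        reference_chosen_logps,
        reference_rejected_logps,
        beta=beta,
    )
    reward_accuracy = (chosen_rewards > rejected_rewards).float().mean()
    return losses.mean(), reward_accuracy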
Code Reference
Source Location
- Repository: direct-preference-optimization
- File: trainers.py
- Lines: 45-87
Signature
def preference_loss(
policy_chosen_logps: torch.FloatTensor,
policy_rejected_logps: torch.FloatTensor,
reference_chosen_logps: torch.FloatTensor,
reference_rejected_logps: torch.FloatTensor,
beta: float,
label_smoothing: float = 0.0,
ipo: bool = False,
reference_free: bool = False,
) -> Tuple[torch.FloatTensor, torch.FloatTensor, torch.FloatTensor]:
"""Compute the DPO loss for a batch of policy and reference model log probabilities.
Args:
policy_chosen_logps: Log probabilities of the policy model for the chosen responses. Shape: (batch_size,)
policy_rejected_logps: Log probabilities of the policy model for the rejected responses. Shape: (batch_size,)
reference_chosen_logps: Log probabilities of the reference model for the chosen responses. Shape: (batch_size,)
reference_rejected_logps: Log probabilities of the reference model for the rejected responses. Shape: (batch_size,)
beta: Temperature parameter for the DPO loss, typically in range 0.1 to 0.5.
label_smoothing: Conservative DPO noise parameter; fraction of preferences assumed flipped.
ipo: If True, use IPO loss instead of DPO loss.
reference_free: If True, ignore reference model (use uniform reference).
Returns:
A tuple of (losses, chosen_rewards, rejected_rewards).
"""
Import
from trainers import preference_loss
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| policy_chosen_logps | torch.FloatTensor | Yes | Log probs of policy model on chosen responses. Shape: (batch_size,) |
| policy_rejected_logps | torch.FloatTensor | Yes | Log probs of policy model on rejected responses. Shape: (batch_size,) |
| reference_chosen_logps | torch.FloatTensor | Yes | Log probs of reference model on chosen responses. Shape: (batch_size,) |
| reference_rejected_logps | torch.FloatTensor | Yes | Log probs of reference model on rejected responses. Shape: (batch_size,) |
| beta | float | Yes | DPO temperature parameter (typically 0.1 to 0.5) |
| label_smoothing | float | No | Conservative DPO noise (default 0.0; range 0 to 0.5) |
| ipo | bool | No | Use IPO loss variant (default False) |
| reference_free | bool | No | Ignore reference model, use uniform reference (default False) |
Outputs
| Name | Type | Description |
|---|---|---|
| losses | torch.FloatTensor | Per-example DPO loss. Shape: (batch_size,) |
| chosen_rewards | torch.FloatTensor | Implicit reward for chosen responses: beta * (policy_logp - ref_logp). Shape: (batch_size,) |
| rejected_rewards | torch.FloatTensor | Implicit reward for rejected responses: beta * (policy_logp - ref_logp). Shape: (batch_size,) |
Usage Examples
Standard DPO Loss
from trainers import preference_loss
# After obtaining log probabilities from concatenated_forward
losses, chosen_rewards, rejected_rewards = preference_loss(
policy_chosen_logps,
policy_rejected_logps,
reference_chosen_logps,
reference_rejected_logps,
beta=0.1,
)
# Compute reward accuracy
reward_accuracies = (chosen_rewards > rejected_rewards).float()
loss = losses.mean()
loss.backward()
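For reference, the loss computed by this call is the standard DPO objective (Eq. 7 of the paper): the negative log-sigmoid of the beta-scaled difference between the policy and reference log-ratios. A sketch of the equivalent computation, continuing with the tensors from the example above (illustrative, not the repository's verbatim code):

import torch.nn.functional as F

beta = 0.1  # same value as passed to preference_loss above
pi_logratios = policy_chosen_logps - policy_rejected_logps
ref_logratios = reference_chosen_logps - reference_rejected_logps
logits = pi_logratios - ref_logratios

# Standard DPO (Eq. 7): push the policy's chosen/rejected margin above the reference's
losses = -F.logsigmoid(beta * logits)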
Conservative DPO (cDPO)
# With label smoothing for noisy preferences
losses, chosen_rewards, rejected_rewards = preference_loss(
policy_chosen_logps,
policy_rejected_logps,
reference_chosen_logps,
reference_rejected_logps,
beta=0.1,
label_smoothing=0.1, # assume 10% of preferences are flipped
)
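With label smoothing, the objective assumes each preference label is flipped with probability label_smoothing and mixes the two log-sigmoid terms accordingly. A sketch of the equivalent computation, reusing logits and beta from the standard DPO sketch above (illustrative, not the repository's verbatim code):

import torch.nn.functional as F

eps = 0.1  # label_smoothing: assumed fraction of flipped preference labels
# Conservative DPO: weight the original and flipped-label losses by (1 - eps) and eps
losses = -(1 - eps) * F.logsigmoid(beta * logits) - eps * F.logsigmoid(-beta * logits)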
IPO Variant
# Identity Preference Optimization
losses, chosen_rewards, rejected_rewards = preference_loss(
policy_chosen_logps,
policy_rejected_logps,
reference_chosen_logps,
reference_rejected_logps,
beta=0.1,
ipo=True,
)
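IPO replaces the log-sigmoid objective with a squared regression toward a target margin of 1 / (2 * beta) on the log-ratio difference, so beta acts as the IPO regularization strength. A sketch of the equivalent computation, reusing logits and beta from the sketches above (illustrative, not the repository's verbatim code):

# IPO: regress the policy-vs-reference log-ratio difference toward 1 / (2 * beta)
losses = (logits - 1 / (2 * beta)) ** 2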