Principle: Allenai Open-Instruct DPO Loss
| Knowledge Sources | |
|---|---|
| Domains | Machine Learning, Optimization, Preference Learning |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
The DPO loss is a preference optimization objective that directly optimizes a language model policy to align with human preferences, bypassing the need to train an explicit reward model.
Description
Direct Preference Optimization (DPO) reformulates the reinforcement learning from human feedback (RLHF) objective into a simple classification-style loss over preference pairs. Given a prompt $x$, a preferred (chosen) response $y_w$, and a dispreferred (rejected) response $y_l$, DPO derives a closed-form loss that implicitly optimizes the same objective as RLHF with a KL-divergence constraint.
The key insight is that the optimal policy under the constrained RLHF objective can be expressed in terms of the reward function and the reference policy. By rearranging this relationship, the reward function can be eliminated entirely, yielding a loss that depends only on the policy and reference model log-probabilities.
Standard DPO Loss:
The standard DPO loss sums log-probabilities over response tokens, forms the log-ratio between the policy and reference models for both the chosen and rejected responses, and applies a negative log-sigmoid to the $\beta$-scaled difference of the two log-ratios.
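The computation above can be sketched in plain Python for a single preference pair (real trainers operate on batched PyTorch tensors; the function and argument names here are illustrative, not the open-instruct API):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair.

    Each argument is the *summed* log-probability of the response
    tokens under the policy or the frozen reference model.
    """
    # Log-ratio of chosen vs. rejected under each model
    policy_logratio = policy_chosen_logp - policy_rejected_logp
    ref_logratio = ref_chosen_logp - ref_rejected_logp
    # Negative log-sigmoid of the beta-scaled difference
    return -math.log(sigmoid(beta * (policy_logratio - ref_logratio)))
```

When the policy and reference models agree, the loss is $\log 2 \approx 0.693$; it decreases as the policy shifts relative probability mass toward the chosen response.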
DPO-Norm:
DPO-Norm is a variant that uses the average log-probability (per token) instead of the sum. This normalizes for response length, preventing the model from preferring shorter responses simply because they have higher summed log-probabilities.
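A toy numeric example (all values invented for illustration) of why summed log-probabilities bias toward shorter responses, and how per-token averaging removes that bias:

```python
# Per-token log-probs for two hypothetical responses
short_resp = [-1.0] * 5    # 5 tokens:  sum = -5.0,  avg = -1.0
long_resp  = [-0.5] * 20   # 20 tokens: sum = -10.0, avg = -0.5

sum_short, sum_long = sum(short_resp), sum(long_resp)
avg_short = sum(short_resp) / len(short_resp)
avg_long  = sum(long_resp) / len(long_resp)

# Summed log-prob prefers the short response purely because of length...
assert sum_short > sum_long
# ...while the per-token average (DPO-Norm) prefers the long response,
# whose tokens are individually more likely.
assert avg_long > avg_short
```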
SimPO (Simple Preference Optimization):
SimPO removes the reference model entirely, using only the policy's average log-probabilities. It introduces a margin term to create a target gap between chosen and rejected response scores. This eliminates the need to compute or cache reference-model log-probabilities.
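A minimal single-pair sketch of this reference-free objective (names and default hyperparameters are illustrative, not the open-instruct defaults):

```python
import math

def simpo_loss(policy_chosen_logp, chosen_len,
               policy_rejected_logp, rejected_len,
               beta=2.0, gamma=0.5):
    """SimPO loss for one pair: reference-free, length-normalized,
    with a target margin gamma between chosen and rejected scores."""
    # Length-normalized (average per-token) log-probabilities
    avg_chosen = policy_chosen_logp / chosen_len
    avg_rejected = policy_rejected_logp / rejected_len
    # Beta-scaled gap minus the target margin, then negative log-sigmoid
    logits = beta * (avg_chosen - avg_rejected) - gamma
    return -math.log(1.0 / (1.0 + math.exp(-logits)))
```

The margin $\gamma$ means the loss is not minimized merely by ranking the chosen response higher; the average-log-prob gap must exceed $\gamma/\beta$.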
WPO (Weighted Preference Optimization):
WPO extends DPO by weighting each pair's loss by the policy model's probability of having generated the responses. The weight is computed from the average log-probabilities of both chosen and rejected responses and clamped to [0, 1]. This focuses the training signal on pairs the current policy is likely to produce itself, simulating on-policy preference learning with off-policy data.
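One plausible reading of that weighting, sketched in plain Python (the exact formula should be checked against the open-instruct source; `wpo_weight` and its inputs are illustrative):

```python
import math

def wpo_weight(avg_chosen_logp, avg_rejected_logp):
    """Confidence weight from the policy's average log-probs of both
    responses, clamped to [0, 1]. Since log-probs are <= 0, the exp
    is already <= 1; the clamp guards numerical edge cases."""
    w = math.exp(avg_chosen_logp + avg_rejected_logp)
    return min(max(w, 0.0), 1.0)
```

The per-pair DPO loss is then multiplied by this weight before averaging over the batch.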
Label Smoothing:
All variants support label smoothing, which softens the binary preference signal by mixing in a small probability of the "wrong" preference direction:

$$\mathcal{L}_{\text{smooth}} = -(1 - \varepsilon)\,\log \sigma(u) - \varepsilon\,\log \sigma(-u)$$

where $\varepsilon$ is the label smoothing parameter and $u$ is the $\beta$-scaled log-ratio difference that each variant feeds into the sigmoid.
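A minimal numeric sketch, assuming the common convention $\mathcal{L} = -(1-\varepsilon)\log\sigma(u) - \varepsilon\log\sigma(-u)$ where $u$ is the sigmoid argument of the chosen variant (argument names illustrative):

```python
import math

def smoothed_dpo_loss(logits, eps=0.1):
    """DPO loss with label smoothing applied to the beta-scaled
    log-ratio difference `logits`. eps=0 recovers the standard loss."""
    def logsigmoid(x):
        # numerically naive log-sigmoid; fine for illustration
        return -math.log(1.0 + math.exp(-x))
    return -(1.0 - eps) * logsigmoid(logits) - eps * logsigmoid(-logits)
```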
Usage
Use the DPO loss when:
- You have paired preference data (chosen vs. rejected responses for the same prompt).
- You want to align a language model without training a separate reward model.
- You need fine-grained control over the loss variant (standard, normalized, reference-free, or weighted).
Theoretical Basis
DPO Derivation:
Starting from the constrained RLHF objective:

$$\max_{\pi} \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(\cdot|x)}\left[r(x, y)\right] - \beta\, \mathbb{D}_{\text{KL}}\left[\pi(y|x)\,\|\,\pi_{\text{ref}}(y|x)\right]$$

the optimal policy is:

$$\pi^*(y|x) = \frac{1}{Z(x)}\, \pi_{\text{ref}}(y|x)\, \exp\!\left(\frac{1}{\beta}\, r(x, y)\right)$$

Solving for the reward:

$$r(x, y) = \beta \log \frac{\pi^*(y|x)}{\pi_{\text{ref}}(y|x)} + \beta \log Z(x)$$

Substituting into the Bradley-Terry preference model and noting that $\beta \log Z(x)$ cancels:

$$\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \beta \log \frac{\pi_\theta(y_l|x)}{\pi_{\text{ref}}(y_l|x)}\right)\right]$$
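The cancellation of the log-partition term can be checked numerically: adding the same $\beta \log Z(x)$ to both rewards leaves the Bradley-Terry preference probability unchanged (all numbers below are arbitrary):

```python
import math

def sigma(x):
    return 1.0 / (1.0 + math.exp(-x))

beta, log_Z = 0.1, 3.7  # arbitrary values for the check

def reward(policy_logp, ref_logp, include_Z):
    # r(x, y) = beta * log(pi / pi_ref)  (+ beta * log Z(x))
    r = beta * (policy_logp - ref_logp)
    return r + beta * log_Z if include_Z else r

# log-probs of chosen/rejected under policy and reference (made up)
pw, rw, pl, rl = -5.0, -6.0, -7.0, -6.5

with_Z    = sigma(reward(pw, rw, True)  - reward(pl, rl, True))
without_Z = sigma(reward(pw, rw, False) - reward(pl, rl, False))
assert abs(with_Z - without_Z) < 1e-12
```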
Implicit Rewards:
The DPO framework also defines implicit reward metrics for monitoring training:
$$\text{chosen\_reward} = \beta \left(\log \pi_\theta(y_w|x) - \log \pi_{\text{ref}}(y_w|x)\right)$$

$$\text{rejected\_reward} = \beta \left(\log \pi_\theta(y_l|x) - \log \pi_{\text{ref}}(y_l|x)\right)$$
The reward margin (chosen minus rejected) should increase during training, indicating the policy increasingly prefers the chosen response.
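These monitoring quantities are straightforward to compute from the four log-probabilities already needed by the loss (a sketch; names illustrative):

```python
def implicit_rewards(policy_chosen_logp, ref_chosen_logp,
                     policy_rejected_logp, ref_rejected_logp, beta=0.1):
    """Implicit DPO rewards and their margin, for logging."""
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    return chosen_reward, rejected_reward, chosen_reward - rejected_reward
```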
SimPO Loss:

$$\mathcal{L}_{\text{SimPO}} = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\log \sigma\!\left(\frac{\beta}{|y_w|} \log \pi_\theta(y_w|x) - \frac{\beta}{|y_l|} \log \pi_\theta(y_l|x) - \gamma\right)\right]$$

where $\frac{1}{|y|} \log \pi_\theta(y|x)$ denotes the length-normalized average log-probability and $\gamma$ is the target reward margin.