
Principle:Allenai Open instruct DPO Loss

From Leeroopedia


Knowledge Sources
Domains Machine Learning, Optimization, Preference Learning
Last Updated 2026-02-07 00:00 GMT

Overview

The DPO loss is a preference optimization objective that directly optimizes a language model policy to align with human preferences, bypassing the need to train an explicit reward model.

Description

Direct Preference Optimization (DPO) reformulates the reinforcement learning from human feedback (RLHF) objective into a simple classification-style loss over preference pairs. Given a prompt $x$, a preferred (chosen) response $y_w$, and a dispreferred (rejected) response $y_l$, DPO derives a closed-form loss that implicitly optimizes the same objective as RLHF with a KL-divergence constraint.

The key insight is that the optimal policy under the constrained RLHF objective can be expressed in terms of the reward function and the reference policy. By rearranging this relationship, the reward function can be eliminated entirely, yielding a loss that depends only on the policy and reference model log-probabilities.

Standard DPO Loss:

The standard DPO loss computes the sum of log-probabilities over response tokens, forming log-ratios between the policy and reference models for both chosen and rejected responses, then applies a sigmoid function.
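A minimal scalar sketch of that computation (plain Python rather than the repository's batched PyTorch tensors; function and argument names are illustrative, and β = 0.1 is just a common default):

```python
import math

def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))

def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one preference pair.

    Each *_logp argument is the SUM of per-token log-probabilities of the
    response under the given model.
    """
    chosen_logratio = policy_chosen_logp - ref_chosen_logp        # log pi/pi_ref for y_w
    rejected_logratio = policy_rejected_logp - ref_rejected_logp  # log pi/pi_ref for y_l
    logits = chosen_logratio - rejected_logratio
    return -log_sigmoid(beta * logits)
```

When the policy equals the reference model, both log-ratios are zero, so the loss starts at $\log 2 \approx 0.693$; it falls as the policy separates chosen from rejected responses.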

DPO-Norm:

DPO-Norm is a variant that uses the average log-probability (per token) instead of the sum. This normalizes for response length, preventing the model from preferring shorter responses simply because they have higher summed log-probabilities.
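A small illustration of the length bias, using hypothetical per-token log-probabilities: under summed log-probs a short response can outscore a longer response that is better on every token, while the per-token mean reverses that ranking.

```python
def sequence_logp(token_logps: list[float], average: bool = False) -> float:
    """Aggregate per-token log-probs: sum (standard DPO) or
    per-token mean (DPO-Norm)."""
    total = sum(token_logps)
    return total / len(token_logps) if average else total

short = [-1.0, -1.0]   # 2 tokens: sum -2.0, mean -1.0
long = [-0.5] * 6      # 6 tokens: sum -3.0, mean -0.5 (better per token)
```

Here `sequence_logp(short) > sequence_logp(long)` even though the long response has higher probability per token; with `average=True` the long response wins, which is the behavior DPO-Norm targets.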

SimPO (Simple Preference Optimization):

SimPO removes the reference model entirely, using only the policy's average log-probabilities. It introduces a margin term γ to create a target gap between chosen and rejected response scores. This eliminates the need to cache or compute reference model logprobs.
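A scalar sketch of the SimPO objective (names and the β, γ defaults are illustrative, not the repository's configuration):

```python
import math

def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))

def simpo_loss(chosen_avg_logp: float, rejected_avg_logp: float,
               beta: float = 2.0, gamma: float = 0.5) -> float:
    """SimPO: reference-free. Inputs are the policy's length-normalized
    (per-token average) log-probs; gamma is the target reward margin.
    -log sigma(beta*(delta - gamma/beta)) == -log sigma(beta*delta - gamma).
    """
    delta = chosen_avg_logp - rejected_avg_logp
    return -log_sigmoid(beta * delta - gamma)
```

The loss crosses $\log 2$ exactly when the average-log-prob gap equals $\gamma/\beta$, so γ sets how large a margin counts as "good enough".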

WPO (Weighted Preference Optimization):

WPO extends DPO by weighting the loss based on the policy model's confidence. The weight is computed from the average log-probabilities of both chosen and rejected responses, clamped to [0, 1]. This weights preference pairs by how likely they are under the current policy, approximating on-policy learning on off-policy preference data.
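A sketch of the weighting, under one plausible reading of the description above: the weight is taken as the exponential of the summed average log-probs (a sequence-probability proxy), clamped to [0, 1]. The exact combination rule is an assumption here.

```python
import math

def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))

def wpo_loss(dpo_logits: float,
             chosen_avg_logp: float, rejected_avg_logp: float,
             beta: float = 0.1) -> float:
    """DPO loss scaled by a policy-confidence weight.

    ASSUMPTION: weight = exp(avg_logp_w + avg_logp_l), clamped to [0, 1];
    average log-probs are <= 0, so exp() is already <= 1 in the usual case.
    """
    weight = min(math.exp(chosen_avg_logp + rejected_avg_logp), 1.0)
    return weight * -log_sigmoid(beta * dpo_logits)
```

Pairs the policy assigns high probability to get weights near 1; improbable (strongly off-policy) pairs are downweighted toward 0.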

Label Smoothing:

All variants support label smoothing, which softens the binary preference signal by mixing in a small probability of the "wrong" preference direction:

$\mathcal{L} = -(1-\epsilon)\,\log\sigma(\beta \cdot \text{logits}) \;-\; \epsilon\,\log\sigma(-\beta \cdot \text{logits})$

where ϵ is the label smoothing parameter.
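The smoothed loss above can be sketched directly (scalar Python, illustrative names; `logits` is the chosen-minus-rejected log-ratio difference from the standard loss):

```python
import math

def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))

def dpo_loss_smoothed(logits: float, beta: float = 0.1, eps: float = 0.1) -> float:
    """DPO loss with label smoothing: the preference label is treated as
    correct with probability 1 - eps and flipped with probability eps.
    eps = 0 recovers the standard DPO loss."""
    return (-(1 - eps) * log_sigmoid(beta * logits)
            - eps * log_sigmoid(-beta * logits))
```

At `logits = 0` the loss is $\log 2$ for any ε, and the definition is symmetric: smoothing by ε on one preference direction equals smoothing by 1 − ε on the flipped pair.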

Usage

Use the DPO loss when:

  • You have paired preference data (chosen vs. rejected responses for the same prompt).
  • You want to align a language model without training a separate reward model.
  • You need fine-grained control over the loss variant (standard, normalized, reference-free, or weighted).

Theoretical Basis

DPO Derivation:

Starting from the constrained RLHF objective:

$\max_{\pi}\; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(\cdot|x)}\left[r(x,y)\right] \;-\; \beta\,\mathrm{KL}\!\left[\pi(\cdot|x) \,\|\, \pi_{\text{ref}}(\cdot|x)\right]$

the optimal policy is:

$\pi^{*}(y|x) = \frac{1}{Z(x)}\,\pi_{\text{ref}}(y|x)\exp\!\left(\frac{r(x,y)}{\beta}\right)$

Solving for the reward:

$r(x,y) = \beta\log\frac{\pi^{*}(y|x)}{\pi_{\text{ref}}(y|x)} + \beta\log Z(x)$

Substituting into the Bradley-Terry preference model, $p(y_w \succ y_l \mid x) = \sigma\!\left(r(x,y_w) - r(x,y_l)\right)$, the partition-function terms $\beta\log Z(x)$ cancel, yielding:

$\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\,\mathbb{E}\left[\log\sigma\!\left(\beta\left(\log\frac{\pi_\theta(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \log\frac{\pi_\theta(y_l|x)}{\pi_{\text{ref}}(y_l|x)}\right)\right)\right]$

Implicit Rewards:

The DPO framework also defines implicit reward metrics for monitoring training:

$\text{chosen\_reward} = \beta\left(\log \pi_\theta(y_w|x) - \log \pi_{\text{ref}}(y_w|x)\right)$

$\text{rejected\_reward} = \beta\left(\log \pi_\theta(y_l|x) - \log \pi_{\text{ref}}(y_l|x)\right)$

The reward margin (chosen minus rejected) should increase during training, indicating the policy increasingly prefers the chosen response.
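These diagnostics are straightforward to compute from the same log-probs the loss already uses (scalar sketch; in practice `accuracy` would be the fraction of correctly ranked pairs in a batch):

```python
def implicit_rewards(policy_chosen_logp: float, ref_chosen_logp: float,
                     policy_rejected_logp: float, ref_rejected_logp: float,
                     beta: float = 0.1) -> dict:
    """Implicit DPO reward diagnostics for one preference pair."""
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    return {
        "chosen_reward": chosen_reward,
        "rejected_reward": rejected_reward,
        "margin": chosen_reward - rejected_reward,           # should grow during training
        "accuracy": float(chosen_reward > rejected_reward),  # pair ranked correctly?
    }
```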

SimPO Loss:

$\mathcal{L}_{\text{SimPO}} = -\log\sigma\!\left(\beta\left(\log\pi_\theta(y_w|x) - \log\pi_\theta(y_l|x) - \gamma/\beta\right)\right)$

where $\log\pi_\theta(y|x)$ here denotes the length-normalized (per-token average) log-probability.

Related Pages

Implemented By
