
Principle:Eric Mitchell Direct Preference Optimization (DPO) Loss

From Leeroopedia


Knowledge Sources
Domains Reinforcement_Learning_from_Human_Feedback, Preference_Optimization, NLP
Last Updated 2026-02-08 02:00 GMT

Overview

A preference optimization objective that directly optimizes a language model policy from human preference data without requiring a separate reward model or reinforcement learning loop.

Description

Direct Preference Optimization (DPO) reformulates the RLHF objective to bypass the reward modeling and RL stages entirely. Instead of first training a reward model on preferences and then optimizing the policy against that reward using PPO, DPO derives a closed-form mapping between the optimal policy and the reward function under a KL-constrained objective. This allows direct optimization of the policy using a simple classification-like loss on preference pairs.

The key insight is that the optimal policy under a KL-divergence constraint from a reference policy can be expressed analytically. By substituting this closed-form solution back into the Bradley-Terry preference model, the reward function cancels out, yielding a loss that depends only on the log probabilities of the policy and reference models on chosen and rejected responses.

DPO also supports two important variants:

  • Conservative DPO (cDPO): Adds label smoothing to handle noisy preference labels, assuming some fraction of preferences are flipped.
  • Identity Preference Optimization (IPO): Replaces the sigmoid loss with a squared-error penalty on the log-ratio margin, providing a different regularization behavior.

Usage

Use this principle when training language models to align with human preferences, particularly when:

  • You have a dataset of (prompt, chosen_response, rejected_response) triples
  • You want to avoid the complexity and instability of PPO-based RLHF
  • You have a pre-trained SFT model to serve as the reference policy
  • You need a simple, stable training objective that can scale to large models
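As a concrete sketch of the expected data format, each training record pairs one prompt with a preferred and a dispreferred response. The field names below ("prompt", "chosen", "rejected") are illustrative, not a required schema:

```python
# Hypothetical preference records; field names are illustrative only.
preference_data = [
    {
        "prompt": "Explain KL divergence in one sentence.",
        "chosen": "KL divergence measures how one probability distribution diverges from another.",
        "rejected": "KL divergence is just the distance between two distributions.",
    },
]

def to_triple(record):
    """Unpack one record into the (x, y_w, y_l) triple the DPO loss consumes."""
    return record["prompt"], record["chosen"], record["rejected"]

x, y_w, y_l = to_triple(preference_data[0])
```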

Theoretical Basis

The DPO loss derives from the constrained optimization problem:

$$\max_{\pi_\theta} \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(y|x)}\big[r(x,y)\big] \;-\; \beta\, \mathbb{D}_{\mathrm{KL}}\big[\pi_\theta(y|x) \,\|\, \pi_{\mathrm{ref}}(y|x)\big]$$

The optimal policy has the closed-form solution:

$$\pi^*(y|x) = \frac{1}{Z(x)}\, \pi_{\mathrm{ref}}(y|x) \exp\!\left(\frac{1}{\beta}\, r(x,y)\right)$$
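Inverting this closed-form solution makes the cancellation explicit: the reward is determined by the policy up to a term that depends only on the prompt, so it drops out of any reward difference for the same prompt.

$$r(x,y) = \beta \log \frac{\pi^*(y|x)}{\pi_{\mathrm{ref}}(y|x)} + \beta \log Z(x)$$

$$r(x,y_w) - r(x,y_l) = \beta \log \frac{\pi^*(y_w|x)}{\pi_{\mathrm{ref}}(y_w|x)} - \beta \log \frac{\pi^*(y_l|x)}{\pi_{\mathrm{ref}}(y_l|x)}$$

Because both completions share the same prompt $x$, the partition term $\beta \log Z(x)$ cancels in the difference.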

Substituting into the Bradley-Terry preference model $p(y_w \succ y_l \mid x) = \sigma\big(r(x, y_w) - r(x, y_l)\big)$ yields the DPO loss (Eq. 7 of the paper):

$$\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w|x)}{\pi_{\mathrm{ref}}(y_w|x)} - \beta \log \frac{\pi_\theta(y_l|x)}{\pi_{\mathrm{ref}}(y_l|x)}\right)\right]$$

Conservative DPO extends this with label smoothing (Eq. 3 of cDPO paper):

$$\mathcal{L}_{\mathrm{cDPO}} = -(1-\epsilon)\log\sigma(\beta h) - \epsilon \log\sigma(-\beta h)$$

where $h = \log \frac{\pi_\theta(y_w|x)}{\pi_{\mathrm{ref}}(y_w|x)} - \log \frac{\pi_\theta(y_l|x)}{\pi_{\mathrm{ref}}(y_l|x)}$ is the log-ratio margin and $\epsilon$ is the label smoothing parameter.

IPO uses a squared-error loss (Eq. 17 of IPO paper):

$$\mathcal{L}_{\mathrm{IPO}} = \left(h - \frac{1}{2\beta}\right)^2$$

Pseudo-code:

# Abstract DPO algorithm (NOT actual implementation)
pi_logratios = log_pi(y_w) - log_pi(y_l)
ref_logratios = log_ref(y_w) - log_ref(y_l)
h = pi_logratios - ref_logratios
loss = -log_sigmoid(beta * h)  # standard DPO
# or: loss = (h - 1/(2*beta))^2  # IPO variant
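The pseudo-code above can be fleshed out into a runnable sketch. This is a minimal NumPy version under our own function and argument names, not the reference implementation:

```python
import numpy as np

def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)) = -log(1 + exp(-x)).
    return -np.logaddexp(0.0, -x)

def preference_loss(policy_chosen_logps, policy_rejected_logps,
                    ref_chosen_logps, ref_rejected_logps,
                    beta=0.1, label_smoothing=0.0, variant="dpo"):
    """Per-example DPO / cDPO / IPO losses.

    Each *_logps argument is the summed log-probability of a whole
    response under the policy or reference model.
    """
    pi_logratios = policy_chosen_logps - policy_rejected_logps
    ref_logratios = ref_chosen_logps - ref_rejected_logps
    h = pi_logratios - ref_logratios  # log-ratio margin
    if variant == "dpo":
        # label_smoothing = 0 recovers standard DPO; > 0 gives cDPO.
        return (-(1.0 - label_smoothing) * log_sigmoid(beta * h)
                - label_smoothing * log_sigmoid(-beta * h))
    if variant == "ipo":
        # IPO regresses the margin toward 1/(2*beta) instead of
        # pushing it to infinity through a sigmoid.
        return (h - 1.0 / (2.0 * beta)) ** 2
    raise ValueError(f"unknown variant: {variant}")
```

A quick sanity check on the formula: when the policy has not yet moved from the reference model, h = 0, so the standard DPO loss is log 2 per example and the IPO loss is (1/(2β))².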

