
Principle:Allenai Open instruct DPO Loss

From Leeroopedia


Knowledge Sources
Domains Machine Learning, Optimization, Preference Learning
Last Updated 2026-02-07 00:00 GMT

Overview

The DPO loss is a preference optimization objective that directly optimizes a language model policy to align with human preferences, bypassing the need to train an explicit reward model.

Description

Direct Preference Optimization (DPO) reformulates the reinforcement learning from human feedback (RLHF) objective into a simple classification-style loss over preference pairs. Given a prompt $x$, a preferred (chosen) response $y_w$, and a dispreferred (rejected) response $y_l$, DPO derives a closed-form loss that implicitly optimizes the same objective as RLHF with a KL-divergence constraint.

The key insight is that the optimal policy under the constrained RLHF objective can be expressed in terms of the reward function and the reference policy. By rearranging this relationship, the reward function can be eliminated entirely, yielding a loss that depends only on the policy and reference model log-probabilities.

Standard DPO Loss:

The standard DPO loss computes the sum of log-probabilities over response tokens, forming log-ratios between the policy and reference models for both chosen and rejected responses, then applies a sigmoid function.
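A minimal scalar sketch of that computation (plain Python rather than the repository's batched PyTorch tensors; function and argument names are illustrative, and β = 0.1 is just a common default):

```python
import math

def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))

def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one preference pair.

    Each *_logp argument is the SUM of per-token log-probabilities of the
    response under the given model.
    """
    chosen_logratio = policy_chosen_logp - ref_chosen_logp        # log pi/pi_ref for y_w
    rejected_logratio = policy_rejected_logp - ref_rejected_logp  # log pi/pi_ref for y_l
    logits = chosen_logratio - rejected_logratio
    return -log_sigmoid(beta * logits)
```

When the policy equals the reference model, both log-ratios are zero, so the loss starts at $\log 2 \approx 0.693$; it falls as the policy separates chosen from rejected responses.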

DPO-Norm:

DPO-Norm is a variant that uses the average log-probability (per token) instead of the sum. This normalizes for response length, preventing the model from preferring shorter responses simply because they have higher summed log-probabilities.
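A small illustration of the length bias, using hypothetical per-token log-probabilities: under summed log-probs a short response can outscore a longer response that is better on every token, while the per-token mean reverses that ranking.

```python
def sequence_logp(token_logps: list[float], average: bool = False) -> float:
    """Aggregate per-token log-probs: sum (standard DPO) or
    per-token mean (DPO-Norm)."""
    total = sum(token_logps)
    return total / len(token_logps) if average else total

short = [-1.0, -1.0]   # 2 tokens: sum -2.0, mean -1.0
long = [-0.5] * 6      # 6 tokens: sum -3.0, mean -0.5 (better per token)
```

Here `sequence_logp(short) > sequence_logp(long)` even though the long response has higher probability per token; with `average=True` the long response wins, which is the behavior DPO-Norm targets.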

SimPO (Simple Preference Optimization):

SimPO removes the reference model entirely, using only the policy's average log-probabilities. It introduces a margin term γ to create a target gap between chosen and rejected response scores. This eliminates the need to cache or compute reference model logprobs.
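A scalar sketch of the SimPO objective (names and the β, γ defaults are illustrative, not the repository's configuration):

```python
import math

def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))

def simpo_loss(chosen_avg_logp: float, rejected_avg_logp: float,
               beta: float = 2.0, gamma: float = 0.5) -> float:
    """SimPO: reference-free. Inputs are the policy's length-normalized
    (per-token average) log-probs; gamma is the target reward margin.
    -log sigma(beta*(delta - gamma/beta)) == -log sigma(beta*delta - gamma).
    """
    delta = chosen_avg_logp - rejected_avg_logp
    return -log_sigmoid(beta * delta - gamma)
```

The loss crosses $\log 2$ exactly when the average-log-prob gap equals $\gamma/\beta$, so γ sets how large a margin counts as "good enough".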

WPO (Weighted Preference Optimization):

WPO extends DPO by weighting the loss based on the policy model's confidence. The weight is computed from the average log-probabilities of both chosen and rejected responses, clamped to [0, 1]. This weights preference pairs by how likely they are under the current policy, approximating on-policy learning on off-policy preference data.
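A sketch of the weighting, under one plausible reading of the description above: the weight is taken as the exponential of the summed average log-probs (a sequence-probability proxy), clamped to [0, 1]. The exact combination rule is an assumption here.

```python
import math

def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))

def wpo_loss(dpo_logits: float,
             chosen_avg_logp: float, rejected_avg_logp: float,
             beta: float = 0.1) -> float:
    """DPO loss scaled by a policy-confidence weight.

    ASSUMPTION: weight = exp(avg_logp_w + avg_logp_l), clamped to [0, 1];
    average log-probs are <= 0, so exp() is already <= 1 in the usual case.
    """
    weight = min(math.exp(chosen_avg_logp + rejected_avg_logp), 1.0)
    return weight * -log_sigmoid(beta * dpo_logits)
```

Pairs the policy assigns high probability to get weights near 1; improbable (strongly off-policy) pairs are downweighted toward 0.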

Label Smoothing:

All variants support label smoothing, which softens the binary preference signal by mixing in a small probability of the "wrong" preference direction:

$\mathcal{L} = -(1-\epsilon)\,\log\sigma(\beta \cdot \text{logits}) \;-\; \epsilon\,\log\sigma(-\beta \cdot \text{logits})$

where ϵ is the label smoothing parameter.
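The smoothed loss above can be sketched directly (scalar Python, illustrative names; `logits` is the chosen-minus-rejected log-ratio difference from the standard loss):

```python
import math

def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))

def dpo_loss_smoothed(logits: float, beta: float = 0.1, eps: float = 0.1) -> float:
    """DPO loss with label smoothing: the preference label is treated as
    correct with probability 1 - eps and flipped with probability eps.
    eps = 0 recovers the standard DPO loss."""
    return (-(1 - eps) * log_sigmoid(beta * logits)
            - eps * log_sigmoid(-beta * logits))
```

At `logits = 0` the loss is $\log 2$ for any ε, and the definition is symmetric: smoothing by ε on one preference direction equals smoothing by 1 − ε on the flipped pair.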

Usage

Use the DPO loss when:

  • You have paired preference data (chosen vs. rejected responses for the same prompt).
  • You want to align a language model without training a separate reward model.
  • You need fine-grained control over the loss variant (standard, normalized, reference-free, or weighted).

Theoretical Basis

DPO Derivation:

Starting from the constrained RLHF objective:

$\max_{\pi}\; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(\cdot|x)}\left[r(x,y)\right] \;-\; \beta\,\mathrm{KL}\!\left[\pi(\cdot|x) \,\|\, \pi_{\text{ref}}(\cdot|x)\right]$

the optimal policy is:

$\pi^{*}(y|x) = \frac{1}{Z(x)}\,\pi_{\text{ref}}(y|x)\exp\!\left(\frac{r(x,y)}{\beta}\right)$

Solving for the reward:

$r(x,y) = \beta\log\frac{\pi^{*}(y|x)}{\pi_{\text{ref}}(y|x)} + \beta\log Z(x)$

Substituting into the Bradley-Terry preference model, $p(y_w \succ y_l \mid x) = \sigma\!\left(r(x,y_w) - r(x,y_l)\right)$, the partition-function terms $\beta\log Z(x)$ cancel, yielding:

$\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\,\mathbb{E}\left[\log\sigma\!\left(\beta\left(\log\frac{\pi_\theta(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \log\frac{\pi_\theta(y_l|x)}{\pi_{\text{ref}}(y_l|x)}\right)\right)\right]$

Implicit Rewards:

The DPO framework also defines implicit reward metrics for monitoring training:

$\text{chosen\_reward} = \beta\left(\log \pi_\theta(y_w|x) - \log \pi_{\text{ref}}(y_w|x)\right)$

$\text{rejected\_reward} = \beta\left(\log \pi_\theta(y_l|x) - \log \pi_{\text{ref}}(y_l|x)\right)$

The reward margin (chosen minus rejected) should increase during training, indicating the policy increasingly prefers the chosen response.
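These diagnostics are straightforward to compute from the same log-probs the loss already uses (scalar sketch; in practice `accuracy` would be the fraction of correctly ranked pairs in a batch):

```python
def implicit_rewards(policy_chosen_logp: float, ref_chosen_logp: float,
                     policy_rejected_logp: float, ref_rejected_logp: float,
                     beta: float = 0.1) -> dict:
    """Implicit DPO reward diagnostics for one preference pair."""
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    return {
        "chosen_reward": chosen_reward,
        "rejected_reward": rejected_reward,
        "margin": chosen_reward - rejected_reward,           # should grow during training
        "accuracy": float(chosen_reward > rejected_reward),  # pair ranked correctly?
    }
```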

SimPO Loss:

$\mathcal{L}_{\text{SimPO}} = -\log\sigma\!\left(\beta\left(\log\pi_\theta(y_w|x) - \log\pi_\theta(y_l|x) - \gamma/\beta\right)\right)$

where $\log\pi_\theta(y|x)$ here denotes the length-normalized (per-token average) log-probability.

Related Pages

Implemented By
