Principle:Hiyouga LLaMA Factory Direct Preference Optimization

Knowledge Sources	Hiyouga_LLaMA_Factory; Direct Preference Optimization: Your Language Model is Secretly a Reward Model
Domains	Natural Language Processing, Language Model Alignment, Preference Learning
Last Updated	2026-02-06 19:00 GMT

Overview

A preference-based alignment technique that optimizes a language model directly on human preference data without requiring a separate reward model or reinforcement learning loop.

Description

Direct Preference Optimization (DPO) is an alignment method introduced by Rafailov et al. (2023) that reframes the RLHF objective as a simple classification problem over preference pairs. Rather than first training a reward model and then using PPO to optimize the policy against that reward, DPO derives a closed-form mapping between the optimal policy and the reward function. This allows the preference loss to be expressed directly in terms of the policy model's log-probabilities.

DPO is significant in the ML landscape because it:

Eliminates the reward model: No separate reward model needs to be trained or maintained during alignment.
Removes RL complexity: No PPO clipping, value function estimation, or advantage computation is needed.
Maintains stability: The implicit KL-divergence constraint against a reference model prevents the policy from deviating too far from the reference distribution (typically the SFT model).
Supports multiple loss variants: The framework naturally extends to IPO (identity preference optimization), ORPO (odds ratio preference optimization), SimPO (simple preference optimization), and BCO (binary classifier optimization).

The method requires pairwise preference data where, for each prompt, a chosen (preferred) response and a rejected (dispreferred) response are provided.
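As a minimal sketch of what one such record might look like, the snippet below models a preference pair and applies a few sanity checks. The class name, field names, and validation rules are illustrative assumptions, not the dataset schema used by LLaMA Factory.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PreferencePair:
    """One training example: a prompt with a preferred and a dispreferred reply."""

    prompt: str
    chosen: str    # y_w, the response annotators preferred
    rejected: str  # y_l, the response annotators ranked lower


def validate(pair: PreferencePair) -> None:
    """Reject records that carry no usable preference signal (hypothetical rules)."""
    if not pair.prompt.strip():
        raise ValueError("prompt must be non-empty")
    if not pair.chosen.strip() or not pair.rejected.strip():
        raise ValueError("both responses must be non-empty")
    if pair.chosen == pair.rejected:
        raise ValueError("chosen and rejected responses must differ")


example = PreferencePair(
    prompt="Explain what a hash map is in one sentence.",
    chosen="A hash map stores key-value pairs and uses a hash of the key to find values quickly.",
    rejected="It is a map.",
)
validate(example)  # passes silently for a well-formed pair
```

A pair whose two responses are identical gives the loss nothing to separate, which is why the sketch rejects it.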

Usage

Use DPO when you want to:

Align a language model to human preferences after supervised fine-tuning.
Avoid the complexity and instability of PPO-based RLHF.
Work with pairwise preference datasets (chosen vs. rejected responses).
Experiment with preference optimization variants (IPO, ORPO, SimPO) using a unified framework.

DPO is most effective when high-quality pairwise preference data is available and the SFT model already produces reasonable outputs.

Theoretical Basis

Core DPO Objective

DPO starts from the observation that, under a KL-constrained reward maximization objective, any reward function and its optimal policy (parameterized here as $π_{θ}$) satisfy:

$r (x, y) = β \log \frac{π_{θ} (y ∣ x)}{π_{ref} (y ∣ x)} + β \log Z (x)$

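where $r (x, y)$ is the implicit reward, $π_{θ}$ is the policy, $π_{ref}$ is the reference model, $β$ is the KL penalty coefficient, and $Z (x)$ is the partition function. The Bradley-Terry preference model depends only on the reward difference $r (x, y_{w}) - r (x, y_{l})$, so the $β \log Z (x)$ term, which depends on the prompt alone, cancels. Substituting the expression above therefore yields the DPO loss: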

$ℒ_{DPO} (θ) = - 𝔼_{(x, y_{w}, y_{l})} [\log σ (β \log \frac{π_{θ} (y_{w} ∣ x)}{π_{ref} (y_{w} ∣ x)} - β \log \frac{π_{θ} (y_{l} ∣ x)}{π_{ref} (y_{l} ∣ x)})]$

where $y_{w}$ is the chosen (winning) response, $y_{l}$ is the rejected (losing) response, and $σ$ is the sigmoid function.
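The loss above can be sketched for a single preference pair in plain Python. This is a minimal illustration, not the LLaMA Factory implementation: the function and argument names are assumptions, the inputs are sequence-level log-probabilities that are presumed to be computed elsewhere, and the default `beta` is only an example value.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """DPO loss for one (prompt, chosen, rejected) triple."""
    # Implicit rewards: beta times the policy/reference log-ratio.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    # Negative log-likelihood that the chosen response wins under Bradley-Terry.
    return -log_sigmoid(chosen_reward - rejected_reward)
```

When the policy equals the reference model, both implicit rewards are zero and the loss is $\log 2$; raising the chosen log-probability relative to the rejected one pushes the loss below that value.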

Loss Variants

The framework supports several loss types:

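IPO (Identity Preference Optimization) replaces the log-sigmoid with a squared-error objective that regresses the log-ratio margin toward $\frac{1}{2 β}$; it is computed on average log-probabilities rather than summed log-probabilities: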

$ℒ_{IPO} = {(\log \frac{π_{θ} (y_{w} ∣ x)}{π_{ref} (y_{w} ∣ x)} - \log \frac{π_{θ} (y_{l} ∣ x)}{π_{ref} (y_{l} ∣ x)} - \frac{1}{2 β})}^{2}$
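A minimal sketch of the IPO loss for one pair, assuming the four log-probabilities have already been averaged over tokens. The function and argument names are illustrative, not taken from the framework.

```python
def ipo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """Squared distance between the log-ratio margin and its target 1 / (2 * beta)."""
    margin = (policy_chosen_logp - ref_chosen_logp) - (
        policy_rejected_logp - ref_rejected_logp
    )
    return (margin - 1.0 / (2.0 * beta)) ** 2
```

Unlike the sigmoid-based loss, this objective is minimized at a finite margin, so the policy is not rewarded for widening the gap without bound.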

ORPO (Odds Ratio Preference Optimization) is reference-free and combines SFT with an odds ratio penalty, where $odds (y) = \frac{π_{θ} (y ∣ x)}{1 - π_{θ} (y ∣ x)}$:

$ℒ_{ORPO} = - \log π_{θ} (y_{w} ∣ x) + β \cdot (- \log σ (\log \frac{odds (y_{w})}{odds (y_{l})}))$
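The ORPO objective can be sketched as follows for a single pair. This is an illustration under stated assumptions: the inputs are length-averaged log-probabilities (so each is strictly negative), the odds are defined as $p / (1 - p)$, and all function names are hypothetical.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def log_odds(avg_logp: float) -> float:
    """log(p / (1 - p)) for p = exp(avg_logp), requiring avg_logp < 0."""
    if avg_logp >= 0.0:
        raise ValueError("average log-probability must be negative")
    # log(1 - exp(x)) written as log(-expm1(x)) for accuracy near x = 0.
    return avg_logp - math.log(-math.expm1(avg_logp))


def orpo_loss(chosen_avg_logp: float, rejected_avg_logp: float, beta: float = 0.1) -> float:
    """SFT term on the chosen response plus a weighted odds-ratio penalty."""
    sft_term = -chosen_avg_logp
    log_odds_ratio = log_odds(chosen_avg_logp) - log_odds(rejected_avg_logp)
    return sft_term + beta * (-log_sigmoid(log_odds_ratio))
```

No reference log-probabilities appear anywhere in the computation, which is what makes the method reference-free.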

SimPO (Simple Preference Optimization) is reference-free and uses a length-normalized margin:

$ℒ_{SimPO} = - \log σ (β (\bar{r} (y_{w}) - \bar{r} (y_{l})) - γ)$

where $γ$ is a target reward margin and $\bar{r}$ denotes length-normalized log-probabilities.
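A minimal sketch of the SimPO loss for one pair, taking summed log-probabilities and token counts as inputs. It follows the SimPO paper's placement of the margin, subtracting $γ$ after the length-normalized difference has been scaled by $β$. The names and default values are illustrative assumptions.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def simpo_loss(
    chosen_sum_logp: float,
    chosen_len: int,
    rejected_sum_logp: float,
    rejected_len: int,
    beta: float = 2.0,
    gamma: float = 0.5,
) -> float:
    """Reference-free loss on length-normalized log-probabilities with a target margin."""
    if chosen_len <= 0 or rejected_len <= 0:
        raise ValueError("response lengths must be positive")
    chosen_avg = chosen_sum_logp / chosen_len        # r-bar(y_w)
    rejected_avg = rejected_sum_logp / rejected_len  # r-bar(y_l)
    return -log_sigmoid(beta * (chosen_avg - rejected_avg) - gamma)
```

Because the rewards are per-token averages, a response that is simply longer does not score lower for that reason alone.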

Auxiliary SFT Loss

An optional auxiliary SFT loss on the chosen responses can be added to prevent catastrophic forgetting:

$ℒ_{total} = ℒ_{DPO} + γ_{ftx} \cdot ℒ_{SFT} (y_{w})$

where $γ_{ftx}$ controls the weight of the SFT regularization term.
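The combination above can be sketched as a small helper. It assumes the SFT term is the mean per-token negative log-likelihood of the chosen response; the function name and the `ftx_weight` argument (standing in for $γ_{ftx}$) are hypothetical.

```python
def total_loss(
    preference_loss: float,
    chosen_sum_logp: float,
    chosen_len: int,
    ftx_weight: float = 0.0,
) -> float:
    """Preference loss plus a weighted SFT term on the chosen response."""
    if chosen_len <= 0:
        raise ValueError("chosen_len must be positive")
    # Assumed SFT loss: average negative log-likelihood per token of y_w.
    sft_loss = -chosen_sum_logp / chosen_len
    return preference_loss + ftx_weight * sft_loss
```

With `ftx_weight` set to zero the helper returns the preference loss unchanged, which matches the term being optional.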

Reference Model

When a reference model is used (use_ref_model=True), DPO computes log-probabilities from both the policy and a frozen copy of the original model. When using LoRA, the reference model can be implicitly obtained by disabling the adapter layers, avoiding the need to load a separate model into memory.
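The adapter-disabling idea can be illustrated with a toy one-dimensional layer. This is a conceptual sketch only: `ToyLoraLinear` and `adapter_disabled` are hypothetical names, and real adapter libraries expose their own mechanisms for switching adapters off.

```python
from contextlib import contextmanager


class ToyLoraLinear:
    """A 1-D stand-in for a LoRA layer: output = (base_weight + delta) * x."""

    def __init__(self, base_weight: float, delta: float) -> None:
        self.base_weight = base_weight  # frozen weight shared with the reference
        self.delta = delta              # trainable adapter update
        self.adapter_enabled = True

    def forward(self, x: float) -> float:
        weight = self.base_weight + (self.delta if self.adapter_enabled else 0.0)
        return weight * x

    @contextmanager
    def adapter_disabled(self):
        """Temporarily run the layer with base weights only."""
        previous = self.adapter_enabled
        self.adapter_enabled = False
        try:
            yield self
        finally:
            self.adapter_enabled = previous


layer = ToyLoraLinear(base_weight=2.0, delta=0.5)
policy_out = layer.forward(3.0)         # adapter active: policy output
with layer.adapter_disabled():
    reference_out = layer.forward(3.0)  # adapter off: reference output
```

Both outputs come from the same object, so only one set of base weights is held in memory; the adapter state is restored on exit even if the forward pass raises.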
