Principle:Hiyouga LLaMA Factory Direct Preference Optimization
| Knowledge Sources | |
|---|---|
| Domains | Natural Language Processing, Language Model Alignment, Preference Learning |
| Last Updated | 2026-02-06 19:00 GMT |
Overview
A preference-based alignment technique that optimizes a language model directly on human preference data without requiring a separate reward model or reinforcement learning loop.
Description
Direct Preference Optimization (DPO) is an alignment method introduced by Rafailov et al. (2023) that reframes the RLHF objective as a simple classification problem over preference pairs. Rather than first training a reward model and then using PPO to optimize the policy against that reward, DPO derives a closed-form mapping between the optimal policy and the reward function. This allows the preference loss to be expressed directly in terms of the policy model's log-probabilities.
DPO is significant in the ML landscape because it:
- Eliminates the reward model: No separate reward model needs to be trained or maintained during alignment.
- Removes RL complexity: No PPO clipping, value function estimation, or advantage computation is needed.
- Maintains stability: The implicit KL-divergence constraint against a reference model discourages the policy from drifting far from the reference distribution, which is typically the SFT model.
- Supports multiple loss variants: The framework naturally extends to IPO (identity preference optimization), ORPO (odds ratio preference optimization), SimPO (simple preference optimization), and BCO (binary classifier optimization).
The method requires pairwise preference data where, for each prompt, a chosen (preferred) response and a rejected (dispreferred) response are provided.
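As a rough illustration of what one pairwise preference record might look like, the sketch below defines a small container and a sanity check. The class name `PreferencePair`, its field names, and `validate_pair` are hypothetical and chosen for this example only; they are not the dataset schema used by LLaMA Factory.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PreferencePair:
    """One preference example (hypothetical field names)."""
    prompt: str
    chosen: str    # preferred response
    rejected: str  # dispreferred response


def validate_pair(pair: PreferencePair) -> None:
    """Raise ValueError if the pair cannot supply a preference signal."""
    if not pair.prompt.strip():
        raise ValueError("prompt must be non-empty")
    if not pair.chosen.strip() or not pair.rejected.strip():
        raise ValueError("both responses must be non-empty")
    if pair.chosen == pair.rejected:
        raise ValueError("chosen and rejected responses must differ")


example = PreferencePair(
    prompt="Explain what a hash table is in one sentence.",
    chosen="A hash table maps keys to values by hashing each key to a bucket index.",
    rejected="It is a table.",
)
validate_pair(example)  # passes silently for a well-formed pair
```

Identical chosen and rejected responses are rejected here because such a pair carries no preference information.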
Usage
Use DPO when you want to:
- Align a language model to human preferences after supervised fine-tuning.
- Avoid the complexity and instability of PPO-based RLHF.
- Work with pairwise preference datasets (chosen vs. rejected responses).
- Experiment with preference optimization variants (IPO, ORPO, SimPO) using a unified framework.
DPO is most effective when high-quality pairwise preference data is available and the SFT model already produces reasonable outputs.
Theoretical Basis
Core DPO Objective
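DPO starts from the observation that the optimal policy under a KL-constrained reward maximization objective satisfies:
$$r(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\text{ref}}(y \mid x)} + \beta \log Z(x)$$
where $r(x, y)$ is the implicit reward, $\pi_\theta$ is the policy, $\pi_{\text{ref}}$ is the reference model, $\beta$ is the KL penalty coefficient, and $Z(x)$ is the partition function. Substituting this into the Bradley-Terry preference model, where the $\beta \log Z(x)$ term cancels because it depends only on the prompt, yields the DPO loss:
$$\mathcal{L}_{\text{DPO}} = -\mathbb{E}_{(x, y_w, y_l)} \left[ \log \sigma \left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} \right) \right]$$
where $y_w$ is the chosen (winning) response, $y_l$ is the rejected (losing) response, and $\sigma$ is the sigmoid function.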
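The per-example DPO loss can be sketched in a few lines of plain Python. This is a minimal illustration, assuming the four inputs are summed sequence log-probabilities of the chosen and rejected responses under the policy and the reference model; the function and argument names are made up for this example, and a real trainer would operate on batched tensors.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """Per-example DPO loss from sequence log-probabilities."""
    # Implicit rewards: scaled log-ratio of policy to reference.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    # Negative log-likelihood that the chosen response wins (Bradley-Terry).
    return -log_sigmoid(chosen_reward - rejected_reward)


# With the policy equal to the reference, both implicit rewards are zero.
loss_at_init = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# Raising the chosen log-probability relative to the reference lowers the loss.
loss_after_update = dpo_loss(-8.0, -12.0, -10.0, -12.0)
```

When the policy equals the reference, the margin is zero and the loss equals log 2, which is a handy check that training starts from the expected value.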
Loss Variants
The framework supports several loss types:
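IPO (Identity Preference Optimization) replaces the log-sigmoid with a squared loss that regresses the log-ratio margin toward $1/(2\beta)$, and uses average log-probabilities rather than summed log-probabilities:
$$\mathcal{L}_{\text{IPO}} = \mathbb{E}_{(x, y_w, y_l)} \left[ \left( \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} - \frac{1}{2\beta} \right)^2 \right]$$
ORPO (Odds Ratio Preference Optimization) is reference-free and combines SFT with an odds ratio penalty weighted by $\beta$:
$$\mathcal{L}_{\text{ORPO}} = \mathcal{L}_{\text{SFT}} - \beta \log \sigma \left( \log \frac{\text{odds}_\theta(y_w \mid x)}{\text{odds}_\theta(y_l \mid x)} \right), \qquad \text{odds}_\theta(y \mid x) = \frac{\pi_\theta(y \mid x)}{1 - \pi_\theta(y \mid x)}$$
SimPO (Simple Preference Optimization) is reference-free and uses a length-normalized margin:
$$\mathcal{L}_{\text{SimPO}} = -\log \sigma \left( \beta \left( \overline{\log \pi_\theta}(y_w \mid x) - \overline{\log \pi_\theta}(y_l \mid x) \right) - \gamma \right)$$
where $\gamma$ is a target reward margin and $\overline{\log \pi_\theta}(y \mid x) = \frac{1}{|y|} \log \pi_\theta(y \mid x)$ denotes length-normalized log-probabilities.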
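The three variants above can be compared side by side with a small sketch in plain Python. It assumes per-example scalar inputs, that the ORPO and SimPO inputs are average (length-normalized) log-probabilities strictly below zero, and that `beta` weights the odds ratio term in ORPO. Function names and default values are illustrative and are not taken from any library.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def ipo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Squared loss pulling the log-ratio margin toward 1 / (2 * beta)."""
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    return (margin - 1.0 / (2.0 * beta)) ** 2


def orpo_loss(avg_chosen_logp, avg_rejected_logp, beta=0.1):
    """SFT loss on the chosen response plus a weighted odds ratio penalty."""
    # log odds(p) = log p - log(1 - p); requires log p < 0.
    log_odds_chosen = avg_chosen_logp - math.log1p(-math.exp(avg_chosen_logp))
    log_odds_rejected = avg_rejected_logp - math.log1p(-math.exp(avg_rejected_logp))
    sft_loss = -avg_chosen_logp
    odds_ratio_loss = -log_sigmoid(log_odds_chosen - log_odds_rejected)
    return sft_loss + beta * odds_ratio_loss


def simpo_loss(avg_chosen_logp, avg_rejected_logp, beta=2.0, gamma=0.5):
    """Reference-free loss with a target margin on normalized log-probabilities."""
    return -log_sigmoid(beta * (avg_chosen_logp - avg_rejected_logp) - gamma)


# Example values: the chosen response is more likely than the rejected one.
ipo = ipo_loss(-1.0, -2.0, -2.0, -2.0, beta=0.5)
orpo = orpo_loss(-0.2, -1.0)
simpo = simpo_loss(-1.0, -1.5, beta=1.0, gamma=0.5)
```

Note that only `ipo_loss` takes reference log-probabilities; the ORPO and SimPO sketches need the policy alone, which is what makes them reference-free.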
Auxiliary SFT Loss
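An optional auxiliary SFT loss on the chosen responses can be added to prevent catastrophic forgetting:
$$\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{DPO}} + \lambda \, \mathcal{L}_{\text{SFT}}, \qquad \mathcal{L}_{\text{SFT}} = -\log \pi_\theta(y_w \mid x)$$
where $\lambda$ controls the weight of the SFT regularization term.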
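A minimal sketch of how the auxiliary term might be combined with a preference loss is shown below. It assumes the SFT term is the length-normalized negative log-likelihood of the chosen response; the function name, argument names, and the normalization choice are assumptions made for illustration.

```python
def combined_loss(
    preference_loss: float,
    chosen_logp_sum: float,
    chosen_num_tokens: int,
    sft_weight: float = 0.0,
) -> float:
    """Add a weighted SFT term on the chosen response to a preference loss."""
    if chosen_num_tokens <= 0:
        raise ValueError("chosen_num_tokens must be positive")
    # Assumption: SFT loss is the average negative log-likelihood per token.
    sft_loss = -chosen_logp_sum / chosen_num_tokens
    return preference_loss + sft_weight * sft_loss


# A weight of zero leaves the preference loss unchanged.
no_sft = combined_loss(0.7, chosen_logp_sum=-20.0, chosen_num_tokens=10)
with_sft = combined_loss(0.7, chosen_logp_sum=-20.0, chosen_num_tokens=10, sft_weight=0.5)
```

A larger weight keeps the likelihood of the chosen responses from collapsing while the preference margin grows, at the cost of a weaker preference signal.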
Reference Model
When a reference model is used (use_ref_model=True), DPO computes log-probabilities from both the policy and a frozen copy of the original model. When using LoRA, the reference model can be implicitly obtained by disabling the adapter layers, avoiding the need to load a separate model into memory.
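The adapter-disabling idea can be illustrated with a toy scalar model. This is a conceptual sketch only: `ToyLoRAModel` and `adapter_disabled` are hypothetical names invented for this example, and a real setup would rely on the adapter library's own mechanism (PEFT, for instance, exposes a comparable `disable_adapter()` context manager on its wrapped models).

```python
from contextlib import contextmanager


class ToyLoRAModel:
    """Scalar stand-in for a LoRA-wrapped layer: y = (w_base + delta) * x."""

    def __init__(self, w_base: float, lora_delta: float):
        self.w_base = w_base          # frozen base weight (shared with the reference)
        self.lora_delta = lora_delta  # trainable adapter update (a scalar here)
        self.adapter_enabled = True

    def forward(self, x: float) -> float:
        delta = self.lora_delta if self.adapter_enabled else 0.0
        return (self.w_base + delta) * x

    @contextmanager
    def adapter_disabled(self):
        """Temporarily bypass the adapter so the model acts as the reference."""
        previous = self.adapter_enabled
        self.adapter_enabled = False
        try:
            yield self
        finally:
            self.adapter_enabled = previous


model = ToyLoRAModel(w_base=2.0, lora_delta=0.5)
policy_out = model.forward(3.0)          # base weight plus adapter
with model.adapter_disabled():
    reference_out = model.forward(3.0)   # base weight only
```

Because the base weights are frozen and shared, one set of parameters in memory serves as both policy and reference; only the adapter toggle differs between the two forward passes.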