Principle: Direct Preference Optimization
| Field | Value |
|---|---|
| Domains | NLP, Alignment, Optimization |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
An alignment technique that directly optimizes a language model on preference data without training a separate reward model or using reinforcement learning.
Description
Direct Preference Optimization (DPO) reformulates the RLHF objective into a simple classification loss over preference pairs. Instead of the traditional multi-stage RLHF pipeline (train a reward model on preference data, then use PPO to optimize the policy against it), DPO derives a closed-form loss that directly increases the likelihood of preferred responses relative to rejected ones, regularized by a KL divergence from a reference model.
DPO achieves comparable or superior results to PPO-based RLHF while being simpler to implement, more stable to train, and computationally lighter.
Usage
Use DPO when aligning a language model to human preferences. It requires a preference dataset with (prompt, chosen, rejected) triples, a policy model, and a frozen reference model. DPO is preferred over PPO when simplicity and training stability are priorities.
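As an illustration of the (prompt, chosen, rejected) format mentioned above, a single preference example might look like the following sketch; the field names are a common convention but vary between libraries:

```python
# One preference triple: the "chosen" response was preferred over "rejected"
# for the same prompt. Field names are illustrative, not a fixed standard.
preference_example = {
    "prompt": "Explain what DPO is in one sentence.",
    "chosen": "DPO aligns a language model to preferences with a simple "
              "classification loss over preference pairs.",
    "rejected": "DPO is a kind of database query optimizer.",
}
```

A preference dataset is then simply a list of such triples; the policy and frozen reference models score both responses for each prompt during training.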
Theoretical Basis
The DPO loss is:

$$
\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right]
$$

Where:
- $\pi_\theta$ is the policy model being trained
- $\pi_{\text{ref}}$ is the frozen reference model
- $y_w$ is the preferred response, $y_l$ is the rejected response
- $\beta$ controls how much the policy can diverge from the reference (0.1 is a common default)

A higher $\beta$ means less divergence from the reference model.
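The per-pair loss can be computed directly from the summed token log-probabilities that each model assigns to the two responses. A minimal plain-Python sketch (the function name and default are illustrative, not from a specific library):

```python
import math

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair, given summed token log-probs."""
    # Implicit reward of each response: beta * log(pi_theta / pi_ref).
    chosen_reward = beta * (policy_logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (policy_logp_rejected - ref_logp_rejected)
    # Negative log-sigmoid of the reward margin: a binary classification
    # loss that pushes the chosen response above the rejected one.
    margin = chosen_reward - rejected_reward
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the policy matches the reference exactly, both rewards are zero and the loss is $\log 2 \approx 0.693$; as the policy assigns relatively more probability to the chosen response, the loss falls below that value. In practice the same expression is computed batched over tensors, averaged over the dataset.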