Principle: Hugging Face Alignment Handbook: Direct Preference Optimization
| Knowledge Sources | |
|---|---|
| Domains | NLP, Deep_Learning, Reinforcement_Learning |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
An alignment algorithm that directly optimizes a language model's policy using human preference data without requiring a separate reward model or reinforcement learning loop.
Description
Direct Preference Optimization (DPO) is an alternative to RLHF that eliminates the need for training a reward model and running PPO. Instead, DPO reparameterizes the reward function to derive a closed-form loss that directly optimizes the policy model using preference pairs (chosen vs. rejected responses).
DPO addresses the complexity and instability of the traditional RLHF pipeline (SFT → Reward Model → PPO) by showing that the optimal policy under the Bradley-Terry preference model can be learned with a simple binary cross-entropy-like loss. This makes DPO simpler to implement, more stable to train, and computationally cheaper than PPO-based RLHF.
In the alignment-handbook, DPO is the second stage of the alignment pipeline, applied after SFT. The DPO trainer requires both a policy model (the model being optimized) and a frozen reference model (typically the SFT checkpoint) to compute the implicit reward.
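The implicit reward mentioned above is the scaled log-ratio between the policy and the frozen reference, $r(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\text{ref}}(y \mid x)}$ (up to a prompt-only term). A minimal sketch of that quantity, assuming summed response-token log-probabilities as inputs; the function name and signature are illustrative, not from the handbook:

```python
def implicit_reward(logp_policy: float, logp_ref: float, beta: float = 0.1) -> float:
    """DPO's implicit reward: beta times the policy/reference log-ratio.

    logp_policy, logp_ref: summed log-probabilities of the response tokens
    under the policy being trained and the frozen reference model.
    """
    return beta * (logp_policy - logp_ref)
```

A response the policy assigns higher probability than the reference gets a positive implicit reward; one it downweights gets a negative reward.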
Usage
Use DPO when:
- You have preference data (chosen/rejected response pairs for the same prompts)
- You want to improve model alignment beyond what SFT achieves
- You prefer a simpler alternative to PPO-based RLHF
- A reference model (usually the SFT checkpoint) is available
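The preference data referenced above pairs each prompt with a chosen and a rejected completion. A hypothetical record in the common chosen/rejected convention (field names and contents are illustrative, not a specific dataset schema):

```python
# One illustrative preference record: the same prompt with a preferred
# ("chosen") and a dispreferred ("rejected") completion.
preference_example = {
    "prompt": "Explain what DPO is in one sentence.",
    "chosen": "DPO directly optimizes a policy on preference pairs without a reward model.",
    "rejected": "DPO is a library for tokenizing text.",
}
```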
Theoretical Basis
The DPO loss is derived from the Bradley-Terry preference model:

$$\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right]$$

Where:
- $\pi_\theta$ is the policy model being trained
- $\pi_{\text{ref}}$ is the frozen reference model (SFT checkpoint)
- $y_w$ is the chosen (preferred) response
- $y_l$ is the rejected response
- $\beta$ is a temperature parameter controlling deviation from the reference policy
- $\sigma$ is the sigmoid function
```python
# Abstract DPO training loop (pseudocode, NOT a real implementation)
for prompt, chosen, rejected in preference_data:
    # Log-prob ratios of the policy vs. the frozen reference for each response
    log_ratio_chosen = log_prob(policy, chosen) - log_prob(ref_model, chosen)
    log_ratio_rejected = log_prob(policy, rejected) - log_prob(ref_model, rejected)
    # Binary cross-entropy-style DPO loss on the preference margin
    loss = -log_sigmoid(beta * (log_ratio_chosen - log_ratio_rejected))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```
The beta parameter (typically 0.01-0.1) controls how much the policy can deviate from the reference model. Lower values allow more deviation; higher values keep the policy closer to the reference.
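Putting the pieces together, the loss can be computed directly from summed response log-probabilities. A minimal, numerically stable sketch in plain Python for a single pair (real implementations batch this over tensors; names are illustrative):

```python
import math

def dpo_loss(logp_policy_chosen: float, logp_ref_chosen: float,
             logp_policy_rejected: float, logp_ref_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss for one preference pair.

    Inputs are summed token log-probabilities of each response under the
    policy being trained and the frozen reference model.
    """
    margin = beta * ((logp_policy_chosen - logp_ref_chosen)
                     - (logp_policy_rejected - logp_ref_rejected))
    # -log(sigmoid(margin)) == softplus(-margin), computed in a stable form
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

With equal log-ratios the loss is log 2 ≈ 0.693; it falls toward zero as the policy separates the chosen response from the rejected one relative to the reference, and a larger beta amplifies that separation in the loss.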