Principle: OpenRLHF Direct Preference Optimization
| Knowledge Sources | |
|---|---|
| Domains | NLP, Alignment, Training |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
An alignment method that directly optimizes a language model policy from preference data without training an explicit reward model.
Description
Direct Preference Optimization (DPO) reformulates the RLHF objective to eliminate the need for a separate reward model and RL training loop. It derives a closed-form solution for the optimal policy under a KL-constrained reward maximization objective, then directly optimizes the policy using a binary cross-entropy loss over preference pairs.
DPO requires a frozen reference model (typically the SFT model) and a policy model that is trained to increase the log-probability ratio of chosen over rejected responses relative to the reference model.
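The mechanics above can be sketched as a loss function over summed per-sequence log-probabilities. This is a minimal sketch, not OpenRLHF's actual implementation; the function name, argument layout, and the default `beta` are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO loss over a batch of preference pairs.

    Each argument is a 1-D tensor of summed per-token log-probabilities,
    one entry per sequence in the batch. The reference tensors come from
    the frozen SFT model and receive no gradient.
    """
    # Log-probability ratios of the trainable policy vs. the frozen reference
    chosen_logratios = policy_chosen_logps - ref_chosen_logps
    rejected_logratios = policy_rejected_logps - ref_rejected_logps
    # Implicit reward margin; beta scales the strength of the KL constraint
    logits = beta * (chosen_logratios - rejected_logratios)
    # Binary cross-entropy: push sigma(margin) toward 1 for chosen > rejected
    return -F.logsigmoid(logits).mean()
```

At a zero margin the loss equals log 2; it decreases as the policy raises the chosen ratio relative to the rejected one.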
Usage
Use DPO when you have preference data but want to avoid the complexity of training a separate reward model and running PPO. DPO is simpler and often more stable than PPO, though it requires paired preference data. It is also used in iterative DPO loops with on-policy data generation.
Theoretical Basis
DPO starts from the KL-constrained RLHF objective and derives the implicit reward:

$$r(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\text{ref}}(y \mid x)} + \beta \log Z(x)$$
Substituting into the Bradley-Terry preference model (where the partition function $Z(x)$ cancels) gives the DPO loss:

$$\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right]$$
Variants supported:
- Standard DPO: The loss above
- cDPO: Conservative DPO with label smoothing
- IPO: Identity Preference Optimization, which replaces the log-sigmoid with a squared loss
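The three variants can be expressed as small modifications of the same margin term. This is an illustrative sketch, not OpenRLHF's code; the function name and defaults are assumptions, and the IPO branch follows the common convention of reusing `beta` in the role of the IPO temperature, so the squared loss pulls the margin toward `1/(2*beta)`.

```python
import torch
import torch.nn.functional as F

def preference_loss(margin: torch.Tensor, beta: float = 0.1,
                    variant: str = "dpo",
                    label_smoothing: float = 0.1) -> torch.Tensor:
    """margin = (chosen policy/ref log-ratio) - (rejected policy/ref log-ratio)."""
    if variant == "dpo":
        # Standard DPO: binary cross-entropy on the scaled margin
        return -F.logsigmoid(beta * margin).mean()
    if variant == "cdpo":
        # Conservative DPO: assume each preference label is flipped
        # with probability label_smoothing
        return (-(1 - label_smoothing) * F.logsigmoid(beta * margin)
                - label_smoothing * F.logsigmoid(-beta * margin)).mean()
    if variant == "ipo":
        # IPO: squared loss regressing the margin toward 1/(2*beta)
        return ((margin - 1 / (2 * beta)) ** 2).mean()
    raise ValueError(f"unknown variant: {variant}")
```

Unlike the sigmoid losses, the IPO objective penalizes margins that overshoot the target, which limits over-optimization on noisy pairs.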