Principle:Hpcaitech ColossalAI DPO Training
| Knowledge Sources | |
|---|---|
| Domains | NLP, Reinforcement_Learning |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
A preference alignment algorithm that directly optimizes a language model's policy using human preference data without requiring a separate reward model.
Description
Direct Preference Optimization (DPO) reformulates the RLHF objective as a classification problem on preference pairs. Instead of training a reward model and then optimizing against it with RL (as in PPO), DPO directly increases the probability of chosen responses relative to rejected responses, with an implicit KL-divergence penalty against a frozen reference model that keeps the policy from drifting too far from its initial distribution.
The training loop requires maintaining two models: a trainable policy model and a frozen reference model (typically a copy of the initial policy). For each batch, both models compute log probabilities on chosen and rejected sequences, and the DPO loss is computed from the difference in log probability ratios.
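The per-example loss computed from those four log probabilities can be sketched in plain Python. This is a minimal illustration under stated assumptions, not ColossalAI's actual batched, tensor-based implementation; the function and argument names here are hypothetical:

```python
import math

def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Per-example DPO loss from sequence log probabilities.

    Each argument is the total log probability a model assigns to a
    full response sequence. A real trainer computes these in batches
    with the reference model under no-grad; this sketch shows only
    the loss arithmetic.
    """
    # Log-ratio of policy vs. frozen reference, for each response
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp

    # DPO margin, scaled by beta, pushed through -log(sigmoid(.))
    margin = beta * (chosen_logratio - rejected_logratio)
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the policy still matches the reference model, both log-ratios are zero and the loss is `log 2`; the loss falls as the policy raises the chosen response's probability relative to the rejected one.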
Usage
Use DPO when you have human preference data (chosen/rejected pairs) and want to align a model without the complexity of training a separate reward model. DPO is simpler and more stable than PPO-based RLHF.
Theoretical Basis
The DPO loss function:

$$\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]$$

Where:
- $\pi_\theta$ is the policy model being trained
- $\pi_{\mathrm{ref}}$ is the frozen reference model
- $y_w$ is the chosen (winning) response
- $y_l$ is the rejected (losing) response
- $\beta$ is the temperature parameter controlling deviation from the reference model