Principle: Axolotl DPO Training Execution
| Knowledge Sources | |
|---|---|
| Domains | Alignment, Reinforcement_Learning, Training |
| Last Updated | 2026-02-06 23:00 GMT |
Overview
A training execution pattern that optimizes a language model to align with human preferences using paired chosen/rejected responses without explicit reward modeling.
Description
Direct Preference Optimization (DPO) training bypasses the traditional RLHF pipeline (reward model + PPO) by directly optimizing the policy model on preference pairs. The DPO loss function implicitly defines a reward through the log-probability ratio between the policy and reference models.
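In the DPO formulation, that implicit reward is the scaled log-probability ratio between the policy and the reference model (up to a prompt-only term that cancels when a chosen/rejected pair is compared):

$$\hat{r}_\theta(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}$$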
In Axolotl, DPO training is handled by HFRLTrainerBuilder which constructs an AxolotlDPOTrainer (extending TRL's DPOTrainer). The builder configures DPO-specific training arguments via DPOStrategy, which sets the loss type (DPO/IPO), label smoothing, max length, and evaluation generation settings.
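For orientation, here is a minimal sketch of the underlying TRL API that AxolotlDPOTrainer extends. The checkpoint, dataset, and hyperparameter values are illustrative assumptions rather than Axolotl defaults, and the keyword for passing the tokenizer (tokenizer vs. processing_class) depends on the TRL version.

```python
# Minimal sketch using TRL directly (AxolotlDPOTrainer extends trl.DPOTrainer).
# Checkpoint, dataset, and hyperparameter values below are illustrative assumptions.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_name = "Qwen/Qwen2.5-0.5B-Instruct"  # assumed example checkpoint
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# A preference dataset with "prompt", "chosen", and "rejected" columns.
train_dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

args = DPOConfig(
    output_dir="dpo-output",
    beta=0.1,                 # strength of the implicit KL penalty
    loss_type="sigmoid",      # "sigmoid" = standard DPO; "ipo" selects the IPO loss
    label_smoothing=0.0,      # soft labels for noisy preference data
    max_length=1024,          # truncation length for prompt + response
    generate_during_eval=False,
)

trainer = DPOTrainer(
    model=model,                 # ref_model defaults to a frozen copy of the policy
    args=args,
    train_dataset=train_dataset,
    processing_class=tokenizer,  # older TRL versions use tokenizer= instead
)
trainer.train()
```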
Axolotl supports multiple DPO variants: standard DPO, IPO (which uses a different loss function), SimPO (reference-free), and ORPO (odds ratio).
Usage
Use DPO training execution when:
- You are aligning a model with human preferences
- You have paired chosen/rejected response data (see the example record after this list)
- You want a simpler approach than full RLHF (reward model + PPO)
- The model has already been instruction-tuned (SFT) and needs preference alignment
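For reference, a single preference record in the common prompt/chosen/rejected layout might look like the following; the field names follow the convention TRL-style trainers expect, and the text itself is made up.

```python
# One hypothetical preference pair in the prompt/chosen/rejected layout.
preference_example = {
    "prompt": "Explain what the reference model does in DPO training.",
    "chosen": (
        "The reference model is a frozen copy of the starting policy; the DPO loss "
        "penalizes the trained policy for drifting too far from it."
    ),
    "rejected": "It's the same model, so nothing really changes during training.",
}
```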
Theoretical Basis
DPO Loss:
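For a preference pair with prompt $x$, chosen response $y_w$, and rejected response $y_l$, the objective from the DPO paper (Rafailov et al., 2023) is:

$$
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right]
$$

where $\sigma$ is the logistic function and $\beta$ scales the implicit KL penalty toward the reference model.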
IPO Loss (alternative):
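IPO replaces the logistic loss with a squared regression toward a fixed margin. Writing $h_\theta(x, y_w, y_l) = \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}$ for the log-ratio difference, and using $\beta$ for the regularization parameter (called $\tau$ in the IPO paper), the loss can be written as:

$$
\mathcal{L}_{\mathrm{IPO}} = \mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[ \left( h_\theta(x, y_w, y_l) - \frac{1}{2\beta} \right)^2 \right]
$$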
Key hyperparameters:
- beta: Controls the strength of the KL penalty toward the reference model; lower values allow more divergence (see the loss sketch after this list)
- label_smoothing: Applies soft labels to the preference loss to tolerate noisy preference annotations
- generate_during_eval: Generate completions during evaluation for qualitative assessment
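To make the roles of beta and label_smoothing concrete, here is a small illustrative sketch of a sigmoid-style DPO loss over per-pair sequence log-probabilities; it mirrors the conservative-DPO way of smoothing labels, but it is a sketch, not Axolotl's or TRL's actual code.

```python
import torch
import torch.nn.functional as F

def dpo_sigmoid_loss(
    policy_chosen_logps: torch.Tensor,    # log pi_theta(y_w | x), summed over tokens
    policy_rejected_logps: torch.Tensor,  # log pi_theta(y_l | x)
    ref_chosen_logps: torch.Tensor,       # log pi_ref(y_w | x)
    ref_rejected_logps: torch.Tensor,     # log pi_ref(y_l | x)
    beta: float = 0.1,
    label_smoothing: float = 0.0,
) -> torch.Tensor:
    """Illustrative sigmoid DPO loss showing where beta and label_smoothing enter."""
    chosen_logratios = policy_chosen_logps - ref_chosen_logps
    rejected_logratios = policy_rejected_logps - ref_rejected_logps
    margin = chosen_logratios - rejected_logratios  # implicit reward margin / beta

    # label_smoothing = 0 recovers the standard DPO loss; a value in (0, 0.5) treats
    # each preference label as wrong with that probability (conservative DPO).
    losses = (
        -F.logsigmoid(beta * margin) * (1.0 - label_smoothing)
        - F.logsigmoid(-beta * margin) * label_smoothing
    )
    return losses.mean()
```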