Principle: Hugging Face TRL DPO Training
| Knowledge Sources | |
|---|---|
| Domains | NLP, RLHF |
| Last Updated | 2026-02-06 17:00 GMT |
Overview
Direct Preference Optimization learns from human preference pairs without requiring a separate reward model by using a closed-form loss derived from the constrained RLHF objective.
Description
DPO training is the core optimization loop that adjusts the policy model's parameters to better align with human preferences. Unlike the traditional RLHF pipeline (which trains a reward model, then optimizes the policy via PPO), DPO directly optimizes the policy using a supervised-style loss on preference pairs.
The training procedure works as follows:
- Forward pass on concatenated batch: For each training batch, the chosen and rejected completions are concatenated along the batch dimension (doubling the effective batch size) and passed through the model in a single forward pass. This is more efficient than two separate forward passes, especially for FSDP training.
- Log probability computation: Per-token log probabilities are computed for both chosen and rejected completions using selective_log_softmax. Prompt tokens are masked out, so the loss covers only completion tokens, and the per-token log probs are summed to give sequence-level log probabilities.
- Reference log probability computation: Reference model log probabilities are either retrieved from precomputed values or computed on-the-fly using the reference model (or by disabling PEFT adapters).
- Loss computation: The DPO loss computes the log-probability ratio between policy and reference for both chosen and rejected responses, then applies the selected loss function (sigmoid, IPO, hinge, etc.) with the beta temperature parameter.
- Multi-loss combination: Multiple loss types can be combined with configurable weights, enabling approaches like MPO which blends sigmoid DPO loss with SFT loss.
- Gradient update: Standard gradient descent with gradient accumulation, gradient checkpointing, and mixed-precision training.
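The log-probability and loss steps above can be sketched in plain Python for a single preference pair. All numbers below are hypothetical, and the real TRL code operates on batched tensors, but the arithmetic is the same:

```python
import math

# Hypothetical per-token log probs for one preference pair; the first
# prompt_len tokens belong to the prompt and are masked out of the loss.
chosen_token_logps   = [-1.0, -2.0, -0.5, -0.7, -0.3]  # policy, chosen
rejected_token_logps = [-1.0, -2.0, -1.5, -1.8, -1.2]  # policy, rejected
ref_chosen_logps     = [-1.0, -2.0, -0.6, -0.9, -0.4]  # reference, chosen
ref_rejected_logps   = [-1.0, -2.0, -1.4, -1.6, -1.0]  # reference, rejected
prompt_len = 2

def seq_logp(token_logps, prompt_len):
    """Sum per-token log probs over completion tokens only (prompt masked)."""
    return sum(token_logps[prompt_len:])

beta = 0.1
# Log-ratio "rewards" log(pi/pi_ref) for each completion
chosen_logratio = seq_logp(chosen_token_logps, prompt_len) - seq_logp(ref_chosen_logps, prompt_len)
rejected_logratio = seq_logp(rejected_token_logps, prompt_len) - seq_logp(ref_rejected_logps, prompt_len)
logits = chosen_logratio - rejected_logratio

# Standard sigmoid DPO loss: -log sigma(beta * logits)
loss = -math.log(1.0 / (1.0 + math.exp(-beta * logits)))
print(round(loss, 4))
```

Because the chosen completion's log-ratio exceeds the rejected one's, the loss is below log 2 (the value at zero margin) and the gradient pushes the margin wider.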
Key features of the DPOTrainer:
- Liger kernel support: For supported loss types (sigmoid, apo_zero, apo_down, sppo_hard, nca_pair), the Liger fused linear DPO loss kernel can be used for improved performance.
- Padding-free training: Sequences can be flattened into a single continuous sequence per batch, eliminating padding overhead (requires Flash Attention 2).
- WPO weighting: Optional per-sample loss weighting based on the WPO paper.
- Length desensitization (LD-DPO): Optional weighting that separates "public" (shared length) and "verbose" (extra length) portions of responses.
Usage
Use DPO training when:
- Aligning a language model with human preferences
- You have a dataset of preference pairs (chosen/rejected responses)
- You want a simpler alternative to the full RLHF pipeline (reward model + PPO)
- You need to combine multiple loss objectives (MPO-style training)
- You want to experiment with different preference optimization formulations
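A minimal training script, following the shape of the TRL quickstart (the model and dataset names are examples; in older TRL versions the tokenizer is passed as tokenizer= rather than processing_class=):

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

# Example model/dataset; any causal LM and any preference dataset with
# prompt/chosen/rejected columns will do.
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
train_dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

training_args = DPOConfig(
    output_dir="Qwen2-0.5B-DPO",
    beta=0.1,             # KL temperature from the RLHF objective
    loss_type="sigmoid",  # standard DPO loss
)
trainer = DPOTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    processing_class=tokenizer,
)
trainer.train()
```

No explicit reference model is passed here; DPOTrainer creates one from the policy checkpoint (or, with PEFT, evaluates the base model by disabling adapters).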
Theoretical Basis
The DPO loss is derived from the closed-form solution to the KL-constrained reward maximization problem in RLHF. Starting from the RLHF objective:
max_{pi} E_{x~D, y~pi}[r(x,y)] - beta * KL(pi || pi_ref)
The optimal solution is:
pi*(y|x) = (1/Z(x)) * pi_ref(y|x) * exp(r(x,y) / beta)
Reparameterizing the reward in terms of the optimal policy:
r(x,y) = beta * log(pi*(y|x) / pi_ref(y|x)) + beta * log Z(x)
Substituting into the Bradley-Terry preference model, P(y_w > y_l | x) = sigma(r(x, y_w) - r(x, y_l)), the beta * log Z(x) terms cancel (Z depends only on the prompt, not the completion), leaving:
L_DPO = -E[ log sigma( beta * (log(pi_theta(y_w|x)/pi_ref(y_w|x)) - log(pi_theta(y_l|x)/pi_ref(y_l|x))) ) ]
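The cancellation of the intractable log Z(x) term is the step that makes DPO practical, and it can be checked numerically on toy values (all numbers below are hypothetical):

```python
import math

beta = 0.5
log_ratio_w = 0.8   # log(pi(y_w|x) / pi_ref(y_w|x))
log_ratio_l = -0.4  # log(pi(y_l|x) / pi_ref(y_l|x))
log_Z = 3.7         # arbitrary partition-function value, same for both y

# Reparameterized rewards: r(x, y) = beta * log_ratio + beta * log Z(x)
r_w = beta * log_ratio_w + beta * log_Z
r_l = beta * log_ratio_l + beta * log_Z

sigmoid = lambda z: 1.0 / (1.0 + math.exp(-z))

# Bradley-Terry preference probability: log Z cancels in r_w - r_l,
# so the probability depends only on the log-ratio difference.
bt_prob = sigmoid(r_w - r_l)
dpo_prob = sigmoid(beta * (log_ratio_w - log_ratio_l))
print(abs(bt_prob - dpo_prob) < 1e-12)
```

Changing log_Z to any other value leaves bt_prob unchanged, which is why the loss never needs to evaluate the partition function.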
The key loss variants and their formulations:
Sigmoid (standard DPO):
L = -log sigma(beta * logits) * (1 - epsilon) - log sigma(-beta * logits) * epsilon
where logits = log(pi/pi_ref)(y_w) - log(pi/pi_ref)(y_l)
IPO (Identity Preference Optimization):
L = (logits - 1/(2*tau))^2
where tau = beta, and the per-token log probabilities are averaged over the completion length rather than summed
Hinge (SLiC):
L = max(0, 1 - beta * logits)
NCA pair:
L = -log sigma(beta * r_w) - 0.5 * log sigma(-beta * r_w) - 0.5 * log sigma(-beta * r_l)
where r_w = log(pi/pi_ref)(y_w), r_l = log(pi/pi_ref)(y_l)
DiscoPOP:
alpha = sigma(beta * logits / tau)
L = -log sigma(beta * logits) * (1 - alpha) + exp(-beta * logits) * alpha
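Several of the variants above can be written as scalar functions of the logits (the chosen-minus-rejected log-ratio difference). These are illustrative pure-Python sketches; TRL evaluates the same formulas on tensors, and the default values below are arbitrary:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dpo_sigmoid_loss(logits, beta=0.1, eps=0.0):
    # Label-smoothed sigmoid loss; eps = 0 recovers standard DPO.
    return (-math.log(sigmoid(beta * logits)) * (1 - eps)
            - math.log(sigmoid(-beta * logits)) * eps)

def ipo_loss(logits, tau=0.1):
    # Here logits should come from length-averaged log probs.
    return (logits - 1.0 / (2.0 * tau)) ** 2

def hinge_loss(logits, beta=0.1):
    return max(0.0, 1.0 - beta * logits)

def discopop_loss(logits, beta=0.05, tau=0.05):
    alpha = sigmoid(beta * logits / tau)  # log-ratio modulation
    return (-math.log(sigmoid(beta * logits)) * (1 - alpha)
            + math.exp(-beta * logits) * alpha)

# A wider preference margin lowers the sigmoid loss...
print(dpo_sigmoid_loss(2.0) < dpo_sigmoid_loss(0.0))
# ...while IPO is minimized exactly at logits = 1/(2*tau), not at infinity.
print(ipo_loss(5.0, tau=0.1))
```

The contrast visible here is the main practical difference: sigmoid and hinge losses keep rewarding ever-larger margins (hinge saturates at zero once beta * logits >= 1), whereas IPO penalizes overshooting its target margin.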