Principle:Lucidrains_X_transformers_Direct_Preference_Optimization
Metadata
| Field | Value |
|---|---|
| Page Type | Principle |
| Knowledge Sources | Paper (Direct Preference Optimization — Rafailov et al.), Repo (x-transformers) |
| Domains | Deep_Learning, NLP, Alignment, RLHF |
| Last Updated | 2026-02-08 18:00 GMT |
Overview
Preference-based training algorithm that directly optimizes a language model policy using human preference pairs without requiring a separate reward model.
Description
DPO training takes pairs of (preferred, unpreferred) completions for the same prompt. It computes log probabilities under both the policy and reference models, then optimizes the policy to increase the probability ratio for preferred over unpreferred completions. A prompt mask excludes prompt tokens from the loss computation, ensuring that only the completion tokens contribute to the training signal.
The training procedure operates as follows:
- Step 1 — Reference log probabilities: With `torch.no_grad()`, compute the log probabilities of both the preferred and unpreferred sequences under the frozen reference model.
- Step 2 — Policy log probabilities: Compute the log probabilities of both sequences under the trainable policy model (gradients flow through this computation).
- Step 3 — Prompt masking: Apply the prompt mask to exclude prompt tokens from the log probability computation. Optionally apply padding masks as well.
- Step 4 — Masked mean: Compute the masked mean of log probabilities over the non-prompt, non-padding tokens for each sequence.
- Step 5 — DPO loss: Compute the log-ratio differences and apply the DPO loss formula.
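The five steps above can be sketched in PyTorch. This is a minimal illustration under stated assumptions, not the x-transformers implementation: `seq_logprob` and `dpo_step` are hypothetical names, and the models are assumed to map token ids of shape (batch, T) to logits of shape (batch, T, vocab).

```python
import torch
import torch.nn.functional as F

def seq_logprob(model, seq, prompt_mask):
    # Hypothetical helper: masked mean log-prob per sequence, excluding
    # prompt tokens. `prompt_mask` is True on prompt positions.
    logits = model(seq)
    # Position t predicts token t+1: shift logits and targets (steps 3-4).
    logp = logits[:, :-1].log_softmax(dim=-1)
    target = seq[:, 1:]
    token_logp = logp.gather(-1, target.unsqueeze(-1)).squeeze(-1)
    keep = ~prompt_mask[:, 1:]  # keep only predicted completion tokens
    return (token_logp * keep).sum(dim=-1) / keep.sum(dim=-1).clamp(min=1)

def dpo_step(policy, reference, preferred, unpreferred, prompt_mask, beta=0.1):
    # Step 1: reference log probs, no gradients through the frozen model.
    with torch.no_grad():
        ref_w = seq_logprob(reference, preferred, prompt_mask)
        ref_l = seq_logprob(reference, unpreferred, prompt_mask)
    # Step 2: policy log probs (gradients flow here).
    pol_w = seq_logprob(policy, preferred, prompt_mask)
    pol_l = seq_logprob(policy, unpreferred, prompt_mask)
    # Step 5: difference of log-ratios, scaled by beta, through log-sigmoid.
    logits = beta * ((pol_w - pol_l) - (ref_w - ref_l))
    return -F.logsigmoid(logits).mean()
```

With identical policy and reference models the two log-ratios cancel, so the loss sits at its neutral value of log 2.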
Usage
Use after initializing the DPO wrapper (see Principle:Lucidrains_X_transformers_DPO_Wrapper_Setup). Requires a preference dataset with:
- Preferred sequences: Tokenized sequences containing the prompt followed by the human-preferred completion.
- Unpreferred sequences: Tokenized sequences containing the same prompt followed by the human-unpreferred completion.
- Prompt masks: Boolean masks where `True` indicates prompt tokens (which are excluded from the loss).
Both preferred and unpreferred sequences must have the same shape within each batch.
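A toy preference batch satisfying these requirements might look as follows; the token ids are arbitrary placeholders, not vocabulary entries from any real tokenizer.

```python
import torch

# One prompt shared by both candidates; completions padded to equal length.
prompt = [101, 7, 42]
preferred_completion = [5, 9, 2]
unpreferred_completion = [5, 3, 2]

preferred = torch.tensor([prompt + preferred_completion])      # shape (1, 6)
unpreferred = torch.tensor([prompt + unpreferred_completion])  # shape (1, 6)

# True marks prompt tokens; these positions are excluded from the loss.
prompt_mask = torch.tensor([[True] * len(prompt) + [False] * 3])

# Both sequences must have the same shape within the batch.
assert preferred.shape == unpreferred.shape
```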
Theoretical Basis
DPO Loss Function
The DPO loss follows Appendix B of Rafailov et al.:
L = -E[log σ(β · ((log π_θ(y_w|x) - log π_θ(y_l|x)) - (log π_ref(y_w|x) - log π_ref(y_l|x))))]
Where:
- `y_w` = preferred completion (winner).
- `y_l` = unpreferred completion (loser).
- `π_θ` = trainable policy model.
- `π_ref` = frozen reference model.
- `β` = DPO temperature controlling deviation strength.
- `σ` = sigmoid function.
Intuition
The loss can be decomposed into two log-ratio terms:
- Policy log-ratio: `log π_θ(y_w|x) - log π_θ(y_l|x)` — how much the policy prefers the winner over the loser.
- Reference log-ratio: `log π_ref(y_w|x) - log π_ref(y_l|x)` — how much the reference prefers the winner over the loser.
The loss pushes the policy's preference gap to exceed the reference's preference gap. The sigmoid ensures the loss is bounded and well-behaved. The β parameter scales the sensitivity: larger β makes the model more aggressively differentiate between preferred and unpreferred completions.
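The loss formula translates almost directly into code. The sketch below assumes per-sequence log probabilities have already been computed; `dpo_loss` is a hypothetical name, not an x-transformers function.

```python
import torch
import torch.nn.functional as F

def dpo_loss(pol_w, pol_l, ref_w, ref_l, beta=0.1):
    # pol_*/ref_*: log probs of winner (w) and loser (l) under the
    # policy and the frozen reference, each of shape (batch,).
    policy_ratio = pol_w - pol_l        # policy's preference gap
    reference_ratio = ref_w - ref_l     # reference's preference gap
    # -logsigmoid(beta * gap) is the numerically stable form of
    # -log(sigma(...)) from the loss formula above.
    return -F.logsigmoid(beta * (policy_ratio - reference_ratio)).mean()
```

When the policy's gap matches the reference's gap the loss is log 2; as the policy's gap grows past the reference's, the loss decays toward zero, and larger `beta` sharpens that decay.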
Log Probability Computation
Log probabilities are computed as masked means over non-prompt tokens. For a sequence of tokens [x_1, ..., x_T], the model predicts P(x_{t+1} | x_1, ..., x_t) at each position. The log probability of the sequence is the mean of log P(x_{t+1} | x_{≤t}) over all non-prompt positions. The masking ensures that prompt tokens (which are identical between preferred and unpreferred sequences) do not influence the loss.
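The masked mean can be shown with concrete numbers. The per-token log probabilities below are illustrative values, not model outputs.

```python
import torch

# Per-token log P(x_{t+1} | x_{<=t}) for the predicted tokens x_2..x_T.
token_logp = torch.tensor([-1.0, -2.0, -0.5, -1.5])
# True marks prompt positions among the predicted tokens; they are dropped.
prompt_mask = torch.tensor([True, True, False, False])

keep = ~prompt_mask
# Masked mean over non-prompt positions: (-0.5 + -1.5) / 2 = -1.0
seq_logp = (token_logp * keep).sum() / keep.sum()
```

Because the prompt tokens are identical between the preferred and unpreferred sequences, excluding them this way leaves only the completion tokens to drive the loss.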