Principle:Lucidrains_X_transformers_Direct_Preference_Optimization
Metadata
| Field | Value |
|---|---|
| Page Type | Principle |
| Knowledge Sources | Paper (Direct Preference Optimization — Rafailov et al.), Repo (x-transformers) |
| Domains | Deep_Learning, NLP, Alignment, RLHF |
| Last Updated | 2026-02-08 18:00 GMT |
Overview
Preference-based training algorithm that directly optimizes a language model policy using human preference pairs without requiring a separate reward model.
Description
DPO training takes pairs of (preferred, unpreferred) completions for the same prompt. It computes log probabilities under both the policy and reference models, then optimizes the policy to increase the probability ratio for preferred over unpreferred completions. A prompt mask excludes prompt tokens from the loss computation, ensuring that only the completion tokens contribute to the training signal.
The training procedure operates as follows:
- Step 1 — Reference log probabilities: With `torch.no_grad()`, compute the log probabilities of both the preferred and unpreferred sequences under the frozen reference model.
- Step 2 — Policy log probabilities: Compute the log probabilities of both sequences under the trainable policy model (gradients flow through this computation).
- Step 3 — Prompt masking: Apply the prompt mask to exclude prompt tokens from the log probability computation. Optionally apply padding masks as well.
- Step 4 — Masked mean: Compute the masked mean of log probabilities over the non-prompt, non-padding tokens for each sequence.
- Step 5 — DPO loss: Compute the log-ratio differences and apply the DPO loss formula.
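The five steps above can be sketched in PyTorch. This is a minimal illustration under stated assumptions, not the x-transformers implementation: `seq_logprob` and `dpo_step` are hypothetical names, and the models are assumed to map token ids of shape (batch, T) to logits of shape (batch, T, vocab).

```python
import torch
import torch.nn.functional as F

def seq_logprob(model, seq, prompt_mask):
    # Hypothetical helper: masked mean log-prob per sequence, excluding
    # prompt tokens. `prompt_mask` is True on prompt positions.
    logits = model(seq)
    # Position t predicts token t+1: shift logits and targets (steps 3-4).
    logp = logits[:, :-1].log_softmax(dim=-1)
    target = seq[:, 1:]
    token_logp = logp.gather(-1, target.unsqueeze(-1)).squeeze(-1)
    keep = ~prompt_mask[:, 1:]  # keep only predicted completion tokens
    return (token_logp * keep).sum(dim=-1) / keep.sum(dim=-1).clamp(min=1)

def dpo_step(policy, reference, preferred, unpreferred, prompt_mask, beta=0.1):
    # Step 1: reference log probs, no gradients through the frozen model.
    with torch.no_grad():
        ref_w = seq_logprob(reference, preferred, prompt_mask)
        ref_l = seq_logprob(reference, unpreferred, prompt_mask)
    # Step 2: policy log probs (gradients flow here).
    pol_w = seq_logprob(policy, preferred, prompt_mask)
    pol_l = seq_logprob(policy, unpreferred, prompt_mask)
    # Step 5: difference of log-ratios, scaled by beta, through log-sigmoid.
    logits = beta * ((pol_w - pol_l) - (ref_w - ref_l))
    return -F.logsigmoid(logits).mean()
```

With identical policy and reference models the two log-ratios cancel, so the loss sits at its neutral value of log 2.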
Usage
Use after initializing the DPO wrapper (see Principle:Lucidrains_X_transformers_DPO_Wrapper_Setup). Requires a preference dataset with:
- Preferred sequences: Tokenized sequences containing the prompt followed by the human-preferred completion.
- Unpreferred sequences: Tokenized sequences containing the same prompt followed by the human-unpreferred completion.
- Prompt masks: Boolean masks where `True` indicates prompt tokens (which are excluded from the loss).
Both preferred and unpreferred sequences must have the same shape within each batch.
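A toy preference batch satisfying these requirements might look as follows; the token ids are arbitrary placeholders, not vocabulary entries from any real tokenizer.

```python
import torch

# One prompt shared by both candidates; completions padded to equal length.
prompt = [101, 7, 42]
preferred_completion = [5, 9, 2]
unpreferred_completion = [5, 3, 2]

preferred = torch.tensor([prompt + preferred_completion])      # shape (1, 6)
unpreferred = torch.tensor([prompt + unpreferred_completion])  # shape (1, 6)

# True marks prompt tokens; these positions are excluded from the loss.
prompt_mask = torch.tensor([[True] * len(prompt) + [False] * 3])

# Both sequences must have the same shape within the batch.
assert preferred.shape == unpreferred.shape
```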
Theoretical Basis
DPO Loss Function
The DPO loss follows Appendix B of Rafailov et al.:
L = -E[log σ(β · ((log π_θ(y_w|x) - log π_θ(y_l|x)) - (log π_ref(y_w|x) - log π_ref(y_l|x))))]
Where:
- `y_w` = preferred completion (winner).
- `y_l` = unpreferred completion (loser).
- `π_θ` = trainable policy model.
- `π_ref` = frozen reference model.
- `β` = DPO temperature controlling deviation strength.
- `σ` = sigmoid function.
Intuition
The loss can be decomposed into two log-ratio terms:
- Policy log-ratio: `log π_θ(y_w|x) - log π_θ(y_l|x)` — how much the policy prefers the winner over the loser.
- Reference log-ratio: `log π_ref(y_w|x) - log π_ref(y_l|x)` — how much the reference prefers the winner over the loser.
The loss pushes the policy's preference gap to exceed the reference's preference gap. The sigmoid ensures the loss is bounded and well-behaved. The β parameter scales the sensitivity: larger β makes the model more aggressively differentiate between preferred and unpreferred completions.
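The loss formula translates almost directly into code. The sketch below assumes per-sequence log probabilities have already been computed; `dpo_loss` is a hypothetical name, not an x-transformers function.

```python
import torch
import torch.nn.functional as F

def dpo_loss(pol_w, pol_l, ref_w, ref_l, beta=0.1):
    # pol_*/ref_*: log probs of winner (w) and loser (l) under the
    # policy and the frozen reference, each of shape (batch,).
    policy_ratio = pol_w - pol_l        # policy's preference gap
    reference_ratio = ref_w - ref_l     # reference's preference gap
    # -logsigmoid(beta * gap) is the numerically stable form of
    # -log(sigma(...)) from the loss formula above.
    return -F.logsigmoid(beta * (policy_ratio - reference_ratio)).mean()
```

When the policy's gap matches the reference's gap the loss is log 2; as the policy's gap grows past the reference's, the loss decays toward zero, and larger `beta` sharpens that decay.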
Log Probability Computation
Log probabilities are computed as masked means over non-prompt tokens. For a sequence of tokens [x_1, ..., x_T], the model predicts P(x_{t+1} | x_1, ..., x_t) at each position. The log probability of the sequence is the mean of log P(x_{t+1} | x_{≤t}) over all non-prompt positions. The masking ensures that prompt tokens (which are identical between preferred and unpreferred sequences) do not influence the loss.
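The masked mean can be shown with concrete numbers. The per-token log probabilities below are illustrative values, not model outputs.

```python
import torch

# Per-token log P(x_{t+1} | x_{<=t}) for the predicted tokens x_2..x_T.
token_logp = torch.tensor([-1.0, -2.0, -0.5, -1.5])
# True marks prompt positions among the predicted tokens; they are dropped.
prompt_mask = torch.tensor([True, True, False, False])

keep = ~prompt_mask
# Masked mean over non-prompt positions: (-0.5 + -1.5) / 2 = -1.0
seq_logp = (token_logp * keep).sum() / keep.sum()
```

Because the prompt tokens are identical between the preferred and unpreferred sequences, excluding them this way leaves only the completion tokens to drive the loss.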