
Principle:Huggingface Trl DPO Training

From Leeroopedia


Knowledge Sources
Domains NLP, RLHF
Last Updated 2026-02-06 17:00 GMT

Overview

Direct Preference Optimization (DPO) learns from human preference pairs without requiring a separate reward model by using a closed-form loss derived from the constrained RLHF objective.

Description

DPO training is the core optimization loop that adjusts the policy model's parameters to better align with human preferences. Unlike the traditional RLHF pipeline (which trains a reward model, then optimizes the policy via PPO), DPO directly optimizes the policy using a supervised-style loss on preference pairs.

The training procedure works as follows:

  1. Forward pass on concatenated batch: For each training batch, the chosen and rejected completions are concatenated along the batch dimension (doubling the effective batch size) and passed through the model in a single forward pass. This is more efficient than two separate forward passes, especially for FSDP training.
  2. Log probability computation: Per-token log probabilities are computed for both chosen and rejected completions using selective_log_softmax. The prompt tokens are masked out (loss is only computed over completion tokens). The per-token log probs are summed to get sequence-level log probabilities.
  3. Reference log probability computation: Reference model log probabilities are either retrieved from precomputed values or computed on-the-fly using the reference model (or by disabling PEFT adapters).
  4. Loss computation: The DPO loss computes the log-probability ratio between policy and reference for both chosen and rejected responses, then applies the selected loss function (sigmoid, IPO, hinge, etc.) with the beta temperature parameter.
  5. Multi-loss combination: Multiple loss types can be combined with configurable weights, enabling approaches like MPO which blends sigmoid DPO loss with SFT loss.
  6. Gradient update: Standard gradient descent with gradient accumulation, gradient checkpointing, and mixed-precision training.
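
Steps 2-4 above can be sketched in plain Python with scalar toy values (all numbers, names, and the masking convention below are illustrative, not TRL's actual tensor code):

```python
import math

def sequence_logp(token_logps, completion_mask):
    """Sum per-token log probs over completion tokens only (prompt masked out)."""
    return sum(lp for lp, m in zip(token_logps, completion_mask) if m)

def dpo_sigmoid_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard (sigmoid) DPO loss on sequence-level log probabilities."""
    logits = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * logits)))  # -log sigma(beta * logits)

# Toy per-token log probs: first two tokens are prompt (masked), rest are completion.
chosen_logps   = [-0.5, -0.7, -0.2, -0.3, -0.1]
rejected_logps = [-0.5, -0.7, -0.9, -1.1, -1.4]
mask = [0, 0, 1, 1, 1]

pc = sequence_logp(chosen_logps, mask)    # -0.6
pr = sequence_logp(rejected_logps, mask)  # -3.4
# Pretend the reference model assigns equal log probs to both completions:
loss = dpo_sigmoid_loss(pc, pr, ref_chosen=-1.0, ref_rejected=-1.0, beta=0.1)
```

Because the policy already prefers the chosen completion relative to the reference, the resulting loss is below log 2.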

Key features of the DPOTrainer:

  • Liger kernel support: For supported loss types (sigmoid, apo_zero, apo_down, sppo_hard, nca_pair), the Liger fused linear DPO loss kernel can be used for improved performance.
  • Padding-free training: Sequences can be flattened into a single continuous sequence per batch, eliminating padding overhead (requires Flash Attention 2).
  • WPO weighting: Optional per-sample loss weighting based on the WPO paper.
  • Length desensitization (LD-DPO): Optional weighting that separates "public" (shared length) and "verbose" (extra length) portions of responses.
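
The padding-free idea can be illustrated in plain Python (the function name here is made up; the cumulative-length list mirrors the boundary tensor that Flash Attention's variable-length kernels consume):

```python
# Padding-free batching sketch (illustrative, not TRL's actual implementation):
# variable-length sequences are concatenated into one flat stream, and
# cumulative sequence lengths tell the attention kernel where each one starts.

def flatten_batch(sequences):
    flat, cu_seqlens = [], [0]
    for seq in sequences:
        flat.extend(seq)
        cu_seqlens.append(cu_seqlens[-1] + len(seq))
    return flat, cu_seqlens

batch = [[101, 7, 8, 102], [101, 9, 102], [101, 4, 5, 6, 102]]
flat, cu = flatten_batch(batch)
# flat holds exactly sum(len(seq)) tokens and no padding;
# cu marks where each sequence begins and ends.
```

With standard padding, the same batch would occupy 3 x 5 = 15 token slots instead of 12; the savings grow with length variance.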

Usage

Use DPO training when:

  • Aligning a language model with human preferences
  • You have a dataset of preference pairs (chosen/rejected responses)
  • You want a simpler alternative to the full RLHF pipeline (reward model + PPO)
  • You need to combine multiple loss objectives (MPO-style training)
  • You want to experiment with different preference optimization formulations
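
The expected preference-pair data shape can be sketched as follows (the rows are toy examples; the commented-out trainer call is indicative only, so consult the TRL documentation for the exact API of your version):

```python
# A DPO preference dataset row pairs one prompt with a chosen and a
# rejected completion.
train_rows = [
    {"prompt": "What is 2+2?", "chosen": "4.", "rejected": "22."},
    {"prompt": "Capital of France?", "chosen": "Paris.", "rejected": "Lyon."},
]

required = {"prompt", "chosen", "rejected"}
ok = all(required <= row.keys() for row in train_rows)

# With TRL installed, training then looks roughly like (sketch):
#
#   from trl import DPOConfig, DPOTrainer
#   trainer = DPOTrainer(model=model, args=DPOConfig(beta=0.1),
#                        train_dataset=dataset, processing_class=tokenizer)
#   trainer.train()
```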

Theoretical Basis

The DPO loss is derived from the closed-form solution to the KL-constrained reward maximization problem in RLHF. Starting from the RLHF objective:

max_{pi} E_{x~D, y~pi}[r(x,y)] - beta * KL(pi || pi_ref)

The optimal solution is:

pi*(y|x) = (1/Z(x)) * pi_ref(y|x) * exp(r(x,y) / beta)

Reparameterizing the reward in terms of the optimal policy:

r(x,y) = beta * log(pi*(y|x) / pi_ref(y|x)) + beta * log Z(x)

Substituting this reward into the Bradley-Terry preference model, P(y_w > y_l | x) = sigma(r(x, y_w) - r(x, y_l)), the beta * log Z(x) terms cancel (both completions share the same prompt), leaving:

L_DPO = -E[ log sigma( beta * (log(pi_theta(y_w|x)/pi_ref(y_w|x)) - log(pi_theta(y_l|x)/pi_ref(y_l|x))) ) ]
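
One useful consequence of this form: at initialization the policy equals the reference, so both log ratios vanish and every pair contributes exactly log 2 to the loss. A scalar check (toy values):

```python
import math

def dpo_loss(policy_ratio_w, policy_ratio_l, beta=0.1):
    """L_DPO for one pair, given log(pi/pi_ref) for the winner and loser."""
    logits = policy_ratio_w - policy_ratio_l
    return -math.log(1.0 / (1.0 + math.exp(-beta * logits)))

# At initialization pi_theta == pi_ref, so both log ratios are 0 and the
# loss is exactly log(2) ~ 0.6931 for every pair:
init_loss = dpo_loss(0.0, 0.0)

# As the policy comes to prefer y_w over y_l relative to the reference,
# the loss falls below log(2):
trained_loss = dpo_loss(2.0, -1.0)  # logits = 3, beta * logits = 0.3
```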

The key loss variants and their formulations:

Sigmoid (standard DPO):

L = -log sigma(beta * logits) * (1 - epsilon) - log sigma(-beta * logits) * epsilon
where logits = log(pi/pi_ref)(y_w) - log(pi/pi_ref)(y_l)

IPO (Identity Preference Optimization):

L = (logits - 1/(2*tau))^2
where tau = beta, and logits are normalized by sequence length

Hinge (SLiC):

L = max(0, 1 - beta * logits)

NCA pair:

L = -log sigma(beta * r_w) - 0.5 * log sigma(-beta * r_w) - 0.5 * log sigma(-beta * r_l)
where r_w = log(pi/pi_ref)(y_w), r_l = log(pi/pi_ref)(y_l)

DiscoPOP:

alpha = sigma(beta * logits / tau)
L = -log sigma(beta * logits) * (1 - alpha) + exp(-beta * logits) * alpha
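
The variants above can be written as scalar functions for comparison (a sketch, not TRL's batched implementation; the DiscoPOP tau default of 0.05 is an illustrative assumption):

```python
import math

def sigma(x):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-x))

def ipo_loss(logits, beta):
    # logits here are length-normalized log-ratio differences
    return (logits - 1.0 / (2.0 * beta)) ** 2

def hinge_loss(logits, beta):
    return max(0.0, 1.0 - beta * logits)

def nca_pair_loss(r_w, r_l, beta):
    # r_w, r_l are the per-response log ratios log(pi/pi_ref)
    return (-math.log(sigma(beta * r_w))
            - 0.5 * math.log(sigma(-beta * r_w))
            - 0.5 * math.log(sigma(-beta * r_l)))

def discopop_loss(logits, beta, tau=0.05):
    alpha = sigma(beta * logits / tau)  # mixing weight between the two terms
    return (-math.log(sigma(beta * logits)) * (1.0 - alpha)
            + math.exp(-beta * logits) * alpha)

# Each variant rewards a larger preference margin between y_w and y_l:
small, large = 0.5, 3.0
```

A quick sanity check confirms that for each variant the loss at the larger margin is below the loss at the smaller one.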

Related Pages

Implemented By
