
Principle:Huggingface Alignment handbook Direct Preference Optimization

From Leeroopedia


Knowledge Sources
Domains NLP, Deep_Learning, Reinforcement_Learning
Last Updated 2026-02-07 00:00 GMT

Overview

An alignment algorithm that directly optimizes a language model's policy using human preference data without requiring a separate reward model or reinforcement learning loop.

Description

Direct Preference Optimization (DPO) is an alternative to RLHF that eliminates the need for training a reward model and running PPO. Instead, DPO reparameterizes the reward function to derive a closed-form loss that directly optimizes the policy model using preference pairs (chosen vs. rejected responses).

DPO addresses the complexity and instability of the traditional RLHF pipeline (SFT → Reward Model → PPO) by showing that the optimal policy under the Bradley-Terry preference model can be learned with a simple binary cross-entropy-like loss. This makes DPO simpler to implement, more stable to train, and computationally cheaper than PPO-based RLHF.

In the alignment-handbook, DPO is the second stage of the alignment pipeline, applied after SFT. The DPO trainer requires both a policy model (the model being optimized) and a frozen reference model (typically the SFT checkpoint) to compute the implicit reward.
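The implicit reward mentioned above can be sketched in plain Python: it is β times the log-probability ratio of the response under the policy versus the frozen reference. The per-token log-probability values below are made up for illustration; a real trainer would obtain them from forward passes of the two models.

```python
def implicit_reward(policy_token_logps, ref_token_logps, beta):
    """DPO's implicit reward for one response y given prompt x:
    beta * log(pi_theta(y|x) / pi_ref(y|x)),
    computed from per-token log-probabilities of the response tokens."""
    return beta * (sum(policy_token_logps) - sum(ref_token_logps))

# Hypothetical per-token log-probs for one response under each model
policy_logps = [-0.5, -1.2, -0.3]   # sums to -2.0
ref_logps = [-0.9, -1.5, -0.6]      # sums to -3.0

reward = implicit_reward(policy_logps, ref_logps, beta=0.1)  # 0.1 * 1.0 = 0.1
```

A positive reward means the policy assigns the response more probability than the reference does; the reference model is kept frozen so this ratio stays anchored to the SFT checkpoint.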

Usage

Use DPO when:

  • You have preference data (chosen/rejected response pairs for the same prompts)
  • You want to improve model alignment beyond what SFT achieves
  • You prefer a simpler alternative to PPO-based RLHF
  • A reference model (usually the SFT checkpoint) is available

Theoretical Basis

The DPO loss is derived from the Bradley-Terry preference model:

\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right]

Where:

  • πθ is the policy model being trained
  • πref is the frozen reference model (SFT checkpoint)
  • yw is the chosen (preferred) response
  • yl is the rejected response
  • β is a temperature parameter controlling deviation from the reference policy
  • σ is the sigmoid function

# Abstract DPO algorithm (pseudocode, not a runnable implementation)
for prompt, chosen, rejected in preference_data:
    # Log-probability ratios of each response (conditioned on the prompt)
    # against the frozen reference model
    log_ratio_chosen = log_prob(policy, prompt, chosen) - log_prob(ref_model, prompt, chosen)
    log_ratio_rejected = log_prob(policy, prompt, rejected) - log_prob(ref_model, prompt, rejected)
    loss = -log_sigmoid(beta * (log_ratio_chosen - log_ratio_rejected))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

The beta parameter (typically 0.01-0.1) controls how much the policy can deviate from the reference model. Lower values allow more deviation; higher values keep the policy closer to the reference.
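The per-example loss can be computed directly from the two log-ratios; a minimal pure-Python sketch (the log-ratio values below are hypothetical, chosen so the policy already prefers the chosen response):

```python
import math

def dpo_loss(log_ratio_chosen, log_ratio_rejected, beta):
    """Per-example DPO loss: -log sigmoid(beta * margin),
    written as log(1 + e^{-beta * margin})."""
    margin = beta * (log_ratio_chosen - log_ratio_rejected)
    return math.log(1.0 + math.exp(-margin))

# Hypothetical log-ratios: policy favors the chosen response more than the reference does
loss_low_beta = dpo_loss(2.0, -1.0, beta=0.01)
loss_high_beta = dpo_loss(2.0, -1.0, beta=0.1)
```

When the margin is zero (policy and reference agree), the loss is log 2; as the policy favors the chosen response more strongly relative to the reference, the loss falls toward zero.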

Related Pages

Implemented By

Uses Heuristic
