
Principle: Direct Preference Optimization

From Leeroopedia


Knowledge Sources
Domains: NLP, Alignment, Optimization
Last Updated: 2026-02-08 00:00 GMT

Overview

An alignment technique that directly optimizes a language model on preference data without training a separate reward model or using reinforcement learning.

Description

Direct Preference Optimization (DPO) reformulates the RLHF objective as a simple classification loss over preference pairs. Instead of the traditional multi-stage RLHF pipeline (supervised fine-tuning, reward-model training, then PPO optimization), DPO derives a closed-form loss that directly increases the likelihood of preferred responses relative to rejected ones, regularized by a KL divergence from a reference model.

DPO achieves comparable or superior results to PPO-based RLHF while being simpler to implement, more stable to train, and computationally lighter.

Usage

Use DPO when aligning a language model to human preferences. It requires a preference dataset with (prompt, chosen, rejected) triples, a policy model, and a frozen reference model. DPO is preferred over PPO when simplicity and training stability are priorities.
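A preference record can be illustrated as follows. The field names below are one common convention (e.g. TRL's `DPOTrainer` expects `prompt`/`chosen`/`rejected` columns); the example text is hypothetical:

```python
# One (prompt, chosen, rejected) triple from a preference dataset.
# Field names follow a common convention; the strings are illustrative.
example = {
    "prompt":   "Explain why the sky is blue.",
    "chosen":   "Sunlight scatters off air molecules, and shorter (blue) "
                "wavelengths scatter more strongly than longer ones.",
    "rejected": "The sky is blue because it reflects the ocean.",
}
```

The "chosen" response is the one a human annotator preferred; no scalar reward labels are needed, only the pairwise preference.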

Theoretical Basis

The DPO loss is:

$$\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\left[\log\sigma\!\left(\beta\log\frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta\log\frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]$$

Where:

  • πθ is the policy model being trained
  • πref is the frozen reference model
  • yw is the preferred response, yl is the rejected response
  • β controls how much the policy can diverge from the reference (default 0.1)

Higher β means less divergence from the reference model.
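The loss above can be sketched for a single preference pair. The function name and plain-float interface are illustrative (a real trainer computes these quantities from token-level model outputs in a batch); the math follows the formula directly:

```python
import math

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Per-example DPO loss (illustrative sketch, not a library API).

    Each argument is the summed log-probability of a full response, e.g.
    policy_logp_w = log pi_theta(y_w | x), ref_logp_l = log pi_ref(y_l | x).
    """
    # Margin between the implicit rewards of chosen and rejected responses:
    # beta * (chosen log-ratio minus rejected log-ratio).
    margin = beta * ((policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l))
    # -log sigma(margin), written as the numerically stable softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

When the policy still agrees with the reference, the margin is zero and the loss is log 2; as the policy learns to rank the chosen response above the rejected one more strongly than the reference does, the margin grows and the loss falls toward zero.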
