
Principle: Direct Preference Optimization

From Leeroopedia


Knowledge Sources
Domains: NLP, Alignment, Optimization
Last Updated: 2026-02-08 00:00 GMT

Overview

An alignment technique that directly optimizes a language model on preference data without training a separate reward model or using reinforcement learning.

Description

Direct Preference Optimization (DPO) reformulates the RLHF objective as a simple classification loss over preference pairs. Instead of the traditional multi-stage RLHF pipeline (supervised fine-tuning, reward-model training, then PPO optimization), DPO derives a closed-form loss that directly increases the likelihood of preferred responses relative to rejected ones, regularized by a KL divergence from a reference model.

DPO achieves comparable or superior results to PPO-based RLHF while being simpler to implement, more stable to train, and computationally lighter.

Usage

Use DPO when aligning a language model to human preferences. It requires a preference dataset with (prompt, chosen, rejected) triples, a policy model, and a frozen reference model. DPO is preferred over PPO when simplicity and training stability are priorities.
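A preference record can be illustrated as follows. The field names below are one common convention (e.g. TRL's `DPOTrainer` expects `prompt`/`chosen`/`rejected` columns); the example text is hypothetical:

```python
# One (prompt, chosen, rejected) triple from a preference dataset.
# Field names follow a common convention; the strings are illustrative.
example = {
    "prompt":   "Explain why the sky is blue.",
    "chosen":   "Sunlight scatters off air molecules, and shorter (blue) "
                "wavelengths scatter more strongly than longer ones.",
    "rejected": "The sky is blue because it reflects the ocean.",
}
```

The "chosen" response is the one a human annotator preferred; no scalar reward labels are needed, only the pairwise preference.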

Theoretical Basis

The DPO loss is:

$$\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\left[\log\sigma\!\left(\beta\log\frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta\log\frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]$$

Where:

  • πθ is the policy model being trained
  • πref is the frozen reference model
  • yw is the preferred response, yl is the rejected response
  • β controls how much the policy can diverge from the reference (default 0.1)

Higher β means less divergence from the reference model.
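The loss above can be sketched for a single preference pair. The function name and plain-float interface are illustrative (a real trainer computes these quantities from token-level model outputs in a batch); the math follows the formula directly:

```python
import math

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Per-example DPO loss (illustrative sketch, not a library API).

    Each argument is the summed log-probability of a full response, e.g.
    policy_logp_w = log pi_theta(y_w | x), ref_logp_l = log pi_ref(y_l | x).
    """
    # Margin between the implicit rewards of chosen and rejected responses:
    # beta * (chosen log-ratio minus rejected log-ratio).
    margin = beta * ((policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l))
    # -log sigma(margin), written as the numerically stable softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

When the policy still agrees with the reference, the margin is zero and the loss is log 2; as the policy learns to rank the chosen response above the rejected one more strongly than the reference does, the margin grows and the loss falls toward zero.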
