
Principle:OpenRLHF Direct Preference Optimization

From Leeroopedia


Knowledge Sources
Domains NLP, Alignment, Training
Last Updated 2026-02-07 00:00 GMT

Overview

An alignment method that directly optimizes a language model policy from preference data without training an explicit reward model.

Description

Direct Preference Optimization (DPO) reformulates the RLHF objective to eliminate the need for a separate reward model and RL training loop. It derives a closed-form solution for the optimal policy under a KL-constrained reward maximization objective, then directly optimizes the policy using a binary cross-entropy loss over preference pairs.

DPO requires a frozen reference model (typically the SFT model) and a policy model that is trained to increase the log-probability ratio of chosen over rejected responses relative to the reference model.
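The training objective described above can be sketched as a small loss function. This is a minimal illustration, not OpenRLHF's actual API: it assumes per-sequence summed log-probabilities for the chosen and rejected responses have already been computed for both the policy and the frozen reference model, and all names are hypothetical.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Sketch of the DPO loss over a batch of preference pairs."""
    # Log-probability ratios of the trained policy vs. the frozen reference
    chosen_ratio = policy_chosen_logps - ref_chosen_logps
    rejected_ratio = policy_rejected_logps - ref_rejected_logps
    # Binary cross-entropy on the beta-scaled margin between chosen and rejected
    logits = beta * (chosen_ratio - rejected_ratio)
    return -F.logsigmoid(logits).mean()
```

Minimizing this loss pushes the policy to raise the log-probability ratio of chosen responses, and lower it for rejected ones, relative to the reference model, without ever materializing an explicit reward model.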

Usage

Use DPO when you have preference data but want to avoid the complexity of training a separate reward model and running PPO. DPO is simpler and often more stable than PPO, though it requires paired preference data. It is also used in iterative DPO loops with on-policy data generation.

Theoretical Basis

DPO starts from the KL-constrained RLHF objective and derives the implicit reward: $r(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \beta \log Z(x)$

Substituting into the Bradley-Terry preference model gives the DPO loss: $\mathcal{L}_{\mathrm{DPO}} = -\mathbb{E}_{(x, y_w, y_l)}\left[\log \sigma\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]$
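A quick numeric check of the loss above, using hypothetical log-probabilities (sums over response tokens) chosen purely for illustration:

```python
import math

beta = 0.1
# Hypothetical summed log-probs: policy vs. reference, chosen (y_w) vs. rejected (y_l)
logp_w, ref_logp_w = -10.0, -10.5   # policy favors y_w more than the reference does
logp_l, ref_logp_l = -12.0, -11.5   # policy favors y_l less than the reference does

# Beta-scaled margin between the two log-probability ratios
margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))  # 0.1 * (0.5 + 0.5)
loss = -math.log(1.0 / (1.0 + math.exp(-margin)))                # -log sigma(margin)
```

A positive margin (policy already prefers the chosen response more strongly than the reference) yields a loss below log 2; a zero margin gives exactly log 2, the chance-level value of the binary cross-entropy.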

Variants supported:

  • Standard DPO: The loss above
  • cDPO: Conservative DPO with label smoothing
  • IPO: Identity Preference Optimization, which replaces the log-sigmoid with a squared loss

Related Pages

Implemented By
