Principle: OpenRLHF Direct Preference Optimization
| Knowledge Sources | |
|---|---|
| Domains | NLP, Alignment, Training |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
An alignment method that directly optimizes a language model policy from preference data without training an explicit reward model.
Description
Direct Preference Optimization (DPO) reformulates the RLHF objective to eliminate the need for a separate reward model and RL training loop. It derives a closed-form solution for the optimal policy under a KL-constrained reward maximization objective, then directly optimizes the policy using a binary cross-entropy loss over preference pairs.
DPO requires a frozen reference model (typically the SFT model) and a policy model that is trained to increase the log-probability ratio of chosen over rejected responses relative to the reference model.
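The mechanics above can be sketched as a loss function over summed per-sequence log-probabilities. This is a minimal sketch, not OpenRLHF's actual implementation; the function name, argument layout, and the default `beta` are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO loss over a batch of preference pairs.

    Each argument is a 1-D tensor of summed per-token log-probabilities,
    one entry per sequence in the batch. The reference tensors come from
    the frozen SFT model and receive no gradient.
    """
    # Log-probability ratios of the trainable policy vs. the frozen reference
    chosen_logratios = policy_chosen_logps - ref_chosen_logps
    rejected_logratios = policy_rejected_logps - ref_rejected_logps
    # Implicit reward margin; beta scales the strength of the KL constraint
    logits = beta * (chosen_logratios - rejected_logratios)
    # Binary cross-entropy: push sigma(margin) toward 1 for chosen > rejected
    return -F.logsigmoid(logits).mean()
```

At a zero margin the loss equals log 2; it decreases as the policy raises the chosen ratio relative to the rejected one.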
Usage
Use DPO when you have preference data but want to avoid the complexity of training a separate reward model and running PPO. DPO is simpler and often more stable than PPO, though it requires paired preference data. It is also used in iterative DPO loops with on-policy data generation.
Theoretical Basis
DPO starts from the KL-constrained RLHF objective and derives the implicit reward:

$$r(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\text{ref}}(y \mid x)} + \beta \log Z(x)$$
Substituting into the Bradley-Terry preference model (where the partition function $Z(x)$ cancels) gives the DPO loss:

$$\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right]$$
Variants supported:
- Standard DPO: The loss above
- cDPO: Conservative DPO with label smoothing
- IPO: Identity Preference Optimization, which replaces the log-sigmoid with a squared loss
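The three variants can be expressed as small modifications of the same margin term. This is an illustrative sketch, not OpenRLHF's code; the function name and defaults are assumptions, and the IPO branch follows the common convention of reusing `beta` in the role of the IPO temperature, so the squared loss pulls the margin toward `1/(2*beta)`.

```python
import torch
import torch.nn.functional as F

def preference_loss(margin: torch.Tensor, beta: float = 0.1,
                    variant: str = "dpo",
                    label_smoothing: float = 0.1) -> torch.Tensor:
    """margin = (chosen policy/ref log-ratio) - (rejected policy/ref log-ratio)."""
    if variant == "dpo":
        # Standard DPO: binary cross-entropy on the scaled margin
        return -F.logsigmoid(beta * margin).mean()
    if variant == "cdpo":
        # Conservative DPO: assume each preference label is flipped
        # with probability label_smoothing
        return (-(1 - label_smoothing) * F.logsigmoid(beta * margin)
                - label_smoothing * F.logsigmoid(-beta * margin)).mean()
    if variant == "ipo":
        # IPO: squared loss regressing the margin toward 1/(2*beta)
        return ((margin - 1 / (2 * beta)) ** 2).mean()
    raise ValueError(f"unknown variant: {variant}")
```

Unlike the sigmoid losses, the IPO objective penalizes margins that overshoot the target, which limits over-optimization on noisy pairs.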