
Principle:Hpcaitech ColossalAI DPO Training

From Leeroopedia


Knowledge Sources
Domains NLP, Reinforcement_Learning
Last Updated 2026-02-09 00:00 GMT

Overview

A preference alignment algorithm that directly optimizes a language model's policy using human preference data without requiring a separate reward model.

Description

Direct Preference Optimization (DPO) reformulates the RLHF objective as a classification problem on preference pairs. Instead of training a reward model and then using RL (like PPO), DPO directly increases the probability of chosen responses relative to rejected responses, with a KL divergence penalty against a frozen reference model to prevent distribution collapse.

The training loop requires maintaining two models: a trainable policy model and a frozen reference model (typically a copy of the initial policy). For each batch, both models compute log probabilities on chosen and rejected sequences, and the DPO loss is computed from the difference in log probability ratios.
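The per-batch computation described above can be sketched as follows. This is a minimal illustration, not ColossalAI's actual API: `policy_logp` and `ref_logp` are hypothetical callables returning the summed log-probability log π(y|x) of a response given a prompt under the trainable policy and the frozen reference model, respectively.

```python
import math

def dpo_batch_loss(policy_logp, ref_logp, batch, beta=0.1):
    """Average DPO loss over a batch of (prompt, chosen, rejected) triples.

    policy_logp / ref_logp are hypothetical callables returning
    log pi(y | x) under the trainable policy and the frozen reference.
    """
    losses = []
    for x, y_w, y_l in batch:
        # Log-probability ratios of the policy against the frozen reference
        chosen_ratio = policy_logp(x, y_w) - ref_logp(x, y_w)
        rejected_ratio = policy_logp(x, y_l) - ref_logp(x, y_l)
        margin = beta * (chosen_ratio - rejected_ratio)
        # -log sigmoid(margin) == log(1 + exp(-margin)), via log1p for stability
        losses.append(math.log1p(math.exp(-margin)))
    return sum(losses) / len(losses)
```

Note that when the policy has not moved from the reference, both ratios are zero and the loss is log 2 per pair; the loss shrinks as the policy assigns relatively more probability to chosen responses than the frozen reference does.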

Usage

Use DPO when you have human preference data (chosen/rejected response pairs) and want to align a model without the complexity of training a separate reward model. In practice, DPO is typically simpler and more stable to train than PPO-based RLHF.

Theoretical Basis

The DPO loss function:

$$\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\pi_{\mathrm{ref}}) = -\mathbb{E}_{(x,\,y_w,\,y_l)}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]$$

Where:

  • πθ is the policy model being trained
  • πref is the frozen reference model
  • yw is the chosen (winning) response
  • yl is the rejected (losing) response
  • β is the temperature parameter controlling deviation from the reference model
  • σ is the logistic sigmoid function
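The formula can be transcribed term by term for a single preference pair. This is a sketch under the assumption that the four log-probabilities have already been computed as scalars; the function name is illustrative, not part of any library.

```python
import math

def dpo_pair_loss(policy_w, policy_l, ref_w, ref_l, beta=0.1):
    """DPO loss for one preference pair.

    Inputs are log-probabilities: policy_w = log pi_theta(y_w | x),
    ref_w = log pi_ref(y_w | x), and likewise for the rejected response y_l.
    """
    # beta * (log-ratio of chosen minus log-ratio of rejected)
    logits = beta * ((policy_w - ref_w) - (policy_l - ref_l))
    # -log sigmoid(logits), written as log(1 + exp(-logits))
    return math.log1p(math.exp(-logits))
```

A larger β amplifies the log-ratio margin inside the sigmoid, so the same preference gap produces a stronger training signal and, equivalently, a tighter effective constraint on how far the policy may drift from the reference.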

Related Pages

Implemented By
