Principle:Hpcaitech ColossalAI DPO Training
| Knowledge Sources | |
|---|---|
| Domains | NLP, Reinforcement_Learning |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
A preference alignment algorithm that directly optimizes a language model's policy using human preference data without requiring a separate reward model.
Description
Direct Preference Optimization (DPO) reformulates the RLHF objective as a classification problem on preference pairs. Instead of training a reward model and then optimizing against it with RL (as in PPO), DPO directly increases the probability of chosen responses relative to rejected responses, with an implicit KL-divergence penalty against a frozen reference model that keeps the policy from drifting too far from its initial distribution.
The training loop requires maintaining two models: a trainable policy model and a frozen reference model (typically a copy of the initial policy). For each batch, both models compute log probabilities on chosen and rejected sequences, and the DPO loss is computed from the difference in log probability ratios.
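The per-example loss computed from those four log probabilities can be sketched in plain Python. This is a minimal illustration under stated assumptions, not ColossalAI's actual batched, tensor-based implementation; the function and argument names here are hypothetical:

```python
import math

def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Per-example DPO loss from sequence log probabilities.

    Each argument is the total log probability a model assigns to a
    full response sequence. A real trainer computes these in batches
    with the reference model under no-grad; this sketch shows only
    the loss arithmetic.
    """
    # Log-ratio of policy vs. frozen reference, for each response
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp

    # DPO margin, scaled by beta, pushed through -log(sigmoid(.))
    margin = beta * (chosen_logratio - rejected_logratio)
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the policy still matches the reference model, both log-ratios are zero and the loss is `log 2`; the loss falls as the policy raises the chosen response's probability relative to the rejected one.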
Usage
Use DPO when you have human preference data (chosen/rejected pairs) and want to align a model without the complexity of training a separate reward model. DPO is simpler and more stable than PPO-based RLHF.
Theoretical Basis
The DPO loss function:

$$\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]$$

Where:
- $\pi_\theta$ is the policy model being trained
- $\pi_{\mathrm{ref}}$ is the frozen reference model
- $y_w$ is the chosen (winning) response
- $y_l$ is the rejected (losing) response
- $\beta$ is the temperature parameter controlling deviation from the reference model