Principle: Axolotl DPO Training Execution
| Knowledge Sources | |
|---|---|
| Domains | Alignment, Reinforcement_Learning, Training |
| Last Updated | 2026-02-06 23:00 GMT |
Overview
A training execution pattern that optimizes a language model to align with human preferences using paired chosen/rejected responses without explicit reward modeling.
Description
Direct Preference Optimization (DPO) training bypasses the traditional RLHF pipeline (reward model + PPO) by directly optimizing the policy model on preference pairs. The DPO loss function implicitly defines a reward through the log-probability ratio between the policy and reference models.
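In the DPO formulation, that implicit reward is the scaled log-probability ratio between the policy and the reference model (up to a prompt-only term that cancels when a chosen/rejected pair is compared):

$$\hat{r}_\theta(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}$$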
In Axolotl, DPO training is handled by HFRLTrainerBuilder which constructs an AxolotlDPOTrainer (extending TRL's DPOTrainer). The builder configures DPO-specific training arguments via DPOStrategy, which sets the loss type (DPO/IPO), label smoothing, max length, and evaluation generation settings.
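For orientation, here is a minimal sketch of the underlying TRL API that AxolotlDPOTrainer extends. The checkpoint, dataset, and hyperparameter values are illustrative assumptions rather than Axolotl defaults, and the keyword for passing the tokenizer (tokenizer vs. processing_class) depends on the TRL version.

```python
# Minimal sketch using TRL directly (AxolotlDPOTrainer extends trl.DPOTrainer).
# Checkpoint, dataset, and hyperparameter values below are illustrative assumptions.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_name = "Qwen/Qwen2.5-0.5B-Instruct"  # assumed example checkpoint
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# A preference dataset with "prompt", "chosen", and "rejected" columns.
train_dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

args = DPOConfig(
    output_dir="dpo-output",
    beta=0.1,                 # strength of the implicit KL penalty
    loss_type="sigmoid",      # "sigmoid" = standard DPO; "ipo" selects the IPO loss
    label_smoothing=0.0,      # soft labels for noisy preference data
    max_length=1024,          # truncation length for prompt + response
    generate_during_eval=False,
)

trainer = DPOTrainer(
    model=model,                 # ref_model defaults to a frozen copy of the policy
    args=args,
    train_dataset=train_dataset,
    processing_class=tokenizer,  # older TRL versions use tokenizer= instead
)
trainer.train()
```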
Axolotl supports multiple DPO variants: standard DPO, IPO (which uses a different loss function), SimPO (reference-free), and ORPO (odds ratio).
Usage
Use DPO training execution when:
- You are aligning a model with human preferences
- You have paired chosen/rejected response data (see the example record after this list)
- You want a simpler approach than full RLHF (reward model + PPO)
- The model has already been instruction-tuned (SFT) and needs preference alignment
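For reference, a single preference record in the common prompt/chosen/rejected layout might look like the following; the field names follow the convention TRL-style trainers expect, and the text itself is made up.

```python
# One hypothetical preference pair in the prompt/chosen/rejected layout.
preference_example = {
    "prompt": "Explain what the reference model does in DPO training.",
    "chosen": (
        "The reference model is a frozen copy of the starting policy; the DPO loss "
        "penalizes the trained policy for drifting too far from it."
    ),
    "rejected": "It's the same model, so nothing really changes during training.",
}
```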
Theoretical Basis
DPO Loss:
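For a preference pair with prompt $x$, chosen response $y_w$, and rejected response $y_l$, the objective from the DPO paper (Rafailov et al., 2023) is:

$$
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right]
$$

where $\sigma$ is the logistic function and $\beta$ scales the implicit KL penalty toward the reference model.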
IPO Loss (alternative):
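IPO replaces the logistic loss with a squared regression toward a fixed margin. Writing $h_\theta(x, y_w, y_l) = \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}$ for the log-ratio difference, and using $\beta$ for the regularization parameter (called $\tau$ in the IPO paper), the loss can be written as:

$$
\mathcal{L}_{\mathrm{IPO}} = \mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[ \left( h_\theta(x, y_w, y_l) - \frac{1}{2\beta} \right)^2 \right]
$$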
Key hyperparameters:
- beta: Controls the strength of the KL penalty toward the reference model; lower values allow more divergence (see the loss sketch after this list)
- label_smoothing: Applies soft labels to the preference loss to tolerate noisy preference annotations
- generate_during_eval: Generate completions during evaluation for qualitative assessment
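To make the roles of beta and label_smoothing concrete, here is a small illustrative sketch of a sigmoid-style DPO loss over per-pair sequence log-probabilities; it mirrors the conservative-DPO way of smoothing labels, but it is a sketch, not Axolotl's or TRL's actual code.

```python
import torch
import torch.nn.functional as F

def dpo_sigmoid_loss(
    policy_chosen_logps: torch.Tensor,    # log pi_theta(y_w | x), summed over tokens
    policy_rejected_logps: torch.Tensor,  # log pi_theta(y_l | x)
    ref_chosen_logps: torch.Tensor,       # log pi_ref(y_w | x)
    ref_rejected_logps: torch.Tensor,     # log pi_ref(y_l | x)
    beta: float = 0.1,
    label_smoothing: float = 0.0,
) -> torch.Tensor:
    """Illustrative sigmoid DPO loss showing where beta and label_smoothing enter."""
    chosen_logratios = policy_chosen_logps - ref_chosen_logps
    rejected_logratios = policy_rejected_logps - ref_rejected_logps
    margin = chosen_logratios - rejected_logratios  # implicit reward margin / beta

    # label_smoothing = 0 recovers the standard DPO loss; a value in (0, 0.5) treats
    # each preference label as wrong with that probability (conservative DPO).
    losses = (
        -F.logsigmoid(beta * margin) * (1.0 - label_smoothing)
        - F.logsigmoid(-beta * margin) * label_smoothing
    )
    return losses.mean()
```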