
Principle:Axolotl ai cloud Axolotl DPO Training Execution

From Leeroopedia


Knowledge Sources
Domains Alignment, Reinforcement_Learning, Training
Last Updated 2026-02-06 23:00 GMT

Overview

A training execution pattern that optimizes a language model to align with human preferences using paired chosen/rejected responses without explicit reward modeling.

Description

Direct Preference Optimization (DPO) training bypasses the traditional RLHF pipeline (reward model + PPO) by directly optimizing the policy model on preference pairs. The DPO loss function implicitly defines a reward through the log-probability ratio between the policy and reference models.

In Axolotl, DPO training is handled by HFRLTrainerBuilder which constructs an AxolotlDPOTrainer (extending TRL's DPOTrainer). The builder configures DPO-specific training arguments via DPOStrategy, which sets the loss type (DPO/IPO), label smoothing, max length, and evaluation generation settings.

Axolotl supports multiple DPO variants: standard DPO, IPO (a squared-error loss intended to reduce overfitting to preferences), SimPO (reference-model-free), and ORPO (odds-ratio preference optimization).
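As a concrete illustration, selecting DPO in Axolotl comes down to a few config keys. The sketch below is a hypothetical minimal config, not a verified one: key names (e.g. `rl_beta` vs. `dpo_beta`) and dataset `type` values vary across Axolotl versions, so check them against the documentation for your installed release.

```yaml
# Hypothetical minimal Axolotl DPO config; verify key names for your version.
base_model: meta-llama/Llama-3.1-8B-Instruct  # assumed instruction-tuned base

rl: dpo                 # switches the trainer to DPO (ipo/orpo/simpo select variants)

datasets:
  - path: my/preference-data      # placeholder preference-pair dataset
    type: chat_template.default   # assumed format name; depends on your data

rl_beta: 0.1            # KL-penalty strength (older versions may use dpo_beta)
micro_batch_size: 2
learning_rate: 5.0e-7
num_epochs: 1
```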

Usage

Use DPO training execution when:

  • Aligning a model with human preferences
  • Having paired chosen/rejected response data
  • Preferring a simpler approach than full RLHF (reward model + PPO)
  • The model has already been instruction-tuned (SFT) and needs alignment
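Preference data for these use cases is typically stored as one chosen/rejected pair per prompt. A minimal JSON-lines sketch follows; the field names are the common convention for preference datasets, not a guaranteed Axolotl schema, so map them to your configured dataset `type`:

```json
{"prompt": "Explain DPO in one sentence.", "chosen": "DPO optimizes the policy directly on preference pairs, skipping an explicit reward model.", "rejected": "DPO is a kind of database."}
```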

Theoretical Basis

DPO Loss: $\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\mathbb{E}_{(x, y_w, y_l)}\left[\log \sigma\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]$

IPO Loss (alternative): $\mathcal{L}_{\mathrm{IPO}} = \left(\log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} - \frac{1}{2\beta}\right)^2$
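To make the DPO loss concrete, here is a minimal sketch computing it for a single preference pair from summed sequence log-probabilities. This is an illustrative reimplementation, not Axolotl's actual code path (which delegates to TRL's DPOTrainer):

```python
import math

def dpo_loss(logp_w_policy, logp_w_ref, logp_l_policy, logp_l_ref, beta=0.1):
    """DPO loss for one (chosen, rejected) pair.

    Inputs are summed token log-probs of each response under the policy
    and the frozen reference model.
    """
    # Implicit rewards are the beta-scaled policy/reference log-ratios;
    # the margin is chosen reward minus rejected reward.
    margin = beta * ((logp_w_policy - logp_w_ref) - (logp_l_policy - logp_l_ref))
    # -log(sigmoid(margin)), written in a numerically stable softplus form.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When policy and reference agree exactly, the margin is zero and the loss is log 2; as the policy assigns relatively more probability to the chosen response than the reference does, the margin grows and the loss falls toward zero.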

Key hyperparameters:

  • beta: Controls the strength of the implicit KL penalty against the reference model. Lower values allow the policy to diverge further from the reference
  • label_smoothing: Applies soft labels to the preference pairs, making training robust to noisy or mislabeled annotations
  • generate_during_eval: Generates completions during evaluation for qualitative assessment of the policy
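The label_smoothing parameter can be read as the assumed probability that a preference pair is mislabeled. To the best of my knowledge, TRL's formulation (which Axolotl inherits) is the conservative-DPO loss sketched below; treat the exact form as an assumption to verify against the TRL source:

```python
import math

def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)).
    if x > 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def cdpo_loss(margin, label_smoothing=0.0):
    """Conservative DPO loss with soft preference labels.

    `margin` is beta times the policy/reference log-ratio of the chosen
    response minus that of the rejected response.
    """
    return (-(1.0 - label_smoothing) * log_sigmoid(margin)
            - label_smoothing * log_sigmoid(-margin))
```

With label_smoothing = 0 this reduces to the standard DPO loss; with label_smoothing > 0 the loss no longer vanishes as the margin grows, which keeps the model from pushing confidently on pairs that may be annotation noise.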

Related Pages

Implemented By

Uses Heuristic
