Principle: Axolotl Reference Model Setup
| Knowledge Sources | |
|---|---|
| Domains | Alignment, Reinforcement_Learning, Model_Loading |
| Last Updated | 2026-02-06 23:00 GMT |
Overview
A model management pattern that maintains a frozen copy of the pre-trained policy to compute KL-divergence regularization during preference-based alignment training.
Description
In Direct Preference Optimization (DPO) and related alignment methods, a reference model is needed to prevent the policy model from diverging too far from its pre-trained behavior. The DPO loss function computes log-probability ratios between the policy model and the reference model, providing implicit KL-divergence regularization.
The reference model setup involves deciding whether to load a separate model copy or use TRL's auto-unwrap feature (which shares the base model weights when using LoRA adapters). The auto-unwrap approach saves significant GPU memory by avoiding a full model duplication.
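As an illustrative sketch of the auto-unwrap approach, a LoRA-based DPO run in Axolotl might be configured as below. Key names follow recent Axolotl releases and may differ in your version; the base model and dataset are placeholder examples. Because a LoRA adapter is used, no separate reference model is specified, letting TRL reuse the base weights (with adapters disabled) as the reference:

```yaml
# Illustrative Axolotl config sketch for DPO with LoRA.
# Keys are assumptions; verify against your Axolotl version's documentation.
base_model: NousResearch/Llama-2-7b-hf  # example base model
rl: dpo                                 # select DPO preference training
adapter: lora                           # LoRA adapter -> base weights can serve as reference
lora_r: 16
lora_alpha: 32
lora_dropout: 0.05
lora_target_modules:
  - q_proj
  - v_proj
datasets:
  - path: Intel/orca_dpo_pairs          # example preference dataset
    type: chatml.intel
```

With a full fine-tune (no adapter), this memory saving is unavailable and a separate frozen copy of the model must be held for the reference.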
Usage
Use reference model setup when:
- Training with DPO, IPO, or KTO (these methods require a reference policy)
- Skip it for ORPO and SimPO (these methods are reference-free)
- When training with LoRA, TRL can auto-unwrap the base model as the reference (no separate copy needed)
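The decision above can be sketched as a small lookup. The helper name `needs_reference_model` is hypothetical (not a TRL or Axolotl API), and the method sets mirror the list above:

```python
# Hypothetical helper (not a TRL/Axolotl API): which alignment methods
# require a frozen reference policy.
REFERENCE_BASED = {"dpo", "ipo", "kto"}   # compute log-ratios vs. a reference
REFERENCE_FREE = {"orpo", "simpo"}        # no reference model needed

def needs_reference_model(method: str) -> bool:
    """Return True if the alignment method requires a reference policy."""
    m = method.lower()
    if m in REFERENCE_BASED:
        return True
    if m in REFERENCE_FREE:
        return False
    raise ValueError(f"Unknown alignment method: {method!r}")

print(needs_reference_model("DPO"))   # -> True
print(needs_reference_model("ORPO"))  # -> False
```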
Theoretical Basis
The DPO objective includes an implicit KL penalty via the reference model:

$$\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right]$$

where $\pi_{\text{ref}}$ is the frozen reference model and $\pi_\theta$ is the trainable policy.
Reference model strategies:
- Separate model: Load a full copy (doubles memory usage)
- Auto-unwrap (LoRA): TRL uses base model weights as reference (no extra memory)
- None (ORPO): Reference-free methods skip this entirely
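To make the reference model's role concrete, here is a minimal numeric sketch of the DPO loss for a single preference pair in pure Python (no TRL). The sequence log-probabilities are assumed example values, not outputs of any real model:

```python
import math

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair: -log sigmoid(beta * margin),
    where the margin is the difference of policy-vs-reference log-ratios."""
    logratio_chosen = policy_logp_chosen - ref_logp_chosen
    logratio_rejected = policy_logp_rejected - ref_logp_rejected
    margin = beta * (logratio_chosen - logratio_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)

# Assumed example log-probs: relative to the reference, the policy favors the
# chosen answer and disfavors the rejected one, so the loss is below log(2).
loss = dpo_loss(policy_logp_chosen=-12.0, policy_logp_rejected=-15.0,
                ref_logp_chosen=-13.0, ref_logp_rejected=-14.0)
print(round(loss, 4))  # -> 0.5981
```

Note that only the log-ratios matter: if the policy and reference assign identical log-probabilities, the margin is zero and the loss sits at log 2, which is why the reference model anchors the policy to its pre-trained behavior.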