
Principle: axolotl-ai-cloud Axolotl Reference Model Setup

From Leeroopedia


Knowledge Sources
Domains Alignment, Reinforcement_Learning, Model_Loading
Last Updated 2026-02-06 23:00 GMT

Overview

A model management pattern that maintains a frozen copy of the pre-training policy to compute KL-divergence regularization during preference-based alignment training.

Description

In Direct Preference Optimization (DPO) and related alignment methods, a reference model is needed to prevent the policy model from diverging too far from its pre-trained behavior. The DPO loss function computes log-probability ratios between the policy model and the reference model, providing implicit KL-divergence regularization.

The reference model setup involves deciding whether to load a separate model copy or use TRL's auto-unwrap feature (which shares the base model weights when using LoRA adapters). The auto-unwrap approach saves significant GPU memory by avoiding a full model duplication.
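In Axolotl this choice is expressed in the training config rather than in code. A minimal sketch of a DPO-with-LoRA config follows; the exact key names (`rl`, `rl_beta`) and the model name are assumptions based on typical Axolotl DPO configs, so check the documentation for your installed version:

```yaml
base_model: meta-llama/Llama-3.1-8B   # hypothetical base model
adapter: lora      # with a LoRA adapter, TRL can reuse the frozen base weights as the reference
rl: dpo            # preference-optimization mode; no separate reference-model entry is needed
rl_beta: 0.1       # KL-regularization strength (the beta in the DPO loss)
```

Because the adapter is the only trainable component, disabling it recovers the original pre-trained policy, which is exactly what the auto-unwrap reference strategy exploits.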

Usage

Use a reference model setup when:

  • Training with DPO, IPO, or KTO, all of which require a reference policy
  • Training with LoRA adapters, where TRL can auto-unwrap the base model so no separate copy is needed

Skip it when:

  • Training with ORPO or SimPO, which are reference-free methods

Theoretical Basis

The DPO objective includes an implicit KL penalty via the reference model:

$$\mathcal{L}_{\text{DPO}} = -\log \sigma\left(\beta\left[\log\frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \log\frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right]\right)$$

Where πref is the frozen reference model and πθ is the trainable policy.
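The per-example loss can be computed directly from the summed log-probabilities of the chosen ($y_w$) and rejected ($y_l$) responses under the two models. A minimal sketch (function and argument names are illustrative, not from any library):

```python
import math

def dpo_loss(policy_logp_w: float, policy_logp_l: float,
             ref_logp_w: float, ref_logp_l: float,
             beta: float = 0.1) -> float:
    """Per-example DPO loss from summed response log-probabilities.

    The reference model only contributes the two frozen log-probs; it
    receives no gradient, which is why it can be a shared frozen copy.
    """
    # Log-ratios measure how far the policy has drifted from the reference.
    chosen_ratio = policy_logp_w - ref_logp_w
    rejected_ratio = policy_logp_l - ref_logp_l
    margin = beta * (chosen_ratio - rejected_ratio)
    # -log(sigmoid(margin)) == softplus(-margin), written in a
    # numerically stable form to avoid overflow for large |margin|.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

When the policy still matches the reference, both ratios are zero and the loss is $\log 2 \approx 0.693$; widening the chosen/rejected margin relative to the reference drives it toward zero.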

Reference model strategies:

  • Separate model: Load a full copy (doubles memory usage)
  • Auto-unwrap (LoRA): TRL uses base model weights as reference (no extra memory)
  • None (ORPO): Reference-free methods skip this entirely

Related Pages

Implemented By
