Principle: Axolotl Reference Model Setup
| Knowledge Sources | |
|---|---|
| Domains | Alignment, Reinforcement_Learning, Model_Loading |
| Last Updated | 2026-02-06 23:00 GMT |
Overview
A model management pattern that maintains a frozen copy of the pre-trained policy to compute KL-divergence regularization during preference-based alignment training.
Description
In Direct Preference Optimization (DPO) and related alignment methods, a reference model is needed to prevent the policy model from diverging too far from its pre-trained behavior. The DPO loss function computes log-probability ratios between the policy model and the reference model, providing implicit KL-divergence regularization.
The reference model setup involves deciding whether to load a separate model copy or use TRL's auto-unwrap feature (which shares the base model weights when using LoRA adapters). The auto-unwrap approach saves significant GPU memory by avoiding a full model duplication.
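As an illustrative sketch of the auto-unwrap approach, a LoRA-based DPO run in Axolotl might be configured as below. Key names follow recent Axolotl releases and may differ in your version; the base model and dataset are placeholder examples. Because a LoRA adapter is used, no separate reference model is specified, letting TRL reuse the base weights (with adapters disabled) as the reference:

```yaml
# Illustrative Axolotl config sketch for DPO with LoRA.
# Keys are assumptions; verify against your Axolotl version's documentation.
base_model: NousResearch/Llama-2-7b-hf  # example base model
rl: dpo                                 # select DPO preference training
adapter: lora                           # LoRA adapter -> base weights can serve as reference
lora_r: 16
lora_alpha: 32
lora_dropout: 0.05
lora_target_modules:
  - q_proj
  - v_proj
datasets:
  - path: Intel/orca_dpo_pairs          # example preference dataset
    type: chatml.intel
```

With a full fine-tune (no adapter), this memory saving is unavailable and a separate frozen copy of the model must be held for the reference.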
Usage
Use reference model setup when:
- Training with DPO, IPO, or KTO (these methods require a reference policy)
- Skip it for ORPO and SimPO (these methods are reference-free)
- When training with LoRA, TRL can auto-unwrap the base model as the reference (no separate copy needed)
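The decision above can be sketched as a small lookup. The helper name `needs_reference_model` is hypothetical (not a TRL or Axolotl API), and the method sets mirror the list above:

```python
# Hypothetical helper (not a TRL/Axolotl API): which alignment methods
# require a frozen reference policy.
REFERENCE_BASED = {"dpo", "ipo", "kto"}   # compute log-ratios vs. a reference
REFERENCE_FREE = {"orpo", "simpo"}        # no reference model needed

def needs_reference_model(method: str) -> bool:
    """Return True if the alignment method requires a reference policy."""
    m = method.lower()
    if m in REFERENCE_BASED:
        return True
    if m in REFERENCE_FREE:
        return False
    raise ValueError(f"Unknown alignment method: {method!r}")

print(needs_reference_model("DPO"))   # -> True
print(needs_reference_model("ORPO"))  # -> False
```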
Theoretical Basis
The DPO objective includes an implicit KL penalty via the reference model:

$$\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right]$$

where $\pi_{\text{ref}}$ is the frozen reference model and $\pi_\theta$ is the trainable policy.
Reference model strategies:
- Separate model: Load a full copy (doubles memory usage)
- Auto-unwrap (LoRA): TRL uses base model weights as reference (no extra memory)
- None (ORPO): Reference-free methods skip this entirely
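To make the reference model's role concrete, here is a minimal numeric sketch of the DPO loss for a single preference pair in pure Python (no TRL). The sequence log-probabilities are assumed example values, not outputs of any real model:

```python
import math

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair: -log sigmoid(beta * margin),
    where the margin is the difference of policy-vs-reference log-ratios."""
    logratio_chosen = policy_logp_chosen - ref_logp_chosen
    logratio_rejected = policy_logp_rejected - ref_logp_rejected
    margin = beta * (logratio_chosen - logratio_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)

# Assumed example log-probs: relative to the reference, the policy favors the
# chosen answer and disfavors the rejected one, so the loss is below log(2).
loss = dpo_loss(policy_logp_chosen=-12.0, policy_logp_rejected=-15.0,
                ref_logp_chosen=-13.0, ref_logp_rejected=-14.0)
print(round(loss, 4))  # -> 0.5981
```

Note that only the log-ratios matter: if the policy and reference assign identical log-probabilities, the margin is zero and the loss sits at log 2, which is why the reference model anchors the policy to its pre-trained behavior.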