Principle: DPO Model Loading (LLMBook-zh.github.io)
| Knowledge Sources | |
|---|---|
| Domains | NLP, Alignment |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
The process of loading both a trainable policy model and a frozen reference model from the same checkpoint for Direct Preference Optimization training.
Description
DPO Model Loading requires two copies of the same pre-trained model: a policy model (trainable) that will be optimized, and a reference model (frozen) that provides the baseline log-probabilities for the DPO loss. The reference model is set to evaluation mode and all its parameters are frozen (requires_grad=False).
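A minimal sketch of this setup in PyTorch, using a toy module as a stand-in for the pre-trained checkpoint (in practice both copies would be loaded from the same checkpoint, e.g. via a library loader such as Hugging Face's `from_pretrained`):

```python
import copy

import torch
import torch.nn as nn

# Toy stand-in for a pre-trained language model; in real DPO training,
# the same checkpoint would be loaded twice instead of deep-copied.
policy_model = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 4))

# Reference model: starts from weights identical to the policy model.
ref_model = copy.deepcopy(policy_model)

# Freeze the reference model and switch it to evaluation mode.
ref_model.eval()
for param in ref_model.parameters():
    param.requires_grad = False

# Only the policy model's parameters are handed to the optimizer,
# so gradient updates can never touch the reference model.
optimizer = torch.optim.AdamW(policy_model.parameters(), lr=1e-5)
```

Deep-copying (or reloading) before freezing guarantees the two models share no parameter storage, so updating the policy cannot silently mutate the reference.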
Usage
Use this when setting up DPO training. Both models must start from identical weights; the policy model diverges during training while the reference model stays fixed.
Theoretical Basis
DPO's loss compares the policy and reference model log-probabilities of the preferred and dispreferred responses, which is why both models must be available at every training step.
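In the standard formulation, the DPO objective over a preference dataset $\mathcal{D}$ of prompts $x$ with chosen/rejected responses $(y_w, y_l)$ is:

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
= -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[
\log \sigma\!\left(
\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
- \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
\right)\right]
```

Here $\pi_\theta$ is the trainable policy, $\pi_{\mathrm{ref}}$ the frozen reference, $\beta$ the strength of the implicit KL constraint, and $\sigma$ the logistic sigmoid. The reference model appears only inside the log-ratios, which is why it never needs gradients.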
The reference model must remain unchanged during training.
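The loss above can be sketched in PyTorch from precomputed per-sequence log-probabilities. The tensor values below are hypothetical placeholders, and `beta = 0.1` is an assumed hyperparameter, not a value prescribed by the source:

```python
import torch
import torch.nn.functional as F

beta = 0.1  # assumed KL-penalty strength (hyperparameter)

# Hypothetical per-sequence log-probabilities (token log-probs summed over
# each response) for a batch of two (chosen, rejected) preference pairs.
policy_chosen_logps = torch.tensor([-12.0, -9.5])
policy_rejected_logps = torch.tensor([-14.0, -11.0])
ref_chosen_logps = torch.tensor([-12.5, -10.0])     # from the frozen model
ref_rejected_logps = torch.tensor([-13.0, -10.5])   # from the frozen model

# Log-ratios of policy vs. reference for each response.
chosen_logratio = policy_chosen_logps - ref_chosen_logps
rejected_logratio = policy_rejected_logps - ref_rejected_logps

# DPO loss: -log sigmoid(beta * (chosen log-ratio - rejected log-ratio)).
loss = -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```

In a full training loop the reference log-probabilities would be computed under `torch.no_grad()`, so gradients flow only through the policy terms.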