Principle: DPO Model Loading (LLMBook-zh.github.io)
| Knowledge Sources | |
|---|---|
| Domains | NLP, Alignment |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
The process of loading both a trainable policy model and a frozen reference model from the same checkpoint for Direct Preference Optimization training.
Description
DPO Model Loading requires two copies of the same pre-trained model: a policy model (trainable) that will be optimized, and a reference model (frozen) that provides the baseline log-probabilities for the DPO loss. The reference model is set to evaluation mode and all its parameters are frozen (requires_grad=False).
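A minimal sketch of this setup in PyTorch, using a toy module as a stand-in for the pre-trained checkpoint (in practice both copies would be loaded from the same checkpoint, e.g. via a library loader such as Hugging Face's `from_pretrained`):

```python
import copy

import torch
import torch.nn as nn

# Toy stand-in for a pre-trained language model; in real DPO training,
# the same checkpoint would be loaded twice instead of deep-copied.
policy_model = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 4))

# Reference model: starts from weights identical to the policy model.
ref_model = copy.deepcopy(policy_model)

# Freeze the reference model and switch it to evaluation mode.
ref_model.eval()
for param in ref_model.parameters():
    param.requires_grad = False

# Only the policy model's parameters are handed to the optimizer,
# so gradient updates can never touch the reference model.
optimizer = torch.optim.AdamW(policy_model.parameters(), lr=1e-5)
```

Deep-copying (or reloading) before freezing guarantees the two models share no parameter storage, so updating the policy cannot silently mutate the reference.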
Usage
Use this when setting up DPO training. Both models must start from identical weights; the policy model diverges during training while the reference model stays fixed.
Theoretical Basis
DPO's loss compares the policy and reference model log-probabilities of the preferred and dispreferred responses, which is why both models must be available at every training step.
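In the standard formulation, the DPO objective over a preference dataset $\mathcal{D}$ of prompts $x$ with chosen/rejected responses $(y_w, y_l)$ is:

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
= -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[
\log \sigma\!\left(
\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
- \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
\right)\right]
```

Here $\pi_\theta$ is the trainable policy, $\pi_{\mathrm{ref}}$ the frozen reference, $\beta$ the strength of the implicit KL constraint, and $\sigma$ the logistic sigmoid. The reference model appears only inside the log-ratios, which is why it never needs gradients.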
The reference model must remain unchanged during training.
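The loss above can be sketched in PyTorch from precomputed per-sequence log-probabilities. The tensor values below are hypothetical placeholders, and `beta = 0.1` is an assumed hyperparameter, not a value prescribed by the source:

```python
import torch
import torch.nn.functional as F

beta = 0.1  # assumed KL-penalty strength (hyperparameter)

# Hypothetical per-sequence log-probabilities (token log-probs summed over
# each response) for a batch of two (chosen, rejected) preference pairs.
policy_chosen_logps = torch.tensor([-12.0, -9.5])
policy_rejected_logps = torch.tensor([-14.0, -11.0])
ref_chosen_logps = torch.tensor([-12.5, -10.0])     # from the frozen model
ref_rejected_logps = torch.tensor([-13.0, -10.5])   # from the frozen model

# Log-ratios of policy vs. reference for each response.
chosen_logratio = policy_chosen_logps - ref_chosen_logps
rejected_logratio = policy_rejected_logps - ref_rejected_logps

# DPO loss: -log sigmoid(beta * (chosen log-ratio - rejected log-ratio)).
loss = -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```

In a full training loop the reference log-probabilities would be computed under `torch.no_grad()`, so gradients flow only through the policy terms.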