
Principle:LLMBook zh LLMBook zh github io DPO Model Loading

From Leeroopedia


Knowledge Sources
Domains: NLP, Alignment
Last Updated: 2026-02-08 00:00 GMT

Overview

DPO Model Loading is the process of loading both a trainable policy model and a frozen reference model from the same checkpoint for Direct Preference Optimization (DPO) training.

Description

DPO Model Loading requires two copies of the same pre-trained model: a policy model (trainable) that will be optimized, and a reference model (frozen) that provides the baseline log-probabilities for the DPO loss. The reference model is set to evaluation mode and all its parameters are frozen (requires_grad=False).
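The setup above can be sketched in PyTorch. This is a minimal illustration, not a specific library's API: `load_policy_and_reference` and the `nn.Linear` stand-in for a pretrained checkpoint are hypothetical; in practice both copies would come from the same `from_pretrained`-style checkpoint load.

```python
import copy
import torch.nn as nn

def load_policy_and_reference(build_model):
    """Create two identical copies of the same model: a trainable
    policy and a frozen reference for DPO training."""
    policy = build_model()             # trainable copy, will be optimized
    reference = copy.deepcopy(policy)  # starts from identical weights
    reference.eval()                   # evaluation mode (disables dropout etc.)
    for p in reference.parameters():
        p.requires_grad = False        # freeze: no gradients for the reference
    return policy, reference

# Toy stand-in for loading a pretrained checkpoint (hypothetical).
policy, reference = load_policy_and_reference(lambda: nn.Linear(4, 4))
```

Using `deepcopy` of an already-loaded model avoids reading the checkpoint twice; loading the checkpoint a second time into a separate model object works equally well.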

Usage

Use this when setting up DPO training. Both models must start from identical weights; the policy model diverges during training while the reference model stays fixed.

Theoretical Basis

DPO requires comparing the policy and reference model log-probabilities on the preferred response $y_w$ and the rejected response $y_l$:

$$\mathcal{L}_{\text{DPO}} = -\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)$$

The reference model πref must remain unchanged during training.
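The loss can be computed directly from the summed log-probabilities of the chosen and rejected responses under both models. The sketch below is a plain-Python illustration of the formula; the function name and $\beta$ default are assumptions, not a specific library's API.

```python
import math

def dpo_loss(policy_logp_w, ref_logp_w, policy_logp_l, ref_logp_l, beta=0.1):
    """DPO loss for one preference pair, given summed log-probabilities
    of the chosen (w) and rejected (l) responses under both models."""
    # beta * (log-ratio on chosen minus log-ratio on rejected)
    margin = beta * ((policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l))
    # -log sigmoid(margin)
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# At initialization the policy equals the reference, so both log-ratios
# are zero and the loss is -log sigmoid(0) = log 2.
loss_at_init = dpo_loss(-10.0, -10.0, -12.0, -12.0)  # ≈ 0.6931
```

Because the reference log-probabilities enter only as constants inside the margin, they can be precomputed once per batch with gradients disabled.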

Related Pages

Implemented By
