
Principle: Eric Mitchell Direct Preference Optimization SFT Checkpoint Loading

From Leeroopedia


Knowledge Sources
Domains Transfer_Learning, Checkpointing, Deep_Learning
Last Updated 2026-02-08 02:00 GMT

Overview

A model initialization technique that loads pre-trained SFT weights into both the policy and reference models as the starting point for DPO training.

Description

SFT checkpoint loading bridges the two-stage training pipeline by loading the weights produced during SFT training into the models used for DPO training. Both the policy model (which will be further trained) and the reference model (which remains frozen) are initialized from the same SFT checkpoint. This ensures:

  • The policy starts from an in-distribution model that has learned to follow the prompt format
  • The reference model captures the SFT baseline for computing KL-divergence penalties
  • Both models share identical initial weights, so the policy–reference log-ratios (and hence the KL penalty) start at zero
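The initialization described above can be sketched in PyTorch. This is an illustrative assumption, not the actual DPO codebase: `init_from_sft` is a hypothetical helper, and a tiny `nn.Linear` stands in for the real language model.

```python
import copy
import torch
import torch.nn as nn

def init_from_sft(sft_state: dict):
    """Initialize policy and reference models from the same SFT state dict."""
    policy = nn.Linear(4, 4)     # stand-in for the real LM architecture
    reference = nn.Linear(4, 4)
    policy.load_state_dict(sft_state)
    reference.load_state_dict(sft_state)
    # The reference model stays frozen for the whole DPO run
    for p in reference.parameters():
        p.requires_grad_(False)
    reference.eval()
    return policy, reference

sft_model = nn.Linear(4, 4)  # pretend this is the trained SFT checkpoint
policy, reference = init_from_sft(copy.deepcopy(sft_model.state_dict()))
```

After this step the two models are bitwise-identical, and only the policy accumulates gradients.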

Usage

Use this principle when transitioning from SFT to DPO training. The checkpoint path is specified via config.model.archive. If archive is None, this step is skipped (useful for running DPO directly from pre-trained weights).
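The skip-if-None behavior can be sketched as follows; `maybe_load_archive` is a hypothetical helper whose `archive` argument mirrors `config.model.archive`, not a function from the actual codebase.

```python
from typing import Optional

import torch
import torch.nn as nn

def maybe_load_archive(model: nn.Module, archive: Optional[str]) -> bool:
    """Load an SFT checkpoint into `model` if a path is given; otherwise skip."""
    if archive is None:
        return False  # no archive: run DPO directly from pre-trained weights
    state = torch.load(archive, map_location="cpu")
    model.load_state_dict(state)
    return True
```

The boolean return value makes it easy to log whether the SFT checkpoint was applied.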

Theoretical Basis

In the DPO framework, the reference policy πref represents the model we want the trained policy to stay close to. Initializing both from SFT weights means:

π_θ^(0) = π_ref = π_SFT

At the start of DPO training, D_KL[π_θ ‖ π_ref] = 0, and the loss function pulls the policy toward the preference data while the KL penalty keeps it from diverging too far from the SFT baseline.
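The zero-divergence starting point can be checked numerically. Under the assumption that policy and reference are identical at step 0, every log-ratio log π_θ(y|x) − log π_ref(y|x) vanishes, so the KL term is zero and each per-pair DPO loss equals −log σ(0) = log 2. The log-probability values and β below are illustrative.

```python
import math
import torch
import torch.nn.functional as F

beta = 0.1  # illustrative KL-strength coefficient
# Log-probs of chosen/rejected completions; at step 0 the policy and
# reference assign identical values, so every log-ratio is exactly zero.
logp_chosen = torch.tensor([-12.3, -8.7])
logp_rejected = torch.tensor([-14.1, -9.9])
ratio_chosen = logp_chosen - logp_chosen        # policy == reference -> 0
ratio_rejected = logp_rejected - logp_rejected  # policy == reference -> 0
loss = -F.logsigmoid(beta * (ratio_chosen - ratio_rejected)).mean()
```

The loss then decreases as training moves the policy's implicit reward margin above zero.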

Pseudo-code:

# Abstract checkpoint loading (NOT actual implementation)
checkpoint = load("policy.pt")
policy.load_weights(checkpoint.state)
reference.load_weights(checkpoint.state)
reference.freeze()  # the reference model is never updated during DPO
# Now policy == reference == SFT model

Related Pages

Implemented By
