
Principle: Eric Mitchell Direct Preference Optimization SFT Checkpoint Loading

From Leeroopedia


Knowledge Sources
Domains Transfer_Learning, Checkpointing, Deep_Learning
Last Updated 2026-02-08 02:00 GMT

Overview

A model initialization technique that loads pre-trained SFT weights into both the policy and reference models as the starting point for DPO training.

Description

SFT checkpoint loading bridges the two-stage training pipeline by loading the weights produced during SFT training into the models used for DPO training. Both the policy model (which will be further trained) and the reference model (which remains frozen) are initialized from the same SFT checkpoint. This ensures:

  • The policy starts from an in-distribution model that has learned to follow the prompt format
  • The reference model captures the SFT baseline for computing KL-divergence penalties
  • Both models share identical initial weights, so the policy–reference log-ratios (and hence the KL penalty) start at zero
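The initialization described above can be sketched in PyTorch. This is an illustrative assumption, not the actual DPO codebase: `init_from_sft` is a hypothetical helper, and a tiny `nn.Linear` stands in for the real language model.

```python
import copy
import torch
import torch.nn as nn

def init_from_sft(sft_state: dict):
    """Initialize policy and reference models from the same SFT state dict."""
    policy = nn.Linear(4, 4)     # stand-in for the real LM architecture
    reference = nn.Linear(4, 4)
    policy.load_state_dict(sft_state)
    reference.load_state_dict(sft_state)
    # The reference model stays frozen for the whole DPO run
    for p in reference.parameters():
        p.requires_grad_(False)
    reference.eval()
    return policy, reference

sft_model = nn.Linear(4, 4)  # pretend this is the trained SFT checkpoint
policy, reference = init_from_sft(copy.deepcopy(sft_model.state_dict()))
```

After this step the two models are bitwise-identical, and only the policy accumulates gradients.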

Usage

Use this principle when transitioning from SFT to DPO training. The checkpoint path is specified via config.model.archive. If archive is None, this step is skipped (useful for running DPO directly from pre-trained weights).
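The skip-if-None behavior can be sketched as follows; `maybe_load_archive` is a hypothetical helper whose `archive` argument mirrors `config.model.archive`, not a function from the actual codebase.

```python
from typing import Optional

import torch
import torch.nn as nn

def maybe_load_archive(model: nn.Module, archive: Optional[str]) -> bool:
    """Load an SFT checkpoint into `model` if a path is given; otherwise skip."""
    if archive is None:
        return False  # no archive: run DPO directly from pre-trained weights
    state = torch.load(archive, map_location="cpu")
    model.load_state_dict(state)
    return True
```

The boolean return value makes it easy to log whether the SFT checkpoint was applied.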

Theoretical Basis

In the DPO framework, the reference policy πref represents the model we want the trained policy to stay close to. Initializing both from SFT weights means:

π_θ^(0) = π_ref = π_SFT

At the start of DPO training, D_KL[π_θ ‖ π_ref] = 0, and the loss function pulls the policy toward the preference data while the KL penalty keeps it from diverging too far from the SFT baseline.
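The zero-divergence starting point can be checked numerically. Under the assumption that policy and reference are identical at step 0, every log-ratio log π_θ(y|x) − log π_ref(y|x) vanishes, so the KL term is zero and each per-pair DPO loss equals −log σ(0) = log 2. The log-probability values and β below are illustrative.

```python
import math
import torch
import torch.nn.functional as F

beta = 0.1  # illustrative KL-strength coefficient
# Log-probs of chosen/rejected completions; at step 0 the policy and
# reference assign identical values, so every log-ratio is exactly zero.
logp_chosen = torch.tensor([-12.3, -8.7])
logp_rejected = torch.tensor([-14.1, -9.9])
ratio_chosen = logp_chosen - logp_chosen        # policy == reference -> 0
ratio_rejected = logp_rejected - logp_rejected  # policy == reference -> 0
loss = -F.logsigmoid(beta * (ratio_chosen - ratio_rejected)).mean()
```

The loss then decreases as training moves the policy's implicit reward margin above zero.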

Pseudo-code:

# Abstract checkpoint loading (NOT actual implementation)
checkpoint = load("policy.pt")
policy.load_weights(checkpoint.state)
reference.load_weights(checkpoint.state)
reference.freeze()  # the reference model is never updated during DPO
# Now policy == reference == SFT model

Related Pages

Implemented By
