Principle: SFT Checkpoint Loading for Direct Preference Optimization (DPO)
| Knowledge Sources | Eric Mitchell's DPO reference implementation |
|---|---|
| Domains | Transfer_Learning, Checkpointing, Deep_Learning |
| Last Updated | 2026-02-08 02:00 GMT |
Overview
A model-initialization technique that loads pre-trained supervised fine-tuning (SFT) weights into both the policy and reference models as the starting point for DPO training.
Description
SFT checkpoint loading bridges the two-stage training pipeline by loading the weights produced during SFT training into the models used for DPO training. Both the policy model (which will be further trained) and the reference model (which remains frozen) are initialized from the same SFT checkpoint. This ensures:
- The policy starts from an in-distribution model that has learned to follow the prompt format
- The reference model captures the SFT baseline for computing KL-divergence penalties
- Both models share identical initial weights, so the implicit reward margin starts at zero and the DPO loss begins at -log σ(0) = log 2
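The shared-initialization property in the list above can be checked directly. The sketch below uses a small `nn.Linear` as a stand-in for the language model (a hypothetical example, not the actual trainer code): because the reference model is a frozen deep copy of the policy, the per-token log-ratio log π_θ(y|x) − log π_ref(y|x) is exactly zero before any DPO updates.

```python
# Sketch: identical weights => identical log-probs => zero log-ratio at init.
# `nn.Linear` stands in for the SFT language model (illustrative only).
import copy

import torch
import torch.nn as nn

torch.manual_seed(0)
policy = nn.Linear(8, 4)           # will be trained by DPO
reference = copy.deepcopy(policy)  # same SFT weights, kept frozen
for p in reference.parameters():
    p.requires_grad_(False)

x = torch.randn(2, 8)
logp_policy = torch.log_softmax(policy(x), dim=-1)
logp_ref = torch.log_softmax(reference(x), dim=-1)
log_ratio = logp_policy - logp_ref  # zero everywhere at initialization
print(bool(torch.allclose(log_ratio, torch.zeros_like(log_ratio))))  # True
```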
Usage
Use this principle when transitioning from SFT to DPO training. The checkpoint path is specified via `config.model.archive`. If `archive` is `None`, this step is skipped (useful for running DPO directly from pre-trained weights).
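The conditional load described above can be sketched as a small helper. The function name `maybe_load_sft_checkpoint` and the checkpoint layout (`{"state": state_dict}`) are assumptions for illustration, not the trainer's exact API:

```python
# Sketch (hypothetical helper): load SFT weights into both models when
# config.model.archive is set; skip the step when it is None.
import torch


def maybe_load_sft_checkpoint(config, policy, reference):
    """Initialize policy and reference from the same SFT checkpoint."""
    archive = config.model.archive
    if archive is None:
        return False  # run DPO directly from pre-trained weights
    state = torch.load(archive, map_location="cpu")["state"]  # assumed layout
    policy.load_state_dict(state)
    reference.load_state_dict(state)
    return True
```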
Theoretical Basis
In the DPO framework, the reference policy represents the model we want the trained policy to stay close to. Initializing both from SFT weights means:
At the start of DPO training, π_θ = π_ref, so every log-ratio log(π_θ(y|x) / π_ref(y|x)) is zero. From there the loss pulls the policy toward the preferred responses, while the KL penalty prevents it from diverging too far from the SFT baseline.
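A quick numeric check of this starting point: with identical policy and reference models, the β-scaled implicit reward margin is zero, so the DPO loss is -log σ(0) = log 2 ≈ 0.6931. The log-prob values below are illustrative numbers, not outputs of a real model:

```python
# With pi_theta == pi_ref, the implicit reward margin is zero and the
# DPO loss is -log(sigmoid(0)) = log(2). Log-probs here are illustrative.
import math

beta = 0.1
# Sequence log-probs for chosen (w) and rejected (l) responses;
# identical models give identical policy and reference values.
logp_w_policy, logp_w_ref = -12.3, -12.3
logp_l_policy, logp_l_ref = -15.7, -15.7

margin = beta * ((logp_w_policy - logp_w_ref) - (logp_l_policy - logp_l_ref))
loss = -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log(sigmoid(margin))
print(round(loss, 4))  # 0.6931 == round(log(2), 4)
```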
Pseudo-code:
```python
# Abstract checkpoint loading (NOT actual implementation)
checkpoint = load("policy.pt")
policy.load_weights(checkpoint.state)
reference.load_weights(checkpoint.state)
# Now policy == reference == SFT model
```
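The abstract steps above map onto PyTorch roughly as follows. This is a sketch under assumptions: an in-memory buffer stands in for the `policy.pt` file, a small `nn.Linear` stands in for the SFT model, and the `{"state": state_dict}` checkpoint layout is assumed. Note that the reference copy is also frozen and put in eval mode, since it must stay fixed throughout DPO training:

```python
# Sketch of SFT checkpoint loading in PyTorch (assumed checkpoint layout).
import copy
import io

import torch
import torch.nn as nn

policy = nn.Linear(16, 16)  # stand-in for the SFT-architecture LM

# In-memory stand-in for saving/loading "policy.pt".
buf = io.BytesIO()
torch.save({"state": policy.state_dict()}, buf)
buf.seek(0)
checkpoint = torch.load(buf, map_location="cpu")

# Load the same SFT weights into both models.
policy.load_state_dict(checkpoint["state"])
reference = copy.deepcopy(policy)

# The reference stays frozen: no gradients, eval mode.
reference.eval()
for p in reference.parameters():
    p.requires_grad_(False)
```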