Principle: Hugging Face TRL DPO Policy Model Loading
| Knowledge Sources | |
|---|---|
| Domains | NLP, RLHF |
| Last Updated | 2026-02-06 17:00 GMT |
Overview
Loading and initializing the policy model that will be optimized through preference learning is the foundational step in any DPO training pipeline.
Description
In Direct Preference Optimization, the policy model (also called the active model) is the language model whose parameters are updated during training. This model learns to assign higher probability to chosen responses and lower probability to rejected responses relative to a reference distribution.
The policy model is typically initialized from a pretrained or supervised fine-tuned (SFT) checkpoint. Starting from an SFT model is standard practice because:
- The model already has reasonable language generation capabilities
- The DPO objective refines preferences rather than teaching basic fluency
- The reference model (which anchors the KL penalty) should match the policy's starting point
Key considerations when loading the policy model include:
- Model architecture: DPO operates on causal language models (decoder-only architectures like GPT, LLaMA, Qwen, Mistral) or encoder-decoder models. The model must support computing per-token log probabilities over completions.
- Precision and quantization: For memory efficiency, models can be loaded in reduced precision (bf16, fp16) or with quantization (4-bit, 8-bit via bitsandbytes). Quantization is especially important for large models that would otherwise not fit in GPU memory. When quantization is used, a device map is configured to distribute model layers across available devices.
- Attention implementation: Different attention backends (standard, Flash Attention 2, SDPA) can be selected for performance optimization. Flash Attention 2 is required for padding-free training.
- Trust remote code: When loading community models from the Hugging Face Hub, the trust_remote_code flag must be enabled to allow execution of custom model code shipped with the checkpoint.
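The considerations above can be sketched in code. This is a hedged, minimal recipe, not a definitive implementation: the checkpoint name "my-org/sft-checkpoint" is a hypothetical placeholder, and the transformers/bitsandbytes imports are only exercised when the loading function is actually called.

```python
# Sketch: loading a DPO policy model with reduced precision, optional
# 4-bit quantization, and a chosen attention backend.

def build_load_kwargs(quantize_4bit: bool = False) -> dict:
    """Assemble from_pretrained keyword arguments for the policy model."""
    kwargs = {
        "torch_dtype": "bfloat16",        # reduced-precision weights (bf16)
        "attn_implementation": "sdpa",    # or "flash_attention_2"
        "trust_remote_code": False,       # enable only for trusted custom code
    }
    if quantize_4bit:
        from transformers import BitsAndBytesConfig

        # bitsandbytes 4-bit quantization plus automatic layer placement
        # across the available devices.
        kwargs["quantization_config"] = BitsAndBytesConfig(load_in_4bit=True)
        kwargs["device_map"] = "auto"
    return kwargs


def load_policy_model(checkpoint: str = "my-org/sft-checkpoint"):
    """Load the policy model from an SFT checkpoint (name is illustrative)."""
    from transformers import AutoModelForCausalLM

    return AutoModelForCausalLM.from_pretrained(checkpoint, **build_load_kwargs())
```

For full-precision multi-GPU training without quantization, the device placement is usually left to the trainer or to accelerate rather than set via device_map.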
Usage
Load the policy model when:
- Starting a new DPO training run from a pretrained or SFT checkpoint
- Resuming DPO training from a saved checkpoint
- Running DPO with quantized models to reduce memory footprint
- Setting up multi-GPU training where the model needs specific device placement
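Once loaded, the policy model is handed to the trainer. The following is a hedged sketch of that wiring, assuming the DPOTrainer/DPOConfig API of recent TRL versions and a preference dataset with prompt/chosen/rejected columns; the checkpoint name and dataset are illustrative placeholders.

```python
# Illustrative DPO hyperparameters; beta scales the KL anchor to the
# reference distribution.
dpo_kwargs = {
    "output_dir": "dpo-out",
    "beta": 0.1,
    "per_device_train_batch_size": 2,
}


def run_dpo(preference_dataset):
    """Wire a loaded policy model into TRL's DPOTrainer (sketch)."""
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from trl import DPOConfig, DPOTrainer

    model_id = "my-org/sft-checkpoint"   # hypothetical SFT checkpoint
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="bfloat16")
    tokenizer = AutoTokenizer.from_pretrained(model_id)

    trainer = DPOTrainer(
        model=model,                       # policy model pi_theta
        args=DPOConfig(**dpo_kwargs),
        train_dataset=preference_dataset,  # prompt/chosen/rejected columns
        processing_class=tokenizer,
    )
    trainer.train()  # with no explicit ref_model, TRL derives one from the policy
```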
Theoretical Basis
In the DPO framework, the policy model pi_theta is the parameterized distribution being optimized. The training objective adjusts theta to maximize the likelihood of preferred responses while staying close to the reference distribution:
max_theta E_{(x, y_w, y_l) ~ D} [ log sigma( beta * ( log(pi_theta(y_w|x)/pi_ref(y_w|x)) - log(pi_theta(y_l|x)/pi_ref(y_l|x)) ) ) ]
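The objective above can be checked numerically. Below is a minimal pure-Python sketch of the per-example DPO loss (the negative of the bracketed term, which training minimizes), using a numerically stable form of -log sigma; the inputs are summed completion log-probabilities under the policy and reference models.

```python
import math


def neg_log_sigmoid(m: float) -> float:
    """Numerically stable -log(sigmoid(m)) = log(1 + exp(-m))."""
    if m >= 0:
        return math.log1p(math.exp(-m))
    return -m + math.log1p(math.exp(m))


def dpo_loss(logp_w: float, logp_l: float,
             ref_logp_w: float, ref_logp_l: float,
             beta: float = 0.1) -> float:
    """Per-example DPO loss from summed completion log-probabilities."""
    # Implicit reward margin: beta * (chosen log-ratio - rejected log-ratio).
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return neg_log_sigmoid(margin)
```

When the policy matches the reference the margin is zero and the loss is log 2 (about 0.693); the loss falls as the policy widens the chosen-vs-rejected gap relative to the reference.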
The policy model pi_theta provides the forward pass that computes log pi_theta(y|x) for both the chosen and rejected completions. These log probabilities are computed per-token and summed over the completion tokens (excluding the prompt).
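A toy sketch of that summation, assuming per-token log-probabilities for the full sequence and a 0/1 mask that marks completion positions (prompt positions are zeroed out):

```python
def completion_logprob(token_logps, completion_mask):
    """Sum per-token log-probs over completion tokens only.

    Prompt positions (mask value 0) are excluded from the sum.
    """
    return sum(lp for lp, keep in zip(token_logps, completion_mask) if keep)


# Two prompt tokens (masked out) followed by two completion tokens:
logps = [-0.5, -1.0, -0.2, -0.3]
mask = [0, 0, 1, 1]
total = completion_logprob(logps, mask)  # approximately -0.5
```

In an actual implementation these per-token values come from a forward pass: log-softmax over the vocabulary, gathered at the target token ids, with the mask built from the prompt/completion boundary.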
The quality of the initial policy model directly affects training dynamics. If the policy starts far from the distribution that generated the preference data, the implicit rewards may be poorly calibrated, leading to unstable training.