Principle:Intel Ipex llm DPO Model Loading
| Knowledge Sources | |
|---|---|
| Domains | NLP, RLHF, Model_Loading |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
Technique for loading both the trainable policy model and the frozen reference model required by Direct Preference Optimization (DPO).
Description
DPO training requires two copies of the same base model: a trainable policy model (4-bit quantized, with LoRA adapters) and a frozen reference model (loaded in NF4) that supplies the reference log-probabilities. The policy model is loaded with a BitsAndBytesConfig quantization configuration, prepared with prepare_model_for_kbit_training, and then wrapped with get_peft_model so that only the LoRA adapter weights are trained. The reference model is loaded separately with load_in_low_bit="nf4" and stays frozen throughout training.
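The loading sequence described above can be sketched as follows. This is a minimal sketch, not a definitive implementation: the helper name load_dpo_models, the LoRA hyperparameters, the target_modules list (which assumes Llama/Mistral-style projection names), and the bfloat16 dtype are all illustrative assumptions. The imports sit inside the function so the sketch can be read and imported on a machine without ipex-llm installed.

```python
def load_dpo_models(model_path, device="xpu"):
    """Return (policy_model, ref_model) for DPO training.

    Hypothetical helper: name, defaults and hyperparameters are assumptions.
    """
    import torch
    from transformers import BitsAndBytesConfig
    from peft import LoraConfig
    from ipex_llm.transformers import AutoModelForCausalLM
    from ipex_llm.transformers.qlora import (
        get_peft_model,
        prepare_model_for_kbit_training,
    )

    # Policy model: 4-bit NF4 quantization described by BitsAndBytesConfig.
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_use_double_quant=False,
        bnb_4bit_compute_dtype=torch.bfloat16,  # assumed compute dtype
    )
    policy = AutoModelForCausalLM.from_pretrained(
        model_path,
        torch_dtype=torch.bfloat16,
        quantization_config=bnb_config,
    )
    policy = policy.to(device)

    # Prepare the quantized model for training, then attach LoRA adapters.
    policy = prepare_model_for_kbit_training(policy)
    lora_config = LoraConfig(
        r=16,               # assumed rank
        lora_alpha=16,      # assumed scaling
        lora_dropout=0.05,  # assumed dropout
        bias="none",
        task_type="CAUSAL_LM",
        # Assumed module names; adjust to the architecture being trained.
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    )
    policy = get_peft_model(policy, lora_config)

    # Reference model: a separate NF4 copy with no adapters attached.
    ref_model = AutoModelForCausalLM.from_pretrained(
        model_path,
        load_in_low_bit="nf4",
        torch_dtype=torch.bfloat16,
    )
    ref_model = ref_model.to(device)
    ref_model.eval()  # inference mode; no adapters are ever added to it
    return policy, ref_model
```

Calling load_dpo_models("path/to/base-model") would return the pair of models to hand to the DPO trainer; the path shown is a placeholder.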
Usage
Use this when setting up DPO training. Both models must be loaded and moved to the XPU device before training starts. The reference model provides the baseline log-probabilities needed to compute the DPO loss.
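Before training, it can help to confirm that the two models are in the state this section expects. The helper below is a hypothetical sanity check, not part of IPEX-LLM or PEFT: it relies only on the standard parameters(), requires_grad, numel() and device.type attributes of PyTorch models, and returns a list of human-readable problems (empty when everything looks right).

```python
def check_dpo_pair(policy, reference, expected_device="xpu"):
    """Return a list of problems found in a (policy, reference) model pair.

    Hypothetical helper: works on any objects exposing a PyTorch-style
    parameters() iterator.
    """

    def stats(model):
        params = list(model.parameters())
        trainable = sum(p.numel() for p in params if p.requires_grad)
        devices = {p.device.type for p in params}
        return trainable, devices

    policy_trainable, policy_devices = stats(policy)
    ref_trainable, ref_devices = stats(reference)

    problems = []
    if policy_trainable == 0:
        problems.append("policy model has no trainable parameters")
    if ref_trainable != 0:
        problems.append("reference model has trainable parameters")
    if policy_devices != {expected_device}:
        problems.append(f"policy model is on {sorted(policy_devices)}")
    if ref_devices != {expected_device}:
        problems.append(f"reference model is on {sorted(ref_devices)}")
    return problems
```

An empty return value means the policy has trainable (adapter) weights, the reference has none, and both sit on the expected device.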
Theoretical Basis
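DPO loss requires log-probabilities from both policy and reference models:

$$
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[ \log \sigma \left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right]
$$

Where $\pi_\theta$ is the policy model (trainable) and $\pi_{\mathrm{ref}}$ is the reference model (frozen). Here $x$ is the prompt, $y_w$ and $y_l$ are the preferred and rejected responses, $\sigma$ is the logistic sigmoid, and $\beta$ controls how strongly the policy is kept close to the reference.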
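To make the role of the two models concrete, here is a minimal sketch of the standard DPO loss for a single preference pair, written in plain Python. The function name dpo_loss, its argument names, and the default beta of 0.1 are illustrative assumptions; each argument is the summed log-probability of a full response under the policy or the reference model.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair of responses."""
    # Log-ratio of policy to reference for each response.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp

    # Scaled margin by which the policy prefers the chosen response
    # more strongly than the reference does.
    margin = beta * (chosen_logratio - rejected_logratio)

    # -log(sigmoid(margin)) = log(1 + exp(-margin)), computed stably.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

When the policy and reference assign identical log-probabilities, the margin is zero and the loss equals log 2; the loss falls below that value once the policy raises the chosen response relative to the rejected one, which is why the frozen reference is needed as a fixed baseline.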