Principle:Intel Ipex llm DPO Model Loading

Knowledge Sources	Direct Preference Optimization, QLoRA, IPEX-LLM
Domains	NLP, RLHF, Model_Loading
Last Updated	2026-02-09 00:00 GMT

Overview

Technique for loading the two models that Direct Preference Optimization (DPO) requires: a trainable policy model and a frozen reference model.

Description

DPO training requires two copies of the model: a trainable policy model and a frozen reference model. The policy model is loaded with 4-bit quantization configured through BitsAndBytesConfig, prepared with prepare_model_for_kbit_training, and wrapped with get_peft_model so that only the LoRA adapter weights are trained. The reference model is loaded separately with load_in_low_bit="nf4", receives no adapters, and stays frozen throughout training; it is used only to compute the reference log-probabilities.
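The loading sequence described above can be sketched as follows. This is a minimal illustration, not the referenced implementation: the function name load_dpo_models, the LoRA hyperparameters, the target_modules list (which assumes Llama-style layer names), the bfloat16 dtype, and the choice of "nf4" as the policy's quantization type are all assumptions made for the example.

```python
def load_dpo_models(model_path, device="xpu", lora_r=16, lora_alpha=16):
    """Return (policy, reference) for DPO; both are placed on `device`.

    Hypothetical helper: the name and defaults are illustrative only.
    """
    # Heavy imports are deferred so that importing this sketch
    # does not require the XPU software stack.
    import torch
    from peft import LoraConfig
    from transformers import BitsAndBytesConfig
    from ipex_llm.transformers import AutoModelForCausalLM
    from ipex_llm.transformers.qlora import (
        get_peft_model,
        prepare_model_for_kbit_training,
    )

    # Policy: 4-bit weights described through BitsAndBytesConfig.
    # The "nf4" quant type and bfloat16 compute dtype are assumptions.
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_use_double_quant=False,
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    policy = AutoModelForCausalLM.from_pretrained(
        model_path,
        quantization_config=bnb_config,
        torch_dtype=torch.bfloat16,
    )
    policy = policy.to(device)
    policy = prepare_model_for_kbit_training(policy)

    # Illustrative LoRA settings; adjust target_modules to the architecture.
    lora_config = LoraConfig(
        r=lora_r,
        lora_alpha=lora_alpha,
        lora_dropout=0.05,
        bias="none",
        task_type="CAUSAL_LM",
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    )
    policy = get_peft_model(policy, lora_config)

    # Reference: a separate NF4 load with no adapters; it is never updated.
    reference = AutoModelForCausalLM.from_pretrained(
        model_path,
        load_in_low_bit="nf4",
        torch_dtype=torch.bfloat16,
    )
    reference = reference.to(device)
    reference.eval()
    return policy, reference
```

Loading the reference model through a second from_pretrained call, rather than copying the policy, keeps it free of LoRA adapters so that its outputs stay fixed while the policy changes.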

Usage

Use this technique when setting up DPO training. Both models must be loaded and moved to XPU before training starts. The reference model provides the baseline log-probabilities against which the policy model's log-probabilities are compared when computing the DPO loss.
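Before training, it can help to confirm that the reference model really is frozen and that the policy model exposes only a small set of trainable parameters. The helpers below are a hedged sketch with hypothetical names (freeze, count_trainable); they assume only a PyTorch-style interface, namely parameters(), eval(), requires_grad, and numel().

```python
def freeze(model):
    """Switch a PyTorch-style model to eval mode and disable all gradients."""
    model.eval()
    for param in model.parameters():
        param.requires_grad = False
    return model


def count_trainable(model):
    """Return (trainable, total) parameter counts for a PyTorch-style model."""
    trainable = 0
    total = 0
    for param in model.parameters():
        n = param.numel()
        total += n
        if param.requires_grad:
            trainable += n
    return trainable, total
```

After calling freeze on the reference model, count_trainable should report zero trainable parameters for it, while the LoRA-wrapped policy model should report a nonzero count that is small relative to its total.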

Theoretical Basis

DPO loss requires log-probabilities from both policy and reference models:

$L_{\mathrm{DPO}} = -\log \sigma\left(\beta \left[\log \frac{\pi_{\theta}(y_{w} \mid x)}{\pi_{\mathrm{ref}}(y_{w} \mid x)} - \log \frac{\pi_{\theta}(y_{l} \mid x)}{\pi_{\mathrm{ref}}(y_{l} \mid x)}\right]\right)$

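Where $\pi_{\theta}$ is the policy model (trainable), $\pi_{\mathrm{ref}}$ is the reference model (frozen), $x$ is the prompt, $y_{w}$ and $y_{l}$ are the preferred and rejected responses, $\sigma$ is the logistic sigmoid, and $\beta$ controls how strongly the policy is kept close to the reference.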
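To make the role of the two models concrete, the per-example loss can be evaluated directly from four sequence log-probabilities, two from the policy and two from the reference. The sketch below uses plain Python; the function name dpo_loss and the default beta=0.1 are assumptions chosen for illustration, not values taken from the source.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO loss from summed sequence log-probabilities."""
    # log(pi_theta / pi_ref) for the preferred and rejected responses.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(z)) equals log(1 + exp(-z)); this form avoids overflow.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Policy identical to the reference: margin is zero.
baseline = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# Policy has shifted probability toward the preferred response.
improved = dpo_loss(-9.0, -13.0, -10.0, -12.0)
```

When the policy and reference agree, the margin is zero and the loss equals $\log 2$; the loss falls below that value as the policy raises the preferred response's log-probability relative to the reference and lowers the rejected one's. This is why the reference log-probabilities must come from a model that does not change during training.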

Related Pages

Implemented By

Implementation:Intel_Ipex_llm_AutoModelForCausalLM_From_Pretrained_DPO

Uses Heuristic

Heuristic:Intel_Ipex_llm_NF4_Quantization_Best_Practice
