Principle:Intel Ipex llm DPO Model Loading

From Leeroopedia


Knowledge Sources
Domains: NLP, RLHF, Model_Loading
Last Updated: 2026-02-09 00:00 GMT

Overview

Technique for loading both the trainable policy model and frozen reference model required by Direct Preference Optimization.

Description

DPO training requires two copies of the model: a trainable policy model (loaded with 4-bit quantization and LoRA adapters) and a frozen reference model (loaded in NF4 for computing the reference log-probabilities). The policy model uses BitsAndBytesConfig for quantization, then is prepared with prepare_model_for_kbit_training and wrapped with get_peft_model. The reference model is loaded separately with load_in_low_bit="nf4" and kept frozen throughout training.
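The loading sequence described above can be sketched roughly as follows. This is a minimal sketch, not the reference implementation: it assumes that `AutoModelForCausalLM` comes from `ipex_llm.transformers`, that `prepare_model_for_kbit_training` and `get_peft_model` come from `ipex_llm.transformers.qlora`, and that `LoraConfig` comes from `peft`. The function name `load_dpo_models`, the NF4 quantization type for the policy model, and all LoRA hyperparameters are illustrative assumptions that do not appear in this page.

```python
def load_dpo_models(model_path: str, device: str = "xpu"):
    """Return (policy_model, ref_model) for DPO training. Sketch only."""
    # Heavy dependencies are imported lazily so this module can be imported
    # on machines without ipex-llm or an Intel GPU. Import paths are assumed.
    import torch
    from peft import LoraConfig
    from transformers import BitsAndBytesConfig
    from ipex_llm.transformers import AutoModelForCausalLM
    from ipex_llm.transformers.qlora import (
        get_peft_model,
        prepare_model_for_kbit_training,
    )

    # Policy model: 4-bit quantization requested through BitsAndBytesConfig.
    # The "nf4" quant type and bfloat16 compute dtype are assumptions.
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_use_double_quant=False,
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    policy = AutoModelForCausalLM.from_pretrained(
        model_path,
        quantization_config=bnb_config,
        torch_dtype=torch.bfloat16,
    ).to(device)

    # Prepare the quantized model for training, then attach LoRA adapters.
    policy = prepare_model_for_kbit_training(policy)
    lora_config = LoraConfig(  # illustrative hyperparameters, not from the source
        r=16,
        lora_alpha=16,
        lora_dropout=0.05,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
        bias="none",
        task_type="CAUSAL_LM",
    )
    policy = get_peft_model(policy, lora_config)

    # Reference model: loaded separately in NF4 and kept frozen.
    ref = AutoModelForCausalLM.from_pretrained(
        model_path,
        load_in_low_bit="nf4",
        torch_dtype=torch.bfloat16,
    ).to(device)
    ref.eval()
    for param in ref.parameters():
        param.requires_grad_(False)

    return policy, ref
```

Freezing the reference model explicitly (`eval()` plus `requires_grad_(False)`) is a defensive choice in this sketch; a DPO trainer may already handle this, but doing it at load time makes the intent visible.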

Usage

Use this when setting up DPO training. Both models must be loaded and moved to XPU. The reference model provides the baseline log-probabilities needed to compute the DPO loss.

Theoretical Basis

DPO loss requires log-probabilities from both policy and reference models:

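$$\mathcal{L}_{\mathrm{DPO}} = -\log \sigma\!\left(\beta\left[\log\frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \log\frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right]\right)$$

Where πθ is the policy model (trainable), πref is the reference model (frozen), x is the prompt, yw and yl are the preferred and rejected responses, β scales the log-ratio difference, and σ is the logistic sigmoid. The expression is the loss for a single preference pair; in training it is averaged over the dataset.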
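To make the role of the two models concrete, the per-pair loss can be sketched in plain Python. This sketch uses the standard DPO form with a leading minus sign, so lower values are better. The function names, argument names, and the default `beta=0.1` are illustrative assumptions; the inputs are the summed sequence log-probabilities that the policy and reference models assign to each response.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    # Branch on the sign of z so that exp() never receives a large positive value.
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """DPO loss for one (prompt, chosen, rejected) preference pair."""
    # log(pi_theta / pi_ref) for each response, computed in log space.
    chosen_log_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_log_ratio = policy_rejected_logp - ref_rejected_logp
    return -log_sigmoid(beta * (chosen_log_ratio - rejected_log_ratio))
```

When the policy and reference models assign identical log-probabilities, both log-ratios are zero and the loss equals log 2. The loss falls below that value as the policy raises the preferred response's probability relative to the reference, and rises above it when the policy favors the rejected response.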

