Workflow: Intel IPEX-LLM DPO Fine-Tuning
| Knowledge Sources | |
|---|---|
| Domains | LLMs, Fine_Tuning, RLHF |
| Last Updated | 2026-02-09 04:00 GMT |
Overview
End-to-end process for aligning Large Language Models with human preferences using Direct Preference Optimization (DPO) on Intel GPUs with IPEX-LLM.
Description
This workflow implements Direct Preference Optimization (DPO), a method for aligning language models with human preferences without requiring a separate reward model. It uses paired preference data (chosen vs. rejected responses) to directly optimize the model's policy. The workflow loads a base model with 4-bit NF4 quantization via IPEX-LLM, applies LoRA adapters for parameter-efficient training, loads a separate reference model for the DPO loss computation, and trains using TRL's DPOTrainer. This approach is simpler and more stable than traditional RLHF with PPO, while achieving comparable alignment quality.
Usage
Execute this workflow when you have a preference dataset containing paired chosen/rejected responses (such as Intel/orca_dpo_pairs) and want to align an instruction-tuned model to better match human preferences. Requires an Intel GPU with sufficient memory to hold both the training model and the reference model simultaneously (approximately 2x the single-model memory requirement).
Execution Steps
Step 1: Environment Setup
Configure the Intel GPU runtime environment with oneAPI toolkit variables and XPU settings. Ensure sufficient GPU memory is available for both the main model and the reference model (both loaded in 4-bit quantization).
Key considerations:
- DPO requires approximately 2x GPU memory compared to standard fine-tuning (two model copies)
- Both models are quantized to 4-bit to manage memory constraints
- Single-GPU workflow (no distributed training in this example)
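A minimal environment setup might look like the following. The oneAPI install path and the specific runtime flags are assumptions based on common IPEX-LLM GPU setups; adjust them for your installation.

```shell
# Activate the Intel oneAPI toolchain (default installer location; adjust to yours)
source /opt/intel/oneapi/setvars.sh

# Runtime flags commonly recommended for IPEX-LLM on Intel GPUs (assumed defaults)
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export SYCL_CACHE_PERSISTENT=1
```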
Step 2: Data Preparation
Load the preference dataset and format it into the DPO-expected structure with prompt, chosen, and rejected fields. Apply the model's chat template (e.g., ChatML format) to wrap system messages, user instructions, and responses with appropriate special tokens. Remove original columns and retain only the formatted fields.
Key considerations:
- Dataset must have paired chosen/rejected responses for each prompt
- Chat template formatting ensures proper token boundaries
- The chatml_format function maps the dataset's system/question/chosen/rejected fields into the prompt/chosen/rejected structure DPO expects
- Tokenizer padding side should be set to "left" for DPO training
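The mapping above can be sketched as a plain function; the field names (`system`, `question`, `chosen`, `rejected`) follow the Intel/orca_dpo_pairs schema, and the ChatML special tokens are `<|im_start|>` / `<|im_end|>`.

```python
# Sketch of the chatml_format mapping described above. The prompt ends with an
# open assistant turn; the chosen/rejected completions are appended after it.
def chatml_format(example):
    # Optional system turn
    if example.get("system"):
        system = f"<|im_start|>system\n{example['system']}<|im_end|>\n"
    else:
        system = ""
    # User instruction, then an open assistant turn
    prompt = (
        f"<|im_start|>user\n{example['question']}<|im_end|>\n"
        f"<|im_start|>assistant\n"
    )
    return {
        "prompt": system + prompt,
        "chosen": example["chosen"] + "<|im_end|>\n",
        "rejected": example["rejected"] + "<|im_end|>\n",
    }
```

With Hugging Face `datasets`, this would typically be applied as `dataset.map(chatml_format, remove_columns=original_columns)` so that only the three formatted fields remain.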
Step 3: Model Loading with Quantization
Load the main training model using IPEX-LLM's AutoModelForCausalLM with BitsAndBytesConfig for NF4 4-bit quantization. Move to XPU device. Prepare for k-bit training and apply LoRA adapters targeting all linear layers. Then load a separate reference model (also in 4-bit) that remains frozen during training and serves as the baseline for the DPO loss calculation.
Key considerations:
- Main model gets LoRA adapters; reference model stays frozen
- Both models use NF4 quantization for memory efficiency
- Reference model uses load_in_low_bit="nf4" with optimize_model=False
- LoRA rank 16 with alpha 16 targets all attention and FFN projection layers
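A sketch of this loading step is below. It requires an Intel GPU with the `ipex-llm`, `transformers`, and `peft` packages; the base model name is an illustrative example, and the `ipex_llm.transformers.qlora` import path follows IPEX-LLM's QLoRA examples but may vary by version.

```python
# Sketch only: hardware-dependent configuration, not runnable without an Intel XPU.
import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig
from ipex_llm.transformers import AutoModelForCausalLM
from ipex_llm.transformers.qlora import get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=False,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Main (trainable) model: NF4 4-bit quantization plus LoRA adapters
model = AutoModelForCausalLM.from_pretrained(
    "teknium/OpenHermes-2.5-Mistral-7B",  # example base model (assumption)
    quantization_config=bnb_config,
).to("xpu")
model = prepare_model_for_kbit_training(model)
peft_config = LoraConfig(
    r=16, lora_alpha=16, lora_dropout=0.05, bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)
model = get_peft_model(model, peft_config)

# Frozen reference model for the DPO loss, also in 4-bit
ref_model = AutoModelForCausalLM.from_pretrained(
    "teknium/OpenHermes-2.5-Mistral-7B",
    load_in_low_bit="nf4",
    optimize_model=False,
).to("xpu")
```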
Step 4: DPO Training
Configure TRL's DPOConfig with training hyperparameters including the DPO beta parameter (controlling deviation from the reference policy), prompt and response length limits, learning rate, and optimizer settings. Create a DPOTrainer with both the main and reference models, and launch training. The DPO loss directly optimizes the model to prefer chosen responses over rejected ones relative to the reference model's preferences.
Key considerations:
- Beta parameter (default 0.1) controls KL divergence penalty from reference model
- max_prompt_length and max_length control memory usage during training
- AdamW optimizer used (paged_adamw not yet supported on XPU)
- bf16=True for training stability on Intel GPUs
- Warmup steps help stabilize early training
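The per-pair DPO loss that the trainer optimizes can be illustrated with sequence-level log-probabilities; this toy stand-in shows how beta scales the implicit reward margin against the reference model.

```python
import math

# Toy per-pair DPO loss on summed log-probabilities: a simplified stand-in
# for what DPOTrainer computes internally over token-level logits.
def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    # Implicit rewards: log-prob shift relative to the frozen reference model
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)): shrinks as the model prefers chosen over rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the policy matches the reference, the margin is zero and the loss is log 2; as the policy assigns relatively more probability to the chosen response, the loss falls, with larger beta amplifying the effect.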
Step 5: Model Export
Save the trained model (with LoRA adapters) and tokenizer to the output directory. The saved artifacts can be loaded for inference or further merged with the base model.
Key considerations:
- Saves both model weights and tokenizer configuration
- LoRA adapter can be merged with base model for standalone deployment
- Output directory contains the fine-tuned model ready for evaluation
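The export and optional merge can be sketched as follows. Paths and the base model name are illustrative, and the merge step assumes the standard `peft` API; merging requires reloading the base model in full precision rather than 4-bit.

```python
# Sketch only: assumes trainer/tokenizer from the training step.
trainer.save_model("./dpo-output")          # LoRA adapter weights + config
tokenizer.save_pretrained("./dpo-output")

# Optional: merge the adapter into the base model for standalone deployment
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained(
    "teknium/OpenHermes-2.5-Mistral-7B",    # example base model (assumption)
    torch_dtype="auto",
)
merged = PeftModel.from_pretrained(base, "./dpo-output").merge_and_unload()
merged.save_pretrained("./dpo-merged")
tokenizer.save_pretrained("./dpo-merged")
```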