Workflow: Intel IPEX-LLM DPO Fine-Tuning
| Knowledge Sources | |
|---|---|
| Domains | LLMs, Fine_Tuning, RLHF |
| Last Updated | 2026-02-09 04:00 GMT |
Overview
End-to-end process for aligning Large Language Models with human preferences using Direct Preference Optimization (DPO) on Intel GPUs with IPEX-LLM.
Description
This workflow implements Direct Preference Optimization (DPO), a method for aligning language models with human preferences without requiring a separate reward model. It uses paired preference data (chosen vs. rejected responses) to directly optimize the model's policy. The workflow loads a base model with 4-bit NF4 quantization via IPEX-LLM, applies LoRA adapters for parameter-efficient training, loads a separate reference model for the DPO loss computation, and trains using TRL's DPOTrainer. This approach is simpler and more stable than traditional RLHF with PPO, while achieving comparable alignment quality.
Usage
Execute this workflow when you have a preference dataset containing paired chosen/rejected responses (such as Intel/orca_dpo_pairs) and want to align an instruction-tuned model to better match human preferences. Requires an Intel GPU with sufficient memory to hold both the training model and the reference model simultaneously (approximately 2x the single-model memory requirement).
Execution Steps
Step 1: Environment Setup
Configure the Intel GPU runtime environment with oneAPI toolkit variables and XPU settings. Ensure sufficient GPU memory is available for both the main model and the reference model (both loaded in 4-bit quantization).
Key considerations:
- DPO requires approximately 2x GPU memory compared to standard fine-tuning (two model copies)
- Both models are quantized to 4-bit to manage memory constraints
- Single-GPU workflow (no distributed training in this example)
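A minimal environment setup might look like the following. The oneAPI install path and the specific runtime flags are assumptions based on common IPEX-LLM GPU setups; adjust them for your installation.

```shell
# Activate the Intel oneAPI toolchain (default installer location; adjust to yours)
source /opt/intel/oneapi/setvars.sh

# Runtime flags commonly recommended for IPEX-LLM on Intel GPUs (assumed defaults)
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export SYCL_CACHE_PERSISTENT=1
```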
Step 2: Data Preparation
Load the preference dataset and format it into the DPO-expected structure with prompt, chosen, and rejected fields. Apply the model's chat template (e.g., ChatML format) to wrap system messages, user instructions, and responses with appropriate special tokens. Remove original columns and retain only the formatted fields.
Key considerations:
- Dataset must have paired chosen/rejected responses for each prompt
- Chat template formatting ensures proper token boundaries
- The chatml_format function maps the dataset's system/question/chosen/rejected fields into the prompt/chosen/rejected structure DPO expects
- Tokenizer padding side should be set to "left" for DPO training
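The mapping above can be sketched as a plain function; the field names (`system`, `question`, `chosen`, `rejected`) follow the Intel/orca_dpo_pairs schema, and the ChatML special tokens are `<|im_start|>` / `<|im_end|>`.

```python
# Sketch of the chatml_format mapping described above. The prompt ends with an
# open assistant turn; the chosen/rejected completions are appended after it.
def chatml_format(example):
    # Optional system turn
    if example.get("system"):
        system = f"<|im_start|>system\n{example['system']}<|im_end|>\n"
    else:
        system = ""
    # User instruction, then an open assistant turn
    prompt = (
        f"<|im_start|>user\n{example['question']}<|im_end|>\n"
        f"<|im_start|>assistant\n"
    )
    return {
        "prompt": system + prompt,
        "chosen": example["chosen"] + "<|im_end|>\n",
        "rejected": example["rejected"] + "<|im_end|>\n",
    }
```

With Hugging Face `datasets`, this would typically be applied as `dataset.map(chatml_format, remove_columns=original_columns)` so that only the three formatted fields remain.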
Step 3: Model Loading with Quantization
Load the main training model using IPEX-LLM's AutoModelForCausalLM with BitsAndBytesConfig for NF4 4-bit quantization. Move to XPU device. Prepare for k-bit training and apply LoRA adapters targeting all linear layers. Then load a separate reference model (also in 4-bit) that remains frozen during training and serves as the baseline for the DPO loss calculation.
Key considerations:
- Main model gets LoRA adapters; reference model stays frozen
- Both models use NF4 quantization for memory efficiency
- Reference model uses load_in_low_bit="nf4" with optimize_model=False
- LoRA rank 16 with alpha 16 targets all attention and FFN projection layers
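A sketch of this loading step is below. It requires an Intel GPU with the `ipex-llm`, `transformers`, and `peft` packages; the base model name is an illustrative example, and the `ipex_llm.transformers.qlora` import path follows IPEX-LLM's QLoRA examples but may vary by version.

```python
# Sketch only: hardware-dependent configuration, not runnable without an Intel XPU.
import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig
from ipex_llm.transformers import AutoModelForCausalLM
from ipex_llm.transformers.qlora import get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=False,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Main (trainable) model: NF4 4-bit quantization plus LoRA adapters
model = AutoModelForCausalLM.from_pretrained(
    "teknium/OpenHermes-2.5-Mistral-7B",  # example base model (assumption)
    quantization_config=bnb_config,
).to("xpu")
model = prepare_model_for_kbit_training(model)
peft_config = LoraConfig(
    r=16, lora_alpha=16, lora_dropout=0.05, bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)
model = get_peft_model(model, peft_config)

# Frozen reference model for the DPO loss, also in 4-bit
ref_model = AutoModelForCausalLM.from_pretrained(
    "teknium/OpenHermes-2.5-Mistral-7B",
    load_in_low_bit="nf4",
    optimize_model=False,
).to("xpu")
```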
Step 4: DPO Training
Configure TRL's DPOConfig with training hyperparameters including the DPO beta parameter (controlling deviation from the reference policy), prompt and response length limits, learning rate, and optimizer settings. Create a DPOTrainer with both the main and reference models, and launch training. The DPO loss directly optimizes the model to prefer chosen responses over rejected ones relative to the reference model's preferences.
Key considerations:
- Beta parameter (default 0.1) controls KL divergence penalty from reference model
- max_prompt_length and max_length control memory usage during training
- AdamW optimizer used (paged_adamw not yet supported on XPU)
- bf16=True for training stability on Intel GPUs
- Warmup steps help stabilize early training
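The per-pair DPO loss that the trainer optimizes can be illustrated with sequence-level log-probabilities; this toy stand-in shows how beta scales the implicit reward margin against the reference model.

```python
import math

# Toy per-pair DPO loss on summed log-probabilities: a simplified stand-in
# for what DPOTrainer computes internally over token-level logits.
def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    # Implicit rewards: log-prob shift relative to the frozen reference model
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)): shrinks as the model prefers chosen over rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the policy matches the reference, the margin is zero and the loss is log 2; as the policy assigns relatively more probability to the chosen response, the loss falls, with larger beta amplifying the effect.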
Step 5: Model Export
Save the trained model (with LoRA adapters) and tokenizer to the output directory. The saved artifacts can be loaded for inference or further merged with the base model.
Key considerations:
- Saves both model weights and tokenizer configuration
- LoRA adapter can be merged with base model for standalone deployment
- Output directory contains the fine-tuned model ready for evaluation
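The export and optional merge can be sketched as follows. Paths and the base model name are illustrative, and the merge step assumes the standard `peft` API; merging requires reloading the base model in full precision rather than 4-bit.

```python
# Sketch only: assumes trainer/tokenizer from the training step.
trainer.save_model("./dpo-output")          # LoRA adapter weights + config
tokenizer.save_pretrained("./dpo-output")

# Optional: merge the adapter into the base model for standalone deployment
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained(
    "teknium/OpenHermes-2.5-Mistral-7B",    # example base model (assumption)
    torch_dtype="auto",
)
merged = PeftModel.from_pretrained(base, "./dpo-output").merge_and_unload()
merged.save_pretrained("./dpo-merged")
tokenizer.save_pretrained("./dpo-merged")
```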