Workflow:Intel Ipex llm QLoRA Finetuning

Knowledge Sources	IPEX-LLM, IPEX-LLM Finetune Guide, QLoRA Paper
Domains	LLMs, Fine_Tuning, Quantization
Last Updated	2026-02-09 04:00 GMT

Overview

End-to-end process for parameter-efficient fine-tuning of Large Language Models on Intel GPUs using QLoRA (Quantized Low-Rank Adaptation) with IPEX-LLM.

Description

This workflow covers the complete QLoRA fine-tuning pipeline on Intel XPU hardware. It uses 4-bit NormalFloat (NF4) quantization, configured through a BitsAndBytesConfig, to compress the base model, then injects trainable low-rank adapter matrices into the attention and feedforward layers of the frozen model. Only the small adapter weights are trained, which sharply reduces memory requirements and enables fine-tuning of 7B-70B parameter models on Intel Arc, Flex, and Max GPUs. The process covers environment setup, data formatting with prompt templates, model quantization and loading, LoRA adapter injection, single-card or distributed training (optionally with DeepSpeed), and adapter export/merging.

Usage

Execute this workflow when you have an instruction-tuning dataset (such as Alpaca format with instruction/input/output fields) and need to adapt a base LLM (Llama-2, Llama-3, ChatGLM, Qwen, Baichuan, Gemma) to follow domain-specific instructions while operating under Intel GPU memory constraints (e.g., 16-48GB VRAM per card). The workflow supports both single-card and multi-card configurations, the latter via DeepSpeed ZeRO Stage 2/3.

Execution Steps

Step 1: Environment and Hardware Setup

Configure the Intel GPU runtime environment by sourcing the oneAPI toolkit variables, setting XPU-specific environment variables (ACCELERATE_USE_XPU, LOCAL_RANK, WORLD_SIZE), and verifying GPU availability. For multi-card training, initialize the distributed backend (oneCCL) and configure DeepSpeed ZeRO Stage 2 or 3 settings.

Key considerations:

Source Intel oneAPI setvars.sh before running
Set ACCELERATE_USE_XPU=true for XPU compatibility with HuggingFace Accelerate
For multi-GPU, use mpirun or deepspeed launcher with appropriate CCL settings
Verify GPU memory availability matches model size requirements
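
The environment checks above can be sketched as a small pre-flight function. This is a minimal illustration, not part of IPEX-LLM: the function name check_xpu_env and the specific validation rules are assumptions, and only the variable names come from this step. It does not check that setvars.sh was sourced or that a GPU is visible.

```python
import os

# Variable names come from the step above; the validation rules below are
# illustrative assumptions, not behaviour defined by IPEX-LLM or Accelerate.
REQUIRED_VARS = ("ACCELERATE_USE_XPU", "LOCAL_RANK", "WORLD_SIZE")


def check_xpu_env(env=None):
    """Return a list of problems found; an empty list means the checks passed."""
    env = os.environ if env is None else env
    problems = [f"{name} is not set" for name in REQUIRED_VARS if name not in env]

    flag = env.get("ACCELERATE_USE_XPU")
    if flag is not None and flag.lower() != "true":
        problems.append("ACCELERATE_USE_XPU should be 'true'")

    rank, world = env.get("LOCAL_RANK"), env.get("WORLD_SIZE")
    if rank is not None and world is not None:
        if not (rank.isdigit() and world.isdigit()):
            problems.append("LOCAL_RANK and WORLD_SIZE must be non-negative integers")
        elif int(rank) >= int(world):
            problems.append("LOCAL_RANK must be smaller than WORLD_SIZE")
    return problems


if __name__ == "__main__":
    # Example: validate a hypothetical single-card configuration.
    single_card = {"ACCELERATE_USE_XPU": "true", "LOCAL_RANK": "0", "WORLD_SIZE": "1"}
    print(check_xpu_env(single_card))
```

Calling check_xpu_env() with no argument inspects the current process environment, so it can be run at the top of a training script before any model is loaded.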

Step 2: Data Preparation

Load the training dataset (from HuggingFace Hub or local JSON/JSONL files) and format each example using a prompt template. The Alpaca prompt template wraps instruction, input, and output fields into a structured format that the model learns to follow. Tokenize the formatted prompts with the model's tokenizer, applying padding and truncation to a fixed cutoff length. Optionally split into training and validation sets.

Key considerations:

Support for multiple prompt templates (alpaca, alpaca_legacy, alpaca_short, vigogne)
Tokenizer pad token must be set (defaults to eos_token for Llama family)
Cutoff length controls maximum sequence length (default 256)
Training-on-inputs flag controls whether loss is computed on the prompt portion
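
To make the formatting and the training-on-inputs flag concrete, here is a minimal sketch. The template text follows the widely used Alpaca wording; the helper names build_prompt and mask_prompt_labels are hypothetical, and token ids are plain integer lists rather than the output of a real tokenizer. The value -100 is the label index that PyTorch's cross-entropy loss ignores by default.

```python
PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes "
    "the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n"
)
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)
IGNORE_INDEX = -100  # label value skipped by the loss


def build_prompt(example, with_output=True):
    """Format one Alpaca-style record (instruction/input/output) as text."""
    has_input = bool(example.get("input"))
    template = PROMPT_WITH_INPUT if has_input else PROMPT_NO_INPUT
    prompt = template.format(
        instruction=example["instruction"], input=example.get("input", "")
    )
    return prompt + example["output"] if with_output else prompt


def mask_prompt_labels(input_ids, prompt_len, train_on_inputs, cutoff_len=256):
    """Truncate to cutoff_len and, unless training on inputs, mask the prompt."""
    input_ids = list(input_ids[:cutoff_len])
    if train_on_inputs:
        return input_ids
    prompt_len = min(prompt_len, len(input_ids))
    return [IGNORE_INDEX] * prompt_len + input_ids[prompt_len:]


if __name__ == "__main__":
    record = {"instruction": "Add the numbers.", "input": "1 and 2", "output": "3"}
    print(build_prompt(record))
    # Pretend the first three ids belong to the prompt, the last two to the answer.
    print(mask_prompt_labels([11, 12, 13, 21, 22], prompt_len=3, train_on_inputs=False))
```

With train_on_inputs disabled, only the response tokens contribute to the loss, so the model is not rewarded for reproducing the instruction text.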

Step 3: Model Loading with 4-bit Quantization

Load the base model from HuggingFace Hub or a local checkpoint using IPEX-LLM's AutoModelForCausalLM with BitsAndBytesConfig for 4-bit NF4 quantization. This reduces the weight memory footprint by approximately 4x compared to 16-bit (float16/bfloat16) weights, enabling larger models to fit in GPU memory. Alternatively, load a previously saved low-bit optimized model for faster startup.

Key considerations:

Uses NF4 quantization type as recommended by the QLoRA paper for better quality
Compute dtype is bfloat16 for training stability
Double quantization can optionally be enabled for further memory savings
The lm_head module is excluded from quantization to preserve output quality
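
The memory savings can be estimated with simple arithmetic. The sketch below is a rough estimate under stated assumptions: the block sizes (64 weights per quantization block, 256 blocks per second-level constant) are the figures given in the QLoRA paper and may differ from what IPEX-LLM uses internally. It also ignores the unquantized lm_head, activations, and optimizer state.

```python
def weight_memory_gib(n_params, bits_per_param):
    """Approximate weight storage in GiB for a given bit width."""
    return n_params * bits_per_param / 8 / 2**30


NF4_BITS = 4
BLOCK_SIZE = 64        # weights per quantization block (QLoRA paper figure)
SECOND_LEVEL = 256     # blocks per second-level constant (QLoRA paper figure)

# Without double quantization: one 32-bit scaling constant per block.
OVERHEAD_PLAIN = 32 / BLOCK_SIZE
# With double quantization: 8-bit constants plus a 32-bit second-level constant.
OVERHEAD_DOUBLE_QUANT = 8 / BLOCK_SIZE + 32 / (BLOCK_SIZE * SECOND_LEVEL)


def nf4_memory_gib(n_params, double_quant=False):
    """Estimate NF4 weight memory including quantization constants."""
    overhead = OVERHEAD_DOUBLE_QUANT if double_quant else OVERHEAD_PLAIN
    return weight_memory_gib(n_params, NF4_BITS + overhead)


if __name__ == "__main__":
    n = 7e9  # a nominal 7B-parameter model
    print(f"16-bit weights:        {weight_memory_gib(n, 16):.2f} GiB")
    print(f"NF4:                   {nf4_memory_gib(n):.2f} GiB")
    print(f"NF4 + double quant:    {nf4_memory_gib(n, double_quant=True):.2f} GiB")
```

The per-parameter overhead of the quantization constants is why the practical reduction is somewhat below an exact 4x, and why double quantization yields a further saving.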

Step 4: LoRA Adapter Injection

Prepare the quantized model for k-bit training by freezing base weights and enabling gradient computation on adapter parameters. Configure LoRA hyperparameters (rank, alpha, dropout, target modules) and inject low-rank adapter matrices into the specified model layers. The QLoRA paper recommends targeting all linear layers (q_proj, k_proj, v_proj, o_proj, up_proj, down_proj, gate_proj) for best results.

Key considerations:

IPEX-LLM provides its own qlora-compatible get_peft_model and LoraConfig
Default LoRA rank is 8 with alpha of 16
Typically less than 1% of total parameters are trainable
Gradient checkpointing can be enabled to further reduce memory usage
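
The "less than 1%" figure can be checked by counting adapter parameters. A LoRA adapter on a linear layer with input size d_in and output size d_out adds rank * (d_in + d_out) trainable parameters. The sketch below assumes the Llama-2-7B shapes (hidden size 4096, feedforward size 11008, 32 layers, roughly 6.74 billion parameters in total) and the seven target modules listed above; the helper names are hypothetical.

```python
def lora_params(d_in, d_out, rank):
    """Trainable parameters added by one adapter: A is rank x d_in, B is d_out x rank."""
    return rank * (d_in + d_out)


# Assumed Llama-2-7B dimensions.
HIDDEN, INTERMEDIATE, LAYERS = 4096, 11008, 32
TOTAL_PARAMS = 6.74e9

# (d_in, d_out) for each targeted linear layer in one transformer block.
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, HIDDEN),
    "v_proj": (HIDDEN, HIDDEN),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def trainable_params(rank=8):
    """Total adapter parameters across all layers for the given rank."""
    per_layer = sum(lora_params(i, o, rank) for i, o in TARGET_SHAPES.values())
    return LAYERS * per_layer


if __name__ == "__main__":
    count = trainable_params(rank=8)
    print(f"trainable adapter parameters: {count:,}")
    print(f"share of base model: {100 * count / TOTAL_PARAMS:.2f}%")
    print(f"LoRA scaling (alpha / rank): {16 / 8}")
```

Doubling the rank doubles the adapter size, so even rank 64 on all linear layers stays in the low single-digit percent range for a model of this size.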

Step 5: Training Execution

Configure the HuggingFace Trainer with training arguments (batch size, learning rate, scheduler, number of epochs) and launch training. The trainer handles gradient accumulation, mixed-precision training (bf16), evaluation, checkpointing, and optional WandB logging. For distributed training, the CCL backend handles gradient synchronization across Intel GPUs.

Key considerations:

Default learning rate is 3e-5, kept low to avoid divergence, with a cosine scheduler
Gradient accumulation compensates for the small micro-batch size (default 2)
AdamW optimizer is used (paged_adamw not yet supported on XPU)
Save checkpoints every 100 steps with total limit of 100
DDP backend must be "ccl" for Intel GPU communication
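
The relationship between these settings can be sketched as follows. The global batch size of 128 and the output directory are assumptions chosen for illustration; the dictionary keys are standard HuggingFace TrainingArguments parameter names, and the cosine formula mirrors the usual half-cosine decay with optional linear warmup. This is not the workflow's actual training script.

```python
import math


def accumulation_steps(global_batch, micro_batch, world_size=1):
    """Micro-batches to accumulate per optimizer step on each device."""
    return max(global_batch // (micro_batch * world_size), 1)


def cosine_lr(step, total_steps, peak_lr=3e-5, warmup_steps=0):
    """Linear warmup followed by half-cosine decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Keyword arguments that could be passed as transformers.TrainingArguments(**...).
TRAINING_KWARGS = {
    "output_dir": "./qlora-out",          # hypothetical path
    "per_device_train_batch_size": 2,     # micro-batch size
    "gradient_accumulation_steps": accumulation_steps(128, 2),  # 128 is assumed
    "learning_rate": 3e-5,
    "lr_scheduler_type": "cosine",
    "bf16": True,
    "optim": "adamw_torch",               # plain AdamW instead of paged AdamW
    "save_strategy": "steps",
    "save_steps": 100,
    "save_total_limit": 100,
    "ddp_backend": "ccl",
}

if __name__ == "__main__":
    print(TRAINING_KWARGS["gradient_accumulation_steps"])
    for step in (0, 50, 100):
        print(step, cosine_lr(step, total_steps=100))
```

On multiple cards the accumulation count is divided by the world size, so the effective global batch size stays the same as in the single-card run.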

Step 6: Adapter Export and Model Merging

After training completes, save the LoRA adapter weights to the output directory. Optionally merge the adapter back into the base model to produce a standalone fine-tuned model that can be used without the PEFT library. The merged model retains the original architecture and can be loaded directly for inference.

Key considerations:

Adapter-only save produces small checkpoint files (typically tens to a few hundred MB, depending on LoRA rank and target modules)
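Merging requires reloading the base model without 4-bit quantization (for example in float16), because the adapter updates are added to the unquantized weights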
Merged model can be further quantized for efficient inference
SafeTensors format is not yet supported; uses PyTorch format
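
Merging folds each adapter into its base weight as W_merged = W + (alpha / rank) * B A, after which the adapter matrices are no longer needed at inference time. The sketch below demonstrates this on tiny list-based matrices using only the standard library; the function names are hypothetical, and a real merge would use the PEFT tooling on full-size tensors.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_a, lora_b, alpha, rank):
    """Return W + (alpha / rank) * B @ A without modifying the inputs.

    weight: d_out x d_in, lora_a: rank x d_in, lora_b: d_out x rank.
    """
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


if __name__ == "__main__":
    W = [[1, 0], [0, 1]]   # frozen base weight
    A = [[1, 2]]           # rank-1 adapter, shape 1 x 2
    B = [[1], [0]]         # shape 2 x 1
    merged = merge_lora(W, A, B, alpha=2, rank=1)
    print(merged)

    # The merged weight gives the same output as base plus scaled adapter path.
    x = [[1], [1]]
    base_out = matmul(W, x)
    adapter_out = matmul(B, matmul(A, x))
    combined = [[b[0] + 2 * a[0]] for b, a in zip(base_out, adapter_out)]
    print(matmul(merged, x) == combined)
```

Because the update is added to the weight matrix itself, the merged model keeps the original architecture, which is why it can afterwards be quantized again for inference.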
