Workflow:Intel Ipex llm QLoRA Finetuning
| Knowledge Sources | |
|---|---|
| Domains | LLMs, Fine_Tuning, Quantization |
| Last Updated | 2026-02-09 04:00 GMT |
Overview
End-to-end process for parameter-efficient fine-tuning of Large Language Models on Intel GPUs using QLoRA (Quantized Low-Rank Adaptation) with IPEX-LLM.
Description
This workflow covers the complete QLoRA fine-tuning pipeline on Intel XPU hardware. The base model is compressed with 4-bit NormalFloat (NF4) quantization, configured through a BitsAndBytesConfig, and trainable low-rank adapter matrices are then injected into the attention and feed-forward layers of the frozen model. Only the small adapter weights are trained, which sharply reduces memory requirements and makes it feasible to fine-tune 7B-70B parameter models on Intel Arc, Flex, and Max GPUs. The process covers environment setup, data formatting with prompt templates, model quantization and loading, LoRA adapter injection, training (optionally distributed with DeepSpeed), and adapter export and merging.
Usage
Execute this workflow when you have an instruction-tuning dataset (such as Alpaca format with instruction/input/output fields) and need to adapt a base LLM (Llama-2, Llama-3, ChatGLM, Qwen, Baichuan, Gemma) to follow domain-specific instructions, while operating under Intel GPU memory constraints (e.g., 16-48GB VRAM per card). Supports single-card and multi-card configurations via DeepSpeed ZeRO Stage 2/3.
Execution Steps
Step 1: Environment and Hardware Setup
Configure the Intel GPU runtime environment by sourcing the oneAPI toolkit variables, setting the environment variables used for XPU training (ACCELERATE_USE_XPU, plus LOCAL_RANK and WORLD_SIZE for distributed runs), and verifying GPU availability. For multi-card training, initialize the distributed backend (oneCCL) and configure DeepSpeed ZeRO Stage 2 or 3 settings.
Key considerations:
- Source Intel oneAPI setvars.sh before running
- Set ACCELERATE_USE_XPU=true for XPU compatibility with HuggingFace Accelerate
- For multi-GPU, use mpirun or deepspeed launcher with appropriate CCL settings
- Verify GPU memory availability matches model size requirements
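The checks above can be sketched as a small preflight script. This is a minimal sketch, assuming the launcher (mpirun or deepspeed) exports LOCAL_RANK and WORLD_SIZE. The helper names `check_xpu_env`, `xpu_available`, and `zero_config` are hypothetical, and the ZeRO values are illustrative settings built from standard DeepSpeed config keys rather than the configuration shipped with IPEX-LLM.

```python
import json
import os


def check_xpu_env(env=None, multi_card=False):
    """Return a list of problems found in the environment (empty means OK)."""
    env = os.environ if env is None else env
    problems = []
    if str(env.get("ACCELERATE_USE_XPU", "")).lower() != "true":
        problems.append("ACCELERATE_USE_XPU should be set to 'true'")
    if multi_card:
        # These are normally exported by the launcher, not set by hand.
        for name in ("LOCAL_RANK", "WORLD_SIZE"):
            if name not in env:
                problems.append(f"{name} is not set; launch via mpirun or deepspeed")
    return problems


def xpu_available():
    """Best-effort GPU check; returns False if PyTorch or its XPU backend is missing."""
    try:
        import torch
    except ImportError:
        return False
    return hasattr(torch, "xpu") and torch.xpu.is_available()


def zero_config(stage=2):
    """Build a small DeepSpeed config dict for ZeRO Stage 2 or 3 (illustrative values)."""
    if stage not in (2, 3):
        raise ValueError("this workflow uses ZeRO Stage 2 or 3")
    return {
        "zero_optimization": {"stage": stage},
        "bf16": {"enabled": True},
        # "auto" lets the HuggingFace Trainer integration fill these in.
        "train_micro_batch_size_per_gpu": "auto",
        "gradient_accumulation_steps": "auto",
    }


if __name__ == "__main__":
    multi = int(os.environ.get("WORLD_SIZE", "1")) > 1
    for problem in check_xpu_env(multi_card=multi):
        print("WARNING:", problem)
    print(json.dumps(zero_config(stage=2), indent=2))
```

Run it after sourcing setvars.sh; calling `xpu_available()` in the same session is a quick way to confirm that the GPU is visible before loading a model.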
Step 2: Data Preparation
Load the training dataset (from HuggingFace Hub or local JSON/JSONL files) and format each example using a prompt template. The Alpaca prompt template wraps instruction, input, and output fields into a structured format that the model learns to follow. Tokenize the formatted prompts with the model's tokenizer, applying padding and truncation to a fixed cutoff length. Optionally split into training and validation sets.
Key considerations:
- Support for multiple prompt templates (alpaca, alpaca_legacy, alpaca_short, vigogne)
- The tokenizer's pad token must be set; Llama-family tokenizers ship without one, so it defaults to eos_token
- Cutoff length controls maximum sequence length (default 256)
- Training-on-inputs flag controls whether loss is computed on the prompt portion
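The prompt formatting and the training-on-inputs behaviour can be illustrated without any model dependencies. The sketch below uses the standard Alpaca template text. The function names `build_prompt` and `build_labels` are hypothetical, and token IDs are passed in as plain lists, so a real pipeline would obtain them from the model's tokenizer with truncation to the cutoff length.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss

ALPACA_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes "
    "the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n"
)
ALPACA_NO_INPUT = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)


def build_prompt(example, with_output=True):
    """Format one Alpaca-style record; omit the output to get the prompt only."""
    template = ALPACA_WITH_INPUT if example.get("input") else ALPACA_NO_INPUT
    prompt = template.format(
        instruction=example["instruction"], input=example.get("input", "")
    )
    return prompt + example["output"] if with_output else prompt


def build_labels(input_ids, prompt_len, train_on_inputs=False):
    """Copy input_ids to labels, masking prompt tokens unless train_on_inputs is set."""
    if train_on_inputs:
        return list(input_ids)
    n = min(prompt_len, len(input_ids))  # the prompt may have been truncated
    return [IGNORE_INDEX] * n + list(input_ids[n:])


example = {
    "instruction": "Translate to French.",
    "input": "Good morning",
    "output": "Bonjour",
}
full_text = build_prompt(example)
prompt_only = build_prompt(example, with_output=False)
```

Tokenizing `prompt_only` separately gives the `prompt_len` needed by `build_labels`, which is how the loss is restricted to the response tokens when training on inputs is disabled.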
Step 3: Model Loading with 4-bit Quantization
Load the base model from HuggingFace Hub or a local checkpoint using IPEX-LLM's AutoModelForCausalLM with BitsAndBytesConfig for 4-bit NF4 quantization. This reduces the weight memory footprint by approximately 4x compared to 16-bit (fp16/bf16) weights, enabling larger models to fit in GPU memory. Alternatively, load a previously saved low-bit optimized model for faster startup.
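The size of the saving follows directly from the number of bits stored per weight. The arithmetic below is a rough estimate only: it assumes a 7-billion-parameter model and ignores per-block quantization constants, activations, gradients, and optimizer state. The function name `weight_memory_gib` is hypothetical.

```python
def weight_memory_gib(n_params, bits_per_weight):
    """Memory needed to store the weights alone, in GiB."""
    return n_params * bits_per_weight / 8 / 2**30


N_PARAMS = 7e9  # assumed model size
bf16_gib = weight_memory_gib(N_PARAMS, 16)
nf4_gib = weight_memory_gib(N_PARAMS, 4)

print(f"16-bit weights: {bf16_gib:.2f} GiB")
print(f"NF4 weights:    {nf4_gib:.2f} GiB")
print(f"reduction:      {bf16_gib / nf4_gib:.1f}x")
```

Under these assumptions the weights shrink from about 13 GiB to about 3.3 GiB, which is what leaves room for adapters and activations on a 16 GB card.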
Key considerations:
- Uses NF4 quantization type as recommended by the QLoRA paper for better quality
- Compute dtype is bfloat16 for training stability
- Double quantization can optionally be enabled for further memory savings
- The lm_head module is excluded from quantization to preserve output quality
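A loading routine along these lines might look as follows. This is a sketch under assumptions, not the definitive IPEX-LLM recipe: the import path `ipex_llm.transformers` and the acceptance of a `quantization_config` argument are assumed from the description above, and the helper names `nf4_quant_settings` and `load_nf4_model` are hypothetical. Heavy imports are deferred into the function so the settings can be inspected without a GPU.

```python
def nf4_quant_settings(double_quant=False):
    """Keyword arguments for BitsAndBytesConfig (compute dtype is added at load time)."""
    return {
        "load_in_4bit": True,
        "bnb_4bit_quant_type": "nf4",
        "bnb_4bit_use_double_quant": double_quant,
    }


def load_nf4_model(base_model, device="xpu", double_quant=False):
    """Load base_model with NF4 weights and move it to the Intel GPU."""
    import torch
    from transformers import BitsAndBytesConfig
    from ipex_llm.transformers import AutoModelForCausalLM  # assumed import path

    bnb_config = BitsAndBytesConfig(
        bnb_4bit_compute_dtype=torch.bfloat16,  # bf16 compute for training stability
        **nf4_quant_settings(double_quant),
    )
    model = AutoModelForCausalLM.from_pretrained(
        base_model, quantization_config=bnb_config
    )
    return model.to(device)
```

In a multi-card run the device would typically be derived from LOCAL_RANK, for example `f"xpu:{os.environ.get('LOCAL_RANK', 0)}"`.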
Step 4: LoRA Adapter Injection
Prepare the quantized model for k-bit training, which freezes the base weights so that only the adapter parameters added afterwards receive gradients. Then configure the LoRA hyperparameters (rank, alpha, dropout, target modules) and inject low-rank adapter matrices into the specified model layers. The QLoRA paper recommends targeting all linear layers (q_proj, k_proj, v_proj, o_proj, up_proj, down_proj, gate_proj) for best results.
Key considerations:
- IPEX-LLM provides its own qlora-compatible get_peft_model and LoraConfig
- Default LoRA rank is 8 with alpha of 16
- Typically less than 1% of total parameters are trainable
- Gradient checkpointing can be enabled to further reduce memory usage
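The "less than 1%" figure can be checked with a short calculation, followed by a sketch of the injection call. The layer shapes are assumptions matching a Llama-2-7B-like model (hidden size 4096, MLP size 11008, 32 layers, roughly 6.74B parameters). The import path `ipex_llm.transformers.qlora`, the dropout value of 0.05, and the helper names `lora_param_count` and `attach_lora` are likewise assumptions for illustration.

```python
TARGET_MODULES = [
    "q_proj", "k_proj", "v_proj", "o_proj", "up_proj", "down_proj", "gate_proj",
]

# Assumed (d_in, d_out) shapes for one decoder layer of a Llama-2-7B-like model.
SHAPES = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 4096),
    "v_proj": (4096, 4096),
    "o_proj": (4096, 4096),
    "gate_proj": (4096, 11008),
    "up_proj": (4096, 11008),
    "down_proj": (11008, 4096),
}


def lora_param_count(layer_shapes, rank, n_layers):
    """Each adapted d_in x d_out linear layer adds rank * (d_in + d_out) weights."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in layer_shapes.values())
    return per_layer * n_layers


trainable = lora_param_count(SHAPES, rank=8, n_layers=32)
fraction = trainable / 6.74e9  # approximate total parameter count (assumption)
print(f"trainable adapter weights: {trainable:,} ({fraction:.2%} of the model)")


def attach_lora(model, rank=8, alpha=16, dropout=0.05, gradient_checkpointing=True):
    """Freeze the quantized base model and inject LoRA adapters (sketch)."""
    from ipex_llm.transformers.qlora import (  # assumed import path
        LoraConfig,
        get_peft_model,
        prepare_model_for_kbit_training,
    )

    model = prepare_model_for_kbit_training(
        model, use_gradient_checkpointing=gradient_checkpointing
    )
    config = LoraConfig(
        r=rank,
        lora_alpha=alpha,
        lora_dropout=dropout,
        target_modules=TARGET_MODULES,
        bias="none",
        task_type="CAUSAL_LM",
    )
    return get_peft_model(model, config)
```

With rank 8 on all seven projection types, the adapters come to about 20 million weights, or roughly 0.3% of the assumed model size.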
Step 5: Training Execution
Configure the HuggingFace Trainer with training arguments (batch size, learning rate, scheduler, number of epochs) and launch training. The trainer handles gradient accumulation, mixed-precision training (bf16), evaluation, checkpointing, and optional WandB logging. For distributed training, the CCL backend handles gradient synchronization across Intel GPUs.
Key considerations:
- The default learning rate is 3e-5, kept low to avoid divergence, with a cosine scheduler
- Gradient accumulation compensates for small micro-batch sizes (default 2)
- AdamW optimizer is used (paged_adamw not yet supported on XPU)
- Save checkpoints every 100 steps with total limit of 100
- DDP backend must be "ccl" for Intel GPU communication
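The settings listed above map onto HuggingFace `TrainingArguments` roughly as follows. This is a hedged sketch: the global batch size of 128 and the choice of `adamw_torch` as the AdamW variant are assumptions, and the helper name `training_kwargs` is hypothetical. The dictionary is built without importing transformers so the accumulation arithmetic can be checked on its own.

```python
def training_kwargs(output_dir, global_batch_size=128, micro_batch_size=2, world_size=1):
    """Build keyword arguments for transformers.TrainingArguments."""
    per_step = micro_batch_size * world_size
    if global_batch_size % per_step != 0:
        raise ValueError("global batch size must be divisible by micro batch x world size")
    kwargs = {
        "output_dir": output_dir,
        "per_device_train_batch_size": micro_batch_size,
        # Accumulate gradients until the effective batch reaches global_batch_size.
        "gradient_accumulation_steps": global_batch_size // per_step,
        "learning_rate": 3e-5,
        "lr_scheduler_type": "cosine",
        "bf16": True,
        "optim": "adamw_torch",  # plain AdamW; paged variants are not used on XPU
        "save_strategy": "steps",
        "save_steps": 100,
        "save_total_limit": 100,
        "save_safetensors": False,  # checkpoints are written in PyTorch format
    }
    if world_size > 1:
        kwargs["ddp_backend"] = "ccl"  # oneCCL handles gradient synchronization
    return kwargs


def make_training_arguments(output_dir, **overrides):
    """Instantiate TrainingArguments lazily so the import cost is paid only when training."""
    from transformers import TrainingArguments

    return TrainingArguments(**training_kwargs(output_dir, **overrides))
```

For example, `training_kwargs("./out", world_size=2)` halves the accumulation steps per card relative to a single-card run while keeping the effective batch size unchanged.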
Step 6: Adapter Export and Model Merging
After training completes, save the LoRA adapter weights to the output directory. Optionally merge the adapter back into the base model to produce a standalone fine-tuned model that can be used without the PEFT library. The merged model retains the original architecture and can be loaded directly for inference.
Key considerations:
- Adapter-only save produces small checkpoint files (typically a few hundred MB)
- Merging requires reloading the base model in unquantized form (for example float16), because the adapter update is added to the original weights
- Merged model can be further quantized for efficient inference
- SafeTensors format is not yet supported; uses PyTorch format
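Merging folds each adapter pair into its base weight as W + (alpha / r) * B A, after which the adapter is no longer needed. The sketch below demonstrates that arithmetic on a tiny matrix in pure Python, then outlines a merge using the stock `peft` and `transformers` APIs. The latter is an assumption standing in for whatever export script the IPEX-LLM examples provide, and the names `merge_lora` and `merge_and_save` are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_a, lora_b, alpha, rank):
    """Return W + (alpha / rank) * B @ A for one linear layer.

    Shapes: weight is (out, in), lora_a is (rank, in), lora_b is (out, rank).
    """
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


# Toy layer: 2x2 identity weight with a rank-1 adapter.
W = [[1, 0], [0, 1]]
A = [[1, 2]]
B = [[0.5], [0.0]]
merged = merge_lora(W, A, B, alpha=2, rank=1)


def merge_and_save(base_model, adapter_dir, output_dir):
    """Merge a saved LoRA adapter into an unquantized base model (sketch)."""
    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    base = AutoModelForCausalLM.from_pretrained(base_model, torch_dtype=torch.float16)
    model = PeftModel.from_pretrained(base, adapter_dir).merge_and_unload()
    model.save_pretrained(output_dir)
    AutoTokenizer.from_pretrained(base_model).save_pretrained(output_dir)
```

Because the merged layer is an ordinary weight matrix, the resulting model loads without the PEFT library and can be quantized again for inference.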