Workflow:Intel Ipex llm LoRA Finetuning

Knowledge Sources	IPEX-LLM IPEX-LLM Finetune Guide
Domains	LLMs, Fine_Tuning
Last Updated	2026-02-09 04:00 GMT

Overview

End-to-end process for LoRA (Low-Rank Adaptation) fine-tuning of large language models on Intel GPUs using IPEX-LLM, with the base model kept unquantized in bf16 and optional DeepSpeed ZeRO Stage 3 for multi-GPU sharding.

Description

This workflow covers LoRA fine-tuning without base-model quantization. Unlike QLoRA, which loads the base model in 4-bit, this workflow loads the base model in bf16 precision and applies LoRA adapters for parameter-efficient training. The approach trades higher memory usage for potentially better training quality, so it is suitable when sufficient GPU memory is available (e.g., Intel Max/PVC GPUs or multi-card Arc setups). It supports both single-GPU and multi-GPU configurations; in the multi-GPU case, DeepSpeed ZeRO Stage 3 shards the model across cards.

Usage

Execute this workflow when you have enough Intel GPU memory to hold the base model in bf16 precision (approximately 14GB of weights for a 7B model, at 2 bytes per parameter) and prefer training quality over the memory savings of a quantized base model. It is particularly suited to Intel Data Center GPU Max cards (the Max 1100 has 48GB of HBM; the Max 1550 has 128GB) or to multi-card Intel Arc setups with DeepSpeed ZeRO Stage 3 enabled.
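The 14GB figure follows from simple arithmetic, which can be sketched as below. The function name is hypothetical, and the estimate covers the base-model weights only; activations, gradients, adapter optimizer state, and quantization metadata are not included, so real usage is higher.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Memory for the model weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# A 7B-parameter base model in bf16 (16 bits) versus 4-bit (QLoRA-style).
bf16_gb = weight_memory_gb(7e9, 16)
int4_gb = weight_memory_gb(7e9, 4)
print(f"bf16 weights: {bf16_gb:.1f} GB, 4-bit weights: {int4_gb:.1f} GB")
```

The roughly fourfold gap between the two numbers is the memory cost this workflow accepts in exchange for an unquantized base model.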

Execution Steps

Step 1: Environment and Hardware Setup

Configure the Intel GPU runtime environment including oneAPI toolkit variables, XPU environment settings, and optional DeepSpeed ZeRO Stage 3 configuration. For multi-card training, prepare the CCL distributed backend and mpirun launch configuration with appropriate per-card memory settings.

Key considerations:

LoRA mode requires more memory than QLoRA since the base model is in bf16
DeepSpeed ZeRO Stage 3 shards model parameters, gradients, and optimizer states across GPUs
For single-card use on Arc A770 (16GB), only smaller models (up to ~3B) fit comfortably
PVC cards with 48GB HBM (Intel Data Center GPU Max 1100) can handle 7B models in single-card LoRA mode; the Max 1550 has 128GB
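A minimal sketch of a DeepSpeed ZeRO Stage 3 configuration for this step is shown below. The keys are standard DeepSpeed options, but the specific values and the helper name are assumptions for illustration, not the configuration file shipped with the workflow. The "auto" values are only resolved when the file is consumed through the HuggingFace Trainer integration.

```python
import json


def build_zero3_config() -> dict:
    """Illustrative ZeRO Stage 3 config; values are assumptions, not the repo's file."""
    return {
        "zero_optimization": {
            "stage": 3,  # shard parameters, gradients, and optimizer states
            "contiguous_gradients": True,
            "overlap_comm": True,
        },
        "bf16": {"enabled": True},
        # "auto" lets the HuggingFace Trainer fill these in from its own arguments.
        "train_micro_batch_size_per_gpu": "auto",
        "gradient_accumulation_steps": "auto",
    }


def write_zero3_config(path: str) -> None:
    """Write the config to disk so it can be passed to the trainer as a file path."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(build_zero3_config(), handle, indent=2)
```

The resulting JSON file would then be referenced from the training arguments when multi-card sharding is wanted.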

Step 2: Data Preparation

Load and format the training dataset using prompt templates. The data preparation follows the same Alpaca-style formatting as QLoRA, with instruction/input/output fields mapped to a structured prompt template. Tokenization and train/validation splitting are handled identically.

Key considerations:

Reuses the same common utilities (Prompter, get_train_val_data) as QLoRA
Identical prompt template support (alpaca, alpaca_legacy, etc.)
Same tokenizer padding and cutoff length configuration
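The formatting and splitting described above can be sketched in plain Python. This is a stand-in that uses the widely known Alpaca template text; it is not the repository's Prompter or get_train_val_data code, and the function names and the seed default are hypothetical.

```python
import random

ALPACA_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes "
    "the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n"
)
ALPACA_NO_INPUT = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)


def build_prompt(record: dict, include_output: bool = True) -> str:
    """Map instruction/input/output fields onto the Alpaca-style template."""
    if record.get("input"):
        prompt = ALPACA_WITH_INPUT.format(
            instruction=record["instruction"], input=record["input"]
        )
    else:
        prompt = ALPACA_NO_INPUT.format(instruction=record["instruction"])
    if include_output:
        prompt += record.get("output", "")
    return prompt


def split_train_val(records: list, val_size: int, seed: int = 42):
    """Shuffle a copy of the records and hold out val_size of them for validation."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    split_at = len(shuffled) - val_size
    return shuffled[:split_at], shuffled[split_at:]
```

Tokenization with padding and a cutoff length would follow, applied to the string returned by build_prompt.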

Step 3: Model Loading in bf16 Precision

Load the base model using IPEX-LLM's AutoModelForCausalLM with load_in_low_bit="bf16" and optimize_model=False. This keeps the base model weights in bfloat16 precision without quantization. Move the model to the target XPU device (unless using DeepSpeed ZeRO Stage 3, which handles device placement).

Key considerations:

Uses load_in_low_bit="bf16" instead of BitsAndBytesConfig quantization
The lm_head module is excluded from conversion
Model is explicitly moved to xpu device for non-ZeRO3 configurations
Alternatively, load a previously saved low-bit model via load_low_bit()
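A hedged sketch of the loading step follows. The load_in_low_bit and optimize_model arguments come from the description above; the modules_to_not_convert and torch_dtype arguments, and the helper names, are assumptions about how the lm_head exclusion and bf16 dtype would be expressed. Imports are deferred so the module can be read and tested on a machine without IPEX-LLM installed.

```python
def bf16_load_kwargs() -> dict:
    """Keyword arguments for an unquantized bf16 load (assumed spelling of the options)."""
    return {
        "load_in_low_bit": "bf16",  # keep weights in bfloat16, no 4-bit quantization
        "optimize_model": False,
        "modules_to_not_convert": ["lm_head"],  # assumption: how lm_head is excluded
    }


def load_bf16_base_model(base_model: str, use_zero3: bool = False):
    """Load the base model in bf16 and move it to the XPU unless ZeRO-3 places it."""
    # Deferred imports: these require an Intel XPU software stack at call time.
    import torch
    from ipex_llm.transformers import AutoModelForCausalLM

    model = AutoModelForCausalLM.from_pretrained(
        base_model, torch_dtype=torch.bfloat16, **bf16_load_kwargs()
    )
    if not use_zero3:
        # DeepSpeed ZeRO Stage 3 handles device placement itself.
        model = model.to("xpu")
    return model
```

Calling load_bf16_base_model with a model path or hub identifier returns a model ready for adapter injection.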

Step 4: LoRA Adapter Configuration

Prepare the model for training with IPEX-LLM's prepare_model_for_kbit_training and inject LoRA adapters via get_peft_model. The LoRA configuration targets the attention and MLP projection layers with training_mode="lora" (not "qlora"). The adapter parameters are the only trainable weights in the model; the bf16 base weights stay frozen.

Key considerations:

Uses training_mode="lora" in LoraConfig (distinct from QLoRA's training_mode="qlora")
Same target modules as QLoRA (q_proj, k_proj, v_proj, o_proj, up_proj, down_proj, gate_proj)
IPEX-LLM's qlora module provides get_peft_model for both LoRA and QLoRA modes
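The adapter setup can be sketched as below. The target modules and training_mode value are taken from the list above; the rank, alpha, and dropout values, the bias and task_type settings, and the helper names are illustrative assumptions rather than the workflow's confirmed defaults. Imports are deferred because they need IPEX-LLM at call time.

```python
LORA_TARGET_MODULES = [
    "q_proj", "k_proj", "v_proj", "o_proj",  # attention projections
    "up_proj", "down_proj", "gate_proj",     # MLP projections
]


def lora_config_kwargs(r: int = 8, lora_alpha: int = 16, lora_dropout: float = 0.05) -> dict:
    """LoraConfig arguments; r, lora_alpha, and lora_dropout are assumed example values."""
    return {
        "r": r,
        "lora_alpha": lora_alpha,
        "lora_dropout": lora_dropout,
        "target_modules": list(LORA_TARGET_MODULES),
        "bias": "none",
        "task_type": "CAUSAL_LM",
        "training_mode": "lora",  # distinct from QLoRA's "qlora"
    }


def attach_lora_adapters(model):
    """Prepare the model for training and wrap it with LoRA adapters."""
    from ipex_llm.transformers.qlora import (
        LoraConfig,
        get_peft_model,
        prepare_model_for_kbit_training,
    )

    model = prepare_model_for_kbit_training(model)
    model = get_peft_model(model, LoraConfig(**lora_config_kwargs()))
    model.print_trainable_parameters()  # confirms only adapter weights are trainable
    return model
```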

Step 5: Training Execution

Launch training using HuggingFace Trainer with cosine learning rate schedule, bf16 mixed precision, and AdamW optimizer. For multi-card setups, the CCL backend handles gradient synchronization. DeepSpeed ZeRO Stage 3 can be enabled for memory-efficient distributed training.

Key considerations:

Same training hyperparameters as QLoRA (lr=3e-5, cosine schedule, max_grad_norm=0.3)
DDP backend is "ccl" for Intel XPU communication
Save checkpoints conditionally based on save_checkpoint flag
Supports resume_from_checkpoint for interrupted training
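The training settings listed above map onto HuggingFace TrainingArguments roughly as follows. The hyperparameter values are those stated in this step; the helper names, the "adamw_torch" optimizer string, and the mapping of save_checkpoint onto save_strategy are assumptions. Batch size, epoch count, and logging options are left at library defaults here.

```python
def training_kwargs(output_dir, save_checkpoint=True, deepspeed_config=None, multi_card=False):
    """Build TrainingArguments keyword arguments for the LoRA run."""
    kwargs = {
        "output_dir": output_dir,
        "learning_rate": 3e-5,
        "lr_scheduler_type": "cosine",
        "max_grad_norm": 0.3,
        "bf16": True,
        "optim": "adamw_torch",  # assumption: PyTorch AdamW implementation
        "save_strategy": "steps" if save_checkpoint else "no",
    }
    if multi_card:
        kwargs["ddp_backend"] = "ccl"  # oneCCL backend for Intel XPU communication
    if deepspeed_config:
        kwargs["deepspeed"] = deepspeed_config  # path to a ZeRO Stage 3 JSON file
    return kwargs


def run_training(model, train_data, val_data, data_collator, output_dir,
                 resume_from_checkpoint=None, **options):
    """Run the Trainer; pass a checkpoint path to continue an interrupted run."""
    import transformers  # deferred so the helper above stays importable anywhere

    args = transformers.TrainingArguments(**training_kwargs(output_dir, **options))
    trainer = transformers.Trainer(
        model=model,
        args=args,
        train_dataset=train_data,
        eval_dataset=val_data,
        data_collator=data_collator,
    )
    trainer.train(resume_from_checkpoint=resume_from_checkpoint)
    return trainer
```

Multi-card runs would additionally be launched through mpirun, as described in Step 1.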

Step 6: Adapter Export and Model Merging

Save the trained LoRA adapter weights. Optionally merge the adapter into the base model using the export_merged_model utility to produce a standalone fine-tuned model.

Key considerations:

Adapter export is identical to the QLoRA workflow
Merged model can be loaded directly without PEFT library dependency
Uses PyTorch save format (SafeTensors not yet supported)
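To make the merge concrete, the arithmetic behind folding a LoRA adapter into a base weight is W' = W + (lora_alpha / r) * B * A. The pure-Python sketch below illustrates that formula on a tiny matrix; it is not the implementation of export_merged_model, and the function names are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def merge_lora(weight, lora_a, lora_b, lora_alpha, r):
    """Return W + (lora_alpha / r) * (B @ A) for one linear layer."""
    scale = lora_alpha / r
    delta = matmul(lora_b, lora_a)  # shape (out_features, in_features)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


# Tiny example: a 2x2 identity weight with a rank-1 adapter.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 2.0]]        # shape (r, in_features)
B = [[0.5], [-1.0]]     # shape (out_features, r)
merged = merge_lora(W, A, B, lora_alpha=2, r=1)
print(merged)  # [[2.0, 2.0], [-2.0, -3.0]]
```

After merging, the layer is an ordinary dense weight, which is why the merged model no longer needs the adapter files at load time.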

GitHub URL

Workflow Repository