Workflow:Intel Ipex llm LoRA Finetuning
| Knowledge Sources | |
|---|---|
| Domains | LLMs, Fine_Tuning |
| Last Updated | 2026-02-09 04:00 GMT |
Overview
End-to-end process for non-quantized bf16 LoRA (Low-Rank Adaptation) fine-tuning of large language models on Intel GPUs using IPEX-LLM, with optional DeepSpeed ZeRO Stage 3.
Description
This workflow covers LoRA fine-tuning without base-model quantization. Unlike QLoRA, which loads the base model in 4-bit, this workflow loads the base model in bf16 precision and applies LoRA adapters for parameter-efficient training. The approach trades higher memory usage for potentially better training quality and is suitable when sufficient GPU memory is available (e.g., Intel Data Center GPU Max (PVC) cards or multi-card Arc setups). The workflow supports single-GPU and multi-GPU configurations, with DeepSpeed ZeRO Stage 3 available to shard the model across cards.
Usage
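Execute this workflow when you have enough Intel GPU memory to hold the base model in bf16 precision (approximately 14GB of weights for a 7B model, before activations, gradients, and optimizer state) and want to fine-tune with LoRA adapters without quantizing the base model, for potentially better training quality. It is particularly suited to Intel Data Center GPU Max cards with 48GB HBM (the Max 1100 configuration; the Max 1550 carries 128GB) or multi-card Intel Arc setups with DeepSpeed ZeRO Stage 3 enabled.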
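The 14GB figure follows from two bytes per bf16 parameter. A minimal sketch of that arithmetic is below; the helper name bf16_weight_gb is hypothetical, and the result counts weights only, not activations, LoRA gradients, or optimizer state.

```python
def bf16_weight_gb(n_params: float) -> float:
    """Return the memory for the weights alone, in decimal GB.

    bf16 stores each parameter in 2 bytes, so a 7B-parameter model
    needs roughly 7e9 * 2 bytes = 14 GB before any training overhead.
    """
    bytes_per_param = 2
    return n_params * bytes_per_param / 1e9


if __name__ == "__main__":
    # Illustrative model sizes; the parameter counts are round numbers.
    for name, n_params in [("3B", 3e9), ("7B", 7e9), ("13B", 13e9)]:
        print(f"{name}: about {bf16_weight_gb(n_params):.0f} GB of bf16 weights")
```

Treat the result as a lower bound when deciding whether a model fits on a given card.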
Execution Steps
Step 1: Environment and Hardware Setup
Configure the Intel GPU runtime environment including oneAPI toolkit variables, XPU environment settings, and optional DeepSpeed ZeRO Stage 3 configuration. For multi-card training, prepare the CCL distributed backend and mpirun launch configuration with appropriate per-card memory settings.
Key considerations:
- LoRA mode requires more memory than QLoRA since the base model is in bf16
- DeepSpeed ZeRO Stage 3 shards model parameters, gradients, and optimizer states across GPUs
- For single-card use on Arc A770 (16GB), only smaller models (up to ~3B) fit comfortably
- A Data Center GPU Max (PVC) card with 48GB HBM (the Max 1100 configuration) can handle 7B models in single-card LoRA mode
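For the optional DeepSpeed path, a ZeRO Stage 3 configuration can be sketched as a plain dictionary written out to JSON. The keys below are standard DeepSpeed configuration fields, but the specific values and the function name zero3_config are assumptions for illustration, not the configuration file shipped with the IPEX-LLM examples.

```python
import json


def zero3_config() -> dict:
    """Build an illustrative DeepSpeed ZeRO Stage 3 config (values are assumptions)."""
    return {
        "zero_optimization": {
            # Stage 3 shards parameters, gradients, and optimizer states.
            "stage": 3,
            "contiguous_gradients": True,
            "overlap_comm": True,
        },
        # Match the bf16 base-model precision used by this workflow.
        "bf16": {"enabled": True},
        # "auto" lets the HuggingFace Trainer integration fill these in.
        "train_micro_batch_size_per_gpu": "auto",
        "gradient_accumulation_steps": "auto",
    }


def per_card_weight_gb(n_params: float, n_cards: int) -> float:
    """bf16 weight memory per card when parameters are sharded evenly."""
    return n_params * 2 / n_cards / 1e9


if __name__ == "__main__":
    print(json.dumps(zero3_config(), indent=2))
    print(f"7B over 2 cards: {per_card_weight_gb(7e9, 2):.0f} GB of weights per card")
```

The per-card figure shows why sharding helps on 16GB Arc cards: the weights are divided across cards instead of being replicated on each one.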
Step 2: Data Preparation
Load and format the training dataset using prompt templates. The data preparation follows the same Alpaca-style formatting as QLoRA, with instruction/input/output fields mapped to a structured prompt template. Tokenization and train/validation splitting are handled identically.
Key considerations:
- Reuses the same common utilities (Prompter, get_train_val_data) as QLoRA
- Identical prompt template support (alpaca, alpaca_legacy, etc.)
- Same tokenizer padding and cutoff length configuration
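The Alpaca-style mapping described above can be sketched without the shared utilities. The code below is a standalone approximation: build_prompt and split_train_val are hypothetical helpers, not the actual Prompter or get_train_val_data implementations, and the template text follows the widely used Alpaca wording.

```python
import random

ALPACA_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes "
    "the request.\n\n### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n"
    "### Response:\n"
)
ALPACA_NO_INPUT = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n\n### Instruction:\n{instruction}\n\n"
    "### Response:\n"
)


def build_prompt(record: dict) -> str:
    """Map instruction/input/output fields onto the Alpaca template."""
    if record.get("input"):
        prompt = ALPACA_WITH_INPUT.format(
            instruction=record["instruction"], input=record["input"]
        )
    else:
        prompt = ALPACA_NO_INPUT.format(instruction=record["instruction"])
    # The expected answer is appended so the model learns to produce it.
    return prompt + record.get("output", "")


def split_train_val(records: list, val_size: int, seed: int = 42):
    """Shuffle deterministically and hold out val_size records for validation."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    return shuffled[val_size:], shuffled[:val_size]
```

In the real workflow the resulting text is then tokenized with the configured padding and cutoff length.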
Step 3: Model Loading in bf16 Precision
Load the base model using IPEX-LLM's AutoModelForCausalLM with load_in_low_bit="bf16" and optimize_model=False. This keeps the base model weights in bfloat16 precision without quantization. Move the model to the target XPU device (unless using DeepSpeed ZeRO Stage 3, which handles device placement).
Key considerations:
- Uses load_in_low_bit="bf16" instead of BitsAndBytesConfig quantization
- The lm_head module is excluded from conversion
- Model is explicitly moved to xpu device for non-ZeRO3 configurations
- Alternatively, load a previously saved low-bit model via load_low_bit()
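The loading step might look like the sketch below. It assumes an environment with torch and ipex_llm installed and an XPU device available. The keyword names modules_to_not_convert and torch_dtype are assumptions about the installed IPEX-LLM version, and both helper functions are hypothetical. The imports are deferred so the module can be inspected on a machine without the Intel stack.

```python
def bf16_load_kwargs() -> dict:
    """Keyword arguments that keep the base model in bf16 without quantization."""
    return {
        "load_in_low_bit": "bf16",   # bf16 weights instead of a 4-bit format
        "optimize_model": False,     # skip inference-oriented optimizations
        "modules_to_not_convert": ["lm_head"],  # assumed name for the exclusion list
    }


def load_base_model(base_model: str, use_zero3: bool = False):
    """Load the base model; requires torch and ipex_llm at call time."""
    import torch
    from ipex_llm.transformers import AutoModelForCausalLM

    model = AutoModelForCausalLM.from_pretrained(
        base_model,
        torch_dtype=torch.bfloat16,
        **bf16_load_kwargs(),
    )
    # With DeepSpeed ZeRO Stage 3, device placement is left to DeepSpeed.
    if not use_zero3:
        model = model.to("xpu")
    return model
```

Calling load_base_model with a model path and use_zero3=False corresponds to the single-card case described in this step.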
Step 4: LoRA Adapter Configuration
Prepare the model for training with IPEX-LLM's prepare_model_for_kbit_training and inject LoRA adapters via get_peft_model. The LoRA configuration targets all linear layers with training_mode="lora" (not "qlora"). Adapter parameters are the only trainable weights in the model.
Key considerations:
- Uses training_mode="lora" in LoraConfig (distinct from QLoRA's training_mode="qlora")
- Same target modules as QLoRA (q_proj, k_proj, v_proj, o_proj, up_proj, down_proj, gate_proj)
- IPEX-LLM's qlora module provides get_peft_model for both LoRA and QLoRA modes
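To see why only a small fraction of the weights is trainable, count the parameters LoRA adds: a rank-r adapter on a linear layer with d_in inputs and d_out outputs contributes r * (d_in + d_out) parameters. The sketch below applies this to the seven target modules; the layer dimensions are assumptions matching a Llama-2-7B-style architecture (hidden size 4096, MLP size 11008, 32 layers), and the rank of 8 is illustrative.

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Parameters in one adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


def count_trainable(hidden: int, intermediate: int, n_layers: int, r: int) -> int:
    """Total LoRA parameters when all seven projection modules are targeted."""
    shapes = {
        "q_proj": (hidden, hidden),
        "k_proj": (hidden, hidden),
        "v_proj": (hidden, hidden),
        "o_proj": (hidden, hidden),
        "up_proj": (hidden, intermediate),
        "gate_proj": (hidden, intermediate),
        "down_proj": (intermediate, hidden),
    }
    per_layer = sum(lora_params(d_in, d_out, r) for d_in, d_out in shapes.values())
    return per_layer * n_layers


if __name__ == "__main__":
    total = count_trainable(hidden=4096, intermediate=11008, n_layers=32, r=8)
    print(f"trainable LoRA parameters: {total:,}")
    print(f"share of a 7B base model: {total / 7e9:.2%}")
```

Under these assumptions the adapters total about 20 million parameters, well under 1% of the base model.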
Step 5: Training Execution
Launch training using HuggingFace Trainer with cosine learning rate schedule, bf16 mixed precision, and AdamW optimizer. For multi-card setups, the CCL backend handles gradient synchronization. DeepSpeed ZeRO Stage 3 can be enabled for memory-efficient distributed training.
Key considerations:
- Same training hyperparameters as QLoRA (lr=3e-5, cosine schedule, max_grad_norm=0.3)
- DDP backend is "ccl" for Intel XPU communication
- Checkpoints are saved only when the save_checkpoint flag is set
- Supports resume_from_checkpoint for interrupted training
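The hyperparameters above map onto HuggingFace TrainingArguments fields roughly as in the dictionary below. The optim value is an assumption, since the text only says AdamW. The cosine_lr function is a hypothetical helper that reproduces the shape of a cosine schedule with optional linear warmup, to show how the learning rate decays from 3e-5 to zero.

```python
import math

# Field names follow HuggingFace TrainingArguments; "adamw_torch" is assumed.
TRAINING_KWARGS = {
    "learning_rate": 3e-5,
    "lr_scheduler_type": "cosine",
    "max_grad_norm": 0.3,
    "bf16": True,
    "optim": "adamw_torch",
    "ddp_backend": "ccl",  # Intel XPU collective communication backend
}


def cosine_lr(step: int, total_steps: int, base_lr: float = 3e-5,
              warmup_steps: int = 0) -> float:
    """Learning rate at a given step for a half-cycle cosine decay."""
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr during warmup.
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    for step in (0, 250, 500, 750, 1000):
        print(step, f"{cosine_lr(step, 1000):.2e}")
```

Passing the dictionary as keyword arguments to TrainingArguments, alongside an output directory and batch-size settings, would be the usual way to apply it.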
Step 6: Adapter Export and Model Merging
Save the trained LoRA adapter weights. Optionally merge the adapter into the base model using the export_merged_model utility to produce a standalone fine-tuned model.
Key considerations:
- Adapter export is identical to QLoRA workflow
- Merged model can be loaded directly without PEFT library dependency
- Uses PyTorch save format (SafeTensors not yet supported)
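Conceptually, merging folds each adapter into its base weight as W' = W + (alpha / r) * B A, after which the adapter matrices are no longer needed at inference time. The pure-Python sketch below illustrates that arithmetic on tiny matrices. It is not the export_merged_model implementation, and the function names are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(W, A, B, alpha: float, r: int):
    """Return W + (alpha / r) * (B @ A).

    W is (d_out x d_in), B is (d_out x r), and A is (r x d_in).
    """
    scale = alpha / r
    delta = matmul(B, A)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(W, delta)
    ]


if __name__ == "__main__":
    W = [[1.0, 0.0], [0.0, 1.0]]   # base weight
    B = [[1.0], [2.0]]             # rank-1 adapter, output side
    A = [[3.0, 4.0]]               # rank-1 adapter, input side
    print(merge_lora(W, A, B, alpha=2.0, r=1))
```

Because the merged weight has the same shape as the original, the resulting model loads like any ordinary checkpoint.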