
Environment:Intel Ipex llm XPU Finetuning Environment

From Leeroopedia


Knowledge Sources
Domains: Infrastructure, LLM_Finetuning
Last Updated: 2026-02-09 12:00 GMT

Overview

Ubuntu 22.04 environment with Intel XPU (Arc/Flex/Max GPU), PyTorch 2.1+, IPEX-LLM, and HuggingFace ecosystem for QLoRA, LoRA, and DPO finetuning workflows.

Description

This environment provides an Intel XPU-accelerated context for LLM finetuning. It is built on the Intel OneAPI Base Toolkit and requires Intel GPU drivers for an Arc, Flex, or Data Center Max series GPU. IPEX-LLM is the core acceleration library; its `ipex_llm.transformers` module provides drop-in replacements for HuggingFace classes such as `AutoModelForCausalLM`. The environment supports three training modes: QLoRA with 4-bit NF4 quantization, LoRA with unquantized bf16 weights, and DPO. Distributed multi-GPU training uses Intel OneCCL as the communication backend rather than NVIDIA's NCCL.
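
As a rough illustration of the drop-in pattern described above, the sketch below maps a finetuning mode to a low-bit format and loads the model through `ipex_llm.transformers`. The `load_in_low_bit`, `optimize_model`, and `modules_to_not_convert` arguments follow the IPEX-LLM example scripts as understood here and should be checked against the installed release. `MODE_TO_LOW_BIT`, `low_bit_for_mode`, and `load_for_finetuning` are hypothetical names, and DPO is deliberately left out of the mapping because this page does not state its format.

```python
# Hypothetical mapping from finetuning mode to an IPEX-LLM low-bit format.
MODE_TO_LOW_BIT = {"qlora": "nf4", "lora": "bf16"}


def low_bit_for_mode(mode: str) -> str:
    """Return the low-bit format assumed for a finetuning mode."""
    try:
        return MODE_TO_LOW_BIT[mode]
    except KeyError:
        raise ValueError(f"unknown finetuning mode: {mode!r}") from None


def load_for_finetuning(base_model: str, mode: str = "qlora"):
    """Load a causal LM through IPEX-LLM instead of the stock HuggingFace class.

    Imports are deferred so this module can still be imported on a machine
    that does not have the XPU stack installed.
    """
    import torch
    from ipex_llm.transformers import AutoModelForCausalLM  # drop-in replacement

    return AutoModelForCausalLM.from_pretrained(
        base_model,
        load_in_low_bit=low_bit_for_mode(mode),  # assumption: see the note above
        optimize_model=False,
        torch_dtype=torch.bfloat16,
        modules_to_not_convert=["lm_head"],
    )
```

Keeping the mode lookup separate from the loader makes the choice of format easy to test without an Intel GPU present.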

Usage

Use this environment for any QLoRA, LoRA, or DPO finetuning workflow that requires Intel XPU acceleration. It is a prerequisite for the IPEX-LLM-compatible Trainer implementations, including QLoRA with `BitsAndBytesConfig`, bf16 LoRA with DeepSpeed ZeRO3, and DPO with TRL's `DPOTrainer`.

System Requirements

| Category | Requirement | Notes |
|---|---|---|
| OS | Ubuntu 22.04 LTS | Intel OneAPI base toolkit required |
| Hardware | Intel GPU (Arc/Flex/Max) | XPU device; iGPU also supported for smaller models |
| GPU Driver | Intel GPU drivers | Level Zero runtime required |
| Distributed | Intel OneCCL | Required for multi-GPU DDP training (replaces NCCL) |
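
A quick preflight check can confirm that the OneAPI and OneCCL environments were sourced before training starts. The sketch below assumes that `setvars.sh` exports `ONEAPI_ROOT` and that the CCL `vars.sh` exports `CCL_ROOT`; both variable names are assumptions to verify against the installed toolkit, and `missing_oneapi_components` is a hypothetical helper.

```python
import os

# Assumption: these variables are exported by the OneAPI environment scripts.
ONEAPI_MARKERS = {
    "ONEAPI_ROOT": "OneAPI Base Toolkit",
    "CCL_ROOT": "OneCCL",
}


def missing_oneapi_components(environ=None, need_ccl=True):
    """Return labels of toolkit components whose marker variable is unset."""
    environ = os.environ if environ is None else environ
    missing = []
    for var, label in ONEAPI_MARKERS.items():
        if var == "CCL_ROOT" and not need_ccl:
            continue  # single-GPU runs do not need OneCCL
        if not environ.get(var):
            missing.append(label)
    return missing


if __name__ == "__main__":
    print("missing:", missing_oneapi_components() or "nothing")
```

This only inspects environment variables; it does not prove that a GPU or the Level Zero runtime is usable.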

Dependencies

System Packages

  • Intel OneAPI Base Toolkit 2024.0.1+
  • `intel-opencl-icd`
  • `intel-level-zero-gpu`
  • `level-zero`, `level-zero-dev`

Python Packages

  • `ipex-llm[xpu]` (pre-release)
  • `torch` == 2.1.0a0 (XPU 2.1) or == 2.6.0+xpu (XPU 2.6)
  • `intel_extension_for_pytorch` == 2.1.10+xpu or == 2.6.10+xpu
  • `transformers` == 4.36.0 (finetuning) or == 4.53.2 (serving)
  • `peft` == 0.10.0
  • `bitsandbytes`
  • `accelerate` == 0.23.0
  • `datasets`
  • `scipy`
  • `fire`
  • `trl` >= 0.7.9, <= 0.9.6 (for DPO)
  • `deepspeed` >= 0.13.1 (for distributed LoRA)
  • `oneccl_bind_pt` (for multi-GPU DDP)
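
Because several of these packages are pinned to exact versions, it can help to compare the installed versions against the pins before launching a run. The sketch below uses the standard-library `importlib.metadata`; `EXPECTED_PINS` copies only the exact pins for the finetuning variant from the list above, and `pin_mismatches` is a hypothetical helper.

```python
from importlib import metadata

# Exact pins copied from the list above (finetuning variant).
EXPECTED_PINS = {
    "transformers": "4.36.0",
    "peft": "0.10.0",
    "accelerate": "0.23.0",
}


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def pin_mismatches(pins=EXPECTED_PINS, lookup=installed_version):
    """Map each mismatched package to an (expected, found) tuple."""
    mismatches = {}
    for package, expected in pins.items():
        found = lookup(package)
        if found != expected:
            mismatches[package] = (expected, found)
    return mismatches


if __name__ == "__main__":
    for name, (expected, found) in pin_mismatches().items():
        print(f"{name}: expected {expected}, found {found}")
```

The `lookup` parameter exists so the comparison can be exercised without the packages installed.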

Credentials

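The finetuning scripts read the following environment variables. `ACCELERATE_USE_XPU` is required; the distributed variables default to single-process values when unset, and the remaining variables are optional: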

  • `ACCELERATE_USE_XPU`: Must be set to `"true"` before importing accelerate. Enables Intel XPU device detection in HuggingFace Accelerate.
  • `LOCAL_RANK`: GPU rank for distributed training. Falls back to `MPI_LOCALRANKID` (Intel MPI), then to 0.
  • `WORLD_SIZE`: Total number of GPUs. Falls back to `PMI_SIZE` (Intel MPI), then to 1.
  • `RANK`: Process rank for DDP. The QLoRA script sets it to the local rank.
  • `MASTER_PORT`: Communication port (default: 29500).
  • `WANDB_PROJECT`: (Optional) Weights & Biases project name for logging.
  • `WANDB_WATCH`: (Optional) W&B gradient watching mode (`false`, `gradients`, `all`).
  • `WANDB_LOG_MODEL`: (Optional) W&B model logging (`false`, `true`).
  • `SYCL_CACHE_PERSISTENT`: Set to `1` for persistent SYCL compilation cache (faster startup).
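
The ordering constraint on `ACCELERATE_USE_XPU` is easy to get wrong, so a small guard can make it explicit. The sketch below sets the variable and refuses to continue if `accelerate` has already been imported; `prepare_xpu_env` is a hypothetical helper, and treating `SYCL_CACHE_PERSISTENT=1` as a default is a choice made for this example only.

```python
import os
import sys


def prepare_xpu_env(environ=None, loaded_modules=None):
    """Set the XPU-related variables, failing if accelerate is already loaded."""
    environ = os.environ if environ is None else environ
    loaded_modules = sys.modules if loaded_modules is None else loaded_modules

    if "accelerate" in loaded_modules:
        # Setting the variable now would come too late to take effect.
        raise RuntimeError("set ACCELERATE_USE_XPU before importing accelerate")

    environ["ACCELERATE_USE_XPU"] = "true"
    # Keep a user-provided value if present; otherwise enable the SYCL cache.
    environ.setdefault("SYCL_CACHE_PERSISTENT", "1")
    return environ
```

Calling `prepare_xpu_env()` as the first statement of a training script, before any HuggingFace import, mirrors the ordering described above.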

Quick Install

# Source Intel OneAPI environment
source /opt/intel/oneapi/setvars.sh

# Set XPU for Accelerate (must be before import)
export ACCELERATE_USE_XPU=true

# Install IPEX-LLM with XPU support
pip install --pre --upgrade "ipex-llm[xpu]" --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/

# Install finetuning dependencies
pip install transformers==4.36.0 peft==0.10.0 datasets bitsandbytes scipy fire accelerate==0.23.0

# For DPO training
pip install "trl>=0.7.9,<=0.9.6"

# For distributed multi-GPU training
pip install oneccl_bind_pt -f https://developer.intel.com/ipex-whl-stable
pip install "deepspeed>=0.13.1"

Code Evidence

Environment variable setup from `alpaca_qlora_finetuning.py:35`:

os.environ["ACCELERATE_USE_XPU"] = "true"

Distributed environment detection from `alpaca_qlora_finetuning.py:61-67`:

local_rank = get_int_from_env(["LOCAL_RANK","MPI_LOCALRANKID"], "0")
world_size = get_int_from_env(["WORLD_SIZE","PMI_SIZE"], "1")
port = get_int_from_env(["MASTER_PORT"], 29500)
os.environ["LOCAL_RANK"] = str(local_rank)
os.environ["WORLD_SIZE"] = str(world_size)
os.environ["RANK"] = str(local_rank)
os.environ["MASTER_PORT"] = str(port)
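
The `get_int_from_env` helper is not reproduced on this page. A minimal sketch of what such a helper might look like is given below, assuming it returns the first non-negative integer found among the listed variables and otherwise the default; the extra `environ` parameter is an addition for testability and is not part of the call shown above.

```python
import os


def get_int_from_env(env_keys, default, environ=None):
    """Return the first non-negative integer found under env_keys, else default."""
    environ = os.environ if environ is None else environ
    for key in env_keys:
        raw = environ.get(key)
        if raw is None:
            continue  # variable not set, try the next candidate
        value = int(raw)
        if value >= 0:
            return value
    # The defaults above are passed as strings ("0", "1"), so coerce to int.
    return int(default)


if __name__ == "__main__":
    local_rank = get_int_from_env(["LOCAL_RANK", "MPI_LOCALRANKID"], "0")
    print(f"xpu:{local_rank}")
```

Listing `LOCAL_RANK` before `MPI_LOCALRANKID` means a value set by a `torchrun`-style launcher takes precedence over the one set by Intel MPI.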

XPU device placement from `alpaca_qlora_finetuning.py:199`:

model = model.to(f'xpu:{os.environ.get("LOCAL_RANK", 0)}')

CCL DDP backend from `alpaca_qlora_finetuning.py:268`:

ddp_backend="ccl",
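
To show where this argument might sit alongside the other XPU-specific settings on this page, the sketch below builds a dictionary of keyword arguments intended for `transformers.TrainingArguments`. `ddp_backend` and `save_safetensors` are existing `TrainingArguments` parameters; leaving `ddp_backend` as `None` for single-GPU runs is an assumption of this example, and `xpu_training_kwargs` is a hypothetical helper.

```python
def xpu_training_kwargs(world_size: int = 1) -> dict:
    """Build XPU-specific keyword arguments for TrainingArguments."""
    distributed = world_size > 1
    return {
        # OneCCL replaces NCCL; assumed unnecessary for a single process.
        "ddp_backend": "ccl" if distributed else None,
        # Checkpoints are saved without safetensors for compatibility.
        "save_safetensors": False,
    }


if __name__ == "__main__":
    print(xpu_training_kwargs(world_size=2))
```

The result can be unpacked into the constructor, for example `TrainingArguments(output_dir="out", **xpu_training_kwargs(world_size))`, where `output_dir="out"` is a placeholder.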

W&B environment check from `common/utils/util.py:63-75`:

def wandb_check(wandb_project, wandb_watch, wandb_log_model):
    use_wandb = len(wandb_project) > 0 or (
        "WANDB_PROJECT" in os.environ and len(os.environ["WANDB_PROJECT"]) > 0
    )
    if len(wandb_project) > 0:
        os.environ["WANDB_PROJECT"] = wandb_project
    if len(wandb_watch) > 0:
        os.environ["WANDB_WATCH"] = wandb_watch
    if len(wandb_log_model) > 0:
        os.environ["WANDB_LOG_MODEL"] = wandb_log_model
    return use_wandb

Common Errors

| Error Message | Cause | Solution |
|---|---|---|
| `ACCELERATE_USE_XPU not set` | XPU environment variable not configured | Set `export ACCELERATE_USE_XPU=true` before importing accelerate |
| `RuntimeError: No XPU device found` | Intel GPU drivers not installed | Install Intel GPU drivers and Level Zero runtime |
| `ModuleNotFoundError: No module named 'oneccl_bindings_for_pytorch'` | OneCCL not installed | `pip install oneccl_bind_pt` from Intel index |
| `DDP backend 'ccl' not available` | OneCCL environment not sourced | `source /opt/intel/oneapi/ccl/latest/env/vars.sh --force` |
| `paged_adamw_8bit is not supported yet` | Paged AdamW not available on Intel platform | Use `optim="adamw_torch"` or `optim="adamw_hf"` instead |
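
The optimizer error can be avoided by normalizing the requested optimizer name before it reaches the trainer. The sketch below is one possible guard; treating every name that starts with `paged_` as unsupported is an assumption that goes beyond the single `paged_adamw_8bit` case documented here, and `xpu_safe_optim` is a hypothetical helper.

```python
def xpu_safe_optim(requested: str, fallback: str = "adamw_torch") -> str:
    """Swap paged optimizers for a fallback that runs on Intel XPU."""
    if requested.startswith("paged_"):
        # Assumption: all paged variants are unavailable, not only the 8-bit one.
        return fallback
    return requested


if __name__ == "__main__":
    print(xpu_safe_optim("paged_adamw_8bit"))
```

Passing `fallback="adamw_hf"` selects the other replacement named in the table.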

Compatibility Notes

  • Intel XPU Only: This environment targets Intel Arc, Flex, and Data Center Max GPUs. NVIDIA CUDA GPUs are not supported.
  • CCL vs NCCL: Multi-GPU training uses Intel OneCCL (`ddp_backend="ccl"`) instead of NVIDIA NCCL. The OneAPI CCL environment must be sourced before training.
  • DeepSpeed ZeRO3: Requires IPEX-LLM compatibility patches for `_constant_buffered_norm2`. Applied automatically when `deepspeed` config contains `"zero3"`.
  • SafeTensors: Checkpoint saving uses `save_safetensors=False` for compatibility.
  • PyTorch Version: Two XPU variants exist: PyTorch 2.1 (legacy) and PyTorch 2.6 (recommended). `torch` and `intel_extension_for_pytorch` must come from the same variant (2.1.0a0 with 2.1.10+xpu, or 2.6.0+xpu with 2.6.10+xpu).
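
The PyTorch variant constraint can be checked mechanically by comparing the major and minor components of the two version strings. The sketch below is one way to do so with the standard library; `xpu_variant` and `same_xpu_variant` are hypothetical helpers, and the check says nothing about whether the patch-level builds are compatible.

```python
import re


def xpu_variant(version: str) -> str:
    """Extract the 'major.minor' prefix from a version such as '2.1.10+xpu'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return f"{match.group(1)}.{match.group(2)}"


def same_xpu_variant(torch_version: str, ipex_version: str) -> bool:
    """True when torch and intel_extension_for_pytorch share a variant."""
    return xpu_variant(torch_version) == xpu_variant(ipex_version)


if __name__ == "__main__":
    print(same_xpu_variant("2.1.0a0", "2.1.10+xpu"))
```

In a real environment the two strings would come from `torch.__version__` and `intel_extension_for_pytorch.__version__`.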
