Environment:Intel Ipex llm XPU Inference Environment
| Knowledge Sources | |
|---|---|
| Domains | Infrastructure, LLM_Inference |
| Last Updated | 2026-02-09 04:00 GMT |
Overview
Intel GPU (XPU) environment for LLM inference with IPEX-LLM optimizations, supporting DeepSpeed AutoTP distributed inference, hybrid CPU+GPU inference, and lookahead speculative decoding.
Description
This environment provides an Intel XPU-accelerated context for LLM inference using IPEX-LLM. It supports four inference modes: standard single-GPU inference, DeepSpeed AutoTP tensor-parallel inference across multiple Intel GPUs, hybrid inference that splits model layers between CPU and GPU for memory-constrained scenarios, and lookahead speculative decoding for improved throughput. The core library is `ipex-llm[xpu]`, used with the PyTorch XPU backend and HuggingFace Transformers for model loading and tokenization.
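
A minimal sketch of the standard single-GPU mode, assuming the `ipex_llm.transformers.AutoModelForCausalLM` drop-in loader with 4-bit loading (`load_in_4bit=True`). The quantization choice, the helper names `load_xpu_model` and `generate_text`, and the default values are illustrative assumptions rather than settings this page prescribes. Imports are deferred into the functions so the file can be imported on a machine without the XPU stack.

```python
def load_xpu_model(model_path: str, device: str = "xpu"):
    """Load a causal LM through IPEX-LLM and move it to the Intel GPU.

    `model_path` is a local directory or a HuggingFace model id.
    """
    # Deferred imports: these require the XPU stack from this environment.
    from ipex_llm.transformers import AutoModelForCausalLM
    from transformers import AutoTokenizer

    # Assumption: 4-bit low-bit loading; other precisions are possible.
    model = AutoModelForCausalLM.from_pretrained(model_path, load_in_4bit=True)
    model = model.to(device)
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    return model, tokenizer


def generate_text(model, tokenizer, prompt: str,
                  max_new_tokens: int = 32, device: str = "xpu") -> str:
    """Tokenize `prompt`, generate on the XPU, and decode the result."""
    import torch

    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
    with torch.inference_mode():
        output = model.generate(input_ids, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output[0], skip_special_tokens=True)
```

With the environment installed, usage would look like `model, tok = load_xpu_model("/path/to/model")` followed by `generate_text(model, tok, "What is AI?")`; the model path here is a placeholder.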
Usage
Use this environment for any XPU Model Inference workflow, including DeepSpeed AutoTP Inference, Hybrid CPU+GPU Inference, and Lookahead Decoding. It is a prerequisite for running IPEX-LLM optimized model inference on Intel GPUs in any of these modes.
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| OS | Ubuntu 22.04 LTS | Intel oneAPI Base Toolkit required |
| Hardware | Intel GPU (Arc/Flex/Max) | XPU device; multiple GPUs supported for tensor parallel |
| GPU Driver | Intel GPU drivers | Level Zero runtime required |
| RAM | 32GB+ recommended | Hybrid inference uses CPU RAM for offloaded layers |
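
To judge whether a model is likely to need hybrid inference or tensor parallelism, a rough estimate of weight memory can help. The sketch below uses only the arithmetic `parameters × bits per weight / 8`; the function names and the 80% headroom factor are hypothetical, and the estimate ignores activations, KV cache, and runtime overhead.

```python
def weight_memory_gib(num_params: float, bits_per_weight: int) -> float:
    """Approximate memory taken by model weights alone, in GiB."""
    return num_params * bits_per_weight / 8 / 2**30


def fits_on_gpu(num_params: float, bits_per_weight: int,
                gpu_mem_gib: float, headroom: float = 0.8) -> bool:
    """Rough check: do the weights fit within a fraction of GPU memory?

    `headroom` is an assumed safety factor, not a measured value.
    """
    return weight_memory_gib(num_params, bits_per_weight) <= gpu_mem_gib * headroom


if __name__ == "__main__":
    # Example: a 7-billion-parameter model on a 16 GiB GPU.
    for bits in (16, 4):
        size = weight_memory_gib(7e9, bits)
        print(f"{bits}-bit weights: {size:.2f} GiB, fits: {fits_on_gpu(7e9, bits, 16)}")
```

When the check fails, the weights have to be split, either across GPUs or between GPU memory and CPU RAM.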
Dependencies
System Packages
- Intel oneAPI Base Toolkit
- `intel-opencl-icd`
- `intel-level-zero-gpu`
- `level-zero`
Python Packages
- `ipex-llm[xpu]` (pre-release)
- `torch` (XPU variant)
- `intel_extension_for_pytorch` (XPU variant)
- `transformers`
- `deepspeed` (optional, for AutoTP inference)
- `oneccl_bind_pt` (optional, for multi-GPU communication)
Credentials
No API keys or tokens are required for local inference. The following runtime configuration may be needed:
- `SYCL_CACHE_PERSISTENT`: Set to `1` for persistent SYCL compilation cache (faster startup).
- `MASTER_PORT`: Communication port for distributed DeepSpeed inference (default: 29500).
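
These variables can also be set from Python before the inference libraries are imported. The following is a small sketch; the helper name `configure_runtime` and its `env` parameter are hypothetical, and `setdefault` is used so that values already exported in the shell are not overwritten.

```python
import os


def configure_runtime(master_port: int = 29500, env=None) -> dict:
    """Apply default runtime settings without overriding existing values.

    `env` defaults to os.environ; a plain dict can be passed for testing.
    """
    env = os.environ if env is None else env
    # Persistent SYCL compilation cache for faster startup on later runs.
    env.setdefault("SYCL_CACHE_PERSISTENT", "1")
    # Communication port for distributed DeepSpeed inference.
    env.setdefault("MASTER_PORT", str(master_port))
    return {key: env[key] for key in ("SYCL_CACHE_PERSISTENT", "MASTER_PORT")}


if __name__ == "__main__":
    print(configure_runtime())
```

To be safe, call this before importing `torch` or `ipex_llm`, so the settings are in place when the runtime initializes.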
Quick Install
```bash
# Source Intel oneAPI environment
source /opt/intel/oneapi/setvars.sh

# Install IPEX-LLM with XPU support
pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/

# Install inference dependencies
pip install transformers

# For DeepSpeed AutoTP distributed inference
pip install deepspeed oneccl_bind_pt -f https://developer.intel.com/ipex-whl-stable

# Set runtime environment
export SYCL_CACHE_PERSISTENT=1
```
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `RuntimeError: No XPU device found` | Intel GPU drivers not installed | Install Intel GPU drivers and Level Zero runtime |
| `Out of memory on XPU` | Model too large for single GPU | Use hybrid inference (CPU+GPU) or DeepSpeed AutoTP across multiple GPUs |
| `DeepSpeed AutoTP initialization failed` | DeepSpeed or oneCCL not installed or not configured | Install DeepSpeed and `oneccl_bind_pt`, then source the oneCCL environment variables |
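
Most of the errors above can be narrowed down with a quick pre-flight check. The sketch below reports which packages are importable and whether PyTorch sees an XPU device. The helper names are hypothetical, and the import name `oneccl_bindings_for_pytorch` for the `oneccl_bind_pt` package is an assumption that should be verified against the installed version.

```python
import importlib.util

REQUIRED = ("torch", "intel_extension_for_pytorch", "ipex_llm", "transformers")
# Assumption: oneccl_bind_pt is imported as oneccl_bindings_for_pytorch.
OPTIONAL = ("deepspeed", "oneccl_bindings_for_pytorch")


def check_packages(names) -> dict:
    """Map each top-level module name to whether it can be found."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


def xpu_available() -> bool:
    """Return True only if torch imports and reports a usable XPU device."""
    try:
        import torch
    except Exception:
        return False
    try:
        # On some stacks this import is what registers the torch.xpu backend.
        import intel_extension_for_pytorch  # noqa: F401
    except Exception:
        pass
    xpu = getattr(torch, "xpu", None)
    try:
        return bool(xpu is not None and xpu.is_available())
    except Exception:
        return False


if __name__ == "__main__":
    print("required:", check_packages(REQUIRED))
    print("optional:", check_packages(OPTIONAL))
    print("XPU available:", xpu_available())
```

If all required packages are found but no XPU is reported, the drivers, the Level Zero runtime, or the sourced oneAPI environment are the first things to check.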
Compatibility Notes
- Intel XPU Only: This environment targets Intel Arc, Flex, and Data Center Max GPUs. NVIDIA CUDA GPUs are not supported.
- Hybrid Inference: When GPU memory is insufficient, layers can be offloaded to CPU RAM transparently via IPEX-LLM hybrid mode.
- Lookahead Decoding: Speculative decoding generates multiple candidate tokens per step for improved throughput, requiring additional GPU memory for the speculation buffer.
- Warmup Required: The first `model.generate()` call triggers SYCL kernel compilation and acts as a warmup, so it is slower than later calls; take timing measurements from subsequent calls only.
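
The warmup note can be applied with a small timing helper that discards the first call or calls. This is a generic sketch: `time_after_warmup` and its parameters are hypothetical names, and the function works with any zero-argument callable.

```python
import time
from statistics import mean


def time_after_warmup(fn, warmup: int = 1, runs: int = 3, sync=None):
    """Call `fn` `warmup` times untimed, then time `runs` further calls.

    `sync` is an optional callable that blocks until device work finishes.
    Returns (mean_seconds, list_of_seconds).
    """
    for _ in range(warmup):
        fn()  # warmup call: kernel compilation happens here, not timed
    if sync is not None:
        sync()
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()  # wait for queued device work before stopping the clock
        timings.append(time.perf_counter() - start)
    return mean(timings), timings
```

On an XPU, `fn` would typically wrap the generate call, for example `lambda: model.generate(input_ids, max_new_tokens=32)`, with `sync=torch.xpu.synchronize` so asynchronous device work is included in each measurement.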