Environment: Intel IPEX-LLM XPU Inference Environment

Knowledge Sources	IPEX-LLM
Domains	Infrastructure, LLM_Inference
Last Updated	2026-02-09 04:00 GMT

Overview

Intel GPU (XPU) environment for model inference with IPEX-LLM optimizations, supporting DeepSpeed AutoTP distributed inference, hybrid CPU+GPU inference, and lookahead speculative decoding.

Description

This environment provides an Intel XPU-accelerated context for LLM inference using IPEX-LLM. It supports four inference modes: standard single-GPU inference, DeepSpeed AutoTP tensor-parallel inference across multiple Intel GPUs, hybrid inference that splits model layers between CPU and GPU for memory-constrained scenarios, and lookahead speculative decoding for improved throughput. The core library is `ipex-llm[xpu]`, used with the PyTorch XPU backend and HuggingFace Transformers for model loading and tokenization.
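
As a rough illustration of the standard single-GPU mode, the sketch below loads a model through the IPEX-LLM drop-in `AutoModelForCausalLM` class and moves it to the XPU device. The model path, the prompt, and the `max_new_tokens` value are placeholders, and `load_in_4bit=True` is only one of the available low-bit options; treat this as a sketch under those assumptions, not as the reference workflow.

```python
def load_on_xpu(model_path, load_in_4bit=True):
    """Load a causal LM with IPEX-LLM optimizations and place it on the Intel GPU."""
    # Imports are deferred so this file can be imported on machines
    # that do not have the XPU stack installed.
    from ipex_llm.transformers import AutoModelForCausalLM
    from transformers import AutoTokenizer

    model = AutoModelForCausalLM.from_pretrained(model_path, load_in_4bit=load_in_4bit)
    model = model.to("xpu")
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    return model, tokenizer


def generate_text(model, tokenizer, prompt, max_new_tokens=32):
    """Run one generation on the XPU device and return the decoded text."""
    import torch

    input_ids = tokenizer.encode(prompt, return_tensors="pt").to("xpu")
    with torch.inference_mode():
        output = model.generate(input_ids, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output[0], skip_special_tokens=True)


# Example (placeholder path, requires the full environment):
# model, tokenizer = load_on_xpu("/path/to/model")
# print(generate_text(model, tokenizer, "What is an XPU?"))
```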

Usage

Use this environment for any XPU model inference workflow, including DeepSpeed AutoTP inference, hybrid CPU+GPU inference, and lookahead decoding. It is a prerequisite for running IPEX-LLM optimized model inference on Intel GPUs with any of these parallelism and decoding strategies.

System Requirements

Category	Requirement	Notes
OS	Ubuntu 22.04 LTS	Intel oneAPI Base Toolkit required
Hardware	Intel GPU (Arc/Flex/Max)	XPU device; multiple GPUs supported for tensor-parallel inference
GPU Driver	Intel GPU drivers	Level Zero runtime required
RAM	32GB+ recommended	Hybrid inference uses CPU RAM for offloaded layers
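
To judge whether a model is likely to fit on one GPU or will need hybrid offloading to CPU RAM, a weights-only estimate can help. The sketch below is simple arithmetic (parameters times bits per weight); it ignores the KV cache, activations, and runtime overhead, and the `headroom` fraction is a hypothetical safety margin, not a value from IPEX-LLM.

```python
def estimate_weight_gib(num_params, bits_per_weight):
    """Weights-only footprint in GiB; ignores KV cache, activations, and overhead."""
    return num_params * bits_per_weight / 8 / 2**30


def fits_on_gpu(num_params, bits_per_weight, gpu_mem_gib, headroom=0.8):
    """True if the weights fit within a fraction (`headroom`) of GPU memory.

    `headroom` is an illustrative margin left for everything the estimate ignores.
    """
    return estimate_weight_gib(num_params, bits_per_weight) <= gpu_mem_gib * headroom


if __name__ == "__main__":
    # A 7-billion-parameter model at 4-bit versus 16-bit weights on a 16 GiB GPU.
    for bits in (4, 16):
        size = estimate_weight_gib(7e9, bits)
        print(f"{bits}-bit: {size:.2f} GiB, fits: {fits_on_gpu(7e9, bits, 16)}")
```

Under these assumptions, a result of `False` suggests hybrid CPU+GPU inference or tensor parallelism across several GPUs.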

Dependencies

System Packages

Intel oneAPI Base Toolkit
`intel-opencl-icd`
`intel-level-zero-gpu`
`level-zero`

Python Packages

`ipex-llm[xpu]` (pre-release)
`torch` (XPU variant)
`intel_extension_for_pytorch` (XPU variant)
`transformers`
`deepspeed` (optional, for AutoTP inference)
`oneccl_bind_pt` (optional, for multi-GPU communication)
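
A quick way to see which of these packages are present is to query installed distribution metadata. The sketch below uses only the standard library; the split into `REQUIRED` and `OPTIONAL` mirrors the list above, and the distribution names are assumed to match the package names as written there.

```python
from importlib import metadata

# Distribution names taken from the dependency list; assumed to match what pip installed.
REQUIRED = ["ipex-llm", "torch", "intel_extension_for_pytorch", "transformers"]
OPTIONAL = ["deepspeed", "oneccl_bind_pt"]


def installed_versions(names):
    """Map each distribution name to its installed version, or None if it is missing."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for label, names in (("required", REQUIRED), ("optional", OPTIONAL)):
        for name, version in installed_versions(names).items():
            print(f"{label:8} {name:30} {version or 'NOT INSTALLED'}")
```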

Credentials

No API keys or tokens are required for local inference. The following runtime configuration may be needed:

`SYCL_CACHE_PERSISTENT`: Set to `1` for persistent SYCL compilation cache (faster startup).
`MASTER_PORT`: Communication port for distributed DeepSpeed inference (default: 29500).
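
When the variables are not exported in the shell, they can be set from Python before the XPU libraries are imported. This is a minimal sketch: the helper name `configure_runtime` is hypothetical, and it uses `setdefault` so that values already present in the environment are left untouched.

```python
import os


def configure_runtime(master_port=None):
    """Set runtime variables without overriding values already exported in the shell."""
    os.environ.setdefault("SYCL_CACHE_PERSISTENT", "1")
    if master_port is not None:
        # Only relevant for distributed DeepSpeed inference.
        os.environ.setdefault("MASTER_PORT", str(master_port))
    return {
        key: os.environ.get(key)
        for key in ("SYCL_CACHE_PERSISTENT", "MASTER_PORT")
    }


if __name__ == "__main__":
    # Call this before importing torch or ipex_llm so the settings are visible to them.
    print(configure_runtime(master_port=29500))
```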

Quick Install

# Source the Intel oneAPI environment
source /opt/intel/oneapi/setvars.sh

# Install IPEX-LLM with XPU support (quotes keep the shell from expanding the brackets)
pip install --pre --upgrade "ipex-llm[xpu]" --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/

# Install inference dependencies
pip install transformers

# For DeepSpeed AutoTP distributed inference
pip install deepspeed oneccl_bind_pt -f https://developer.intel.com/ipex-whl-stable

# Set runtime environment
export SYCL_CACHE_PERSISTENT=1

Common Errors

Error Message	Cause	Solution
`RuntimeError: No XPU device found`	Intel GPU drivers not installed	Install Intel GPU drivers and Level Zero runtime
`Out of memory on XPU`	Model too large for single GPU	Use hybrid inference (CPU+GPU) or DeepSpeed AutoTP across multiple GPUs
`DeepSpeed AutoTP initialization failed`	DeepSpeed or oneCCL not configured	Install DeepSpeed and source the oneCCL environment variables
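
Before debugging the errors above, it can help to confirm that PyTorch sees an XPU device at all. The sketch below is a hedged diagnostic: the function name `check_xpu` is hypothetical, and it assumes that older stacks expose `torch.xpu` only after `intel_extension_for_pytorch` is imported, while newer PyTorch builds may provide it natively.

```python
def check_xpu():
    """Return (ok, message) describing whether an XPU device is visible to PyTorch."""
    try:
        import torch
    except (ImportError, OSError):
        return False, "PyTorch could not be imported"

    try:
        # On older XPU stacks this import is what registers torch.xpu.
        import intel_extension_for_pytorch  # noqa: F401
    except Exception:
        pass  # not fatal: newer PyTorch builds may include XPU support natively

    xpu = getattr(torch, "xpu", None)
    if xpu is None or not xpu.is_available():
        return False, "No XPU device visible: check Intel GPU drivers and the Level Zero runtime"
    return True, f"{xpu.device_count()} XPU device(s) visible"


if __name__ == "__main__":
    ok, message = check_xpu()
    print("OK" if ok else "FAIL", "-", message)
```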

Compatibility Notes

Intel XPU Only: This environment targets Intel Arc, Flex, and Data Center Max GPUs. NVIDIA CUDA GPUs are not supported.
Hybrid Inference: When GPU memory is insufficient, layers can be offloaded to CPU RAM transparently via IPEX-LLM hybrid mode.
Lookahead Decoding: Speculative decoding generates multiple candidate tokens per step for improved throughput, requiring additional GPU memory for the speculation buffer.
Warmup Required: The first `model.generate()` call is a warmup for SYCL kernel compilation; timing measurements should use subsequent calls.
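
The warmup note can be sketched as a small timing helper that runs the call untimed first and only then measures. The helper name `timed_runs` and its parameters are illustrative; `sync` is an optional callable for waiting on queued device work, for which `torch.xpu.synchronize` would be a natural choice on this stack.

```python
import time


def timed_runs(fn, warmup=1, runs=3, sync=None):
    """Call fn() `warmup` times untimed, then return wall-clock seconds for `runs` calls."""
    for _ in range(warmup):
        fn()  # absorbs one-time costs such as SYCL kernel compilation
        if sync:
            sync()

    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        if sync:
            sync()  # wait for queued device work before stopping the clock
        timings.append(time.perf_counter() - start)
    return timings


# Example (assumes `model` and `input_ids` already exist on the XPU device):
# import torch
# times = timed_runs(lambda: model.generate(input_ids, max_new_tokens=32),
#                    sync=torch.xpu.synchronize)
# print(f"mean latency: {sum(times) / len(times):.3f} s")
```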
