Environment:Intel Ipex llm XPU Inference Environment

From Leeroopedia


Knowledge Sources
Domains: Infrastructure, LLM_Inference
Last Updated: 2026-02-09 04:00 GMT

Overview

Intel GPU (XPU) environment for model inference with IPEX-LLM optimizations, supporting DeepSpeed AutoTP distributed inference, hybrid CPU+GPU inference, and speculative lookahead decoding.

Description

This environment provides an Intel XPU-accelerated context for LLM inference using IPEX-LLM. It supports four inference modes: standard single-GPU inference, DeepSpeed AutoTP tensor-parallel inference across multiple Intel GPUs, hybrid inference that splits model layers between CPU and GPU for memory-constrained scenarios, and lookahead speculative decoding for improved throughput. The core library is `ipex-llm[xpu]`, used with the PyTorch XPU backend and HuggingFace Transformers for model loading and tokenization.
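As a rough illustration of the standard single-GPU mode, the sketch below follows the pattern used in IPEX-LLM's published GPU examples. The function name `run_xpu_inference`, the choice of `load_in_4bit=True`, and the `max_new_tokens` default are assumptions made for this example, and the model path is a placeholder you supply. The heavy imports are deferred into the function body so the file can be loaded on a machine without the XPU stack.

```python
def run_xpu_inference(model_path: str, prompt: str, max_new_tokens: int = 32) -> str:
    """Load a model through IPEX-LLM and generate text on an Intel GPU."""
    # Deferred imports: these need the ipex-llm[xpu] environment to be installed.
    import torch
    from ipex_llm.transformers import AutoModelForCausalLM
    from transformers import AutoTokenizer

    # load_in_4bit=True is an illustrative low-bit setting, not a requirement.
    model = AutoModelForCausalLM.from_pretrained(
        model_path, load_in_4bit=True, trust_remote_code=True
    )
    model = model.to("xpu")  # move the optimized model to the Intel GPU
    tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

    with torch.inference_mode():
        input_ids = tokenizer.encode(prompt, return_tensors="pt").to("xpu")
        output = model.generate(input_ids, max_new_tokens=max_new_tokens)
        torch.xpu.synchronize()  # wait for queued GPU work to finish
    return tokenizer.decode(output[0].cpu(), skip_special_tokens=True)
```

Calling `run_xpu_inference("/path/to/model", "What is AI?")` on a configured machine should return the decoded completion; the other modes (AutoTP, hybrid, lookahead) need additional setup that this sketch does not cover.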

Usage

Use this environment for any XPU Model Inference workflow including DeepSpeed AutoTP Inference, Hybrid CPU+GPU Inference, and Lookahead Decoding. It is the mandatory prerequisite for running IPEX-LLM optimized model inference on Intel GPUs with various parallelism and decoding strategies.

System Requirements

| Category | Requirement | Notes |
|---|---|---|
| OS | Ubuntu 22.04 LTS | Intel OneAPI base toolkit required |
| Hardware | Intel GPU (Arc/Flex/Max) | XPU device; multiple GPUs supported for tensor parallel |
| GPU Driver | Intel GPU drivers | Level Zero runtime required |
| RAM | 32GB+ recommended | Hybrid inference uses CPU RAM for offloaded layers |

Dependencies

System Packages

  • Intel OneAPI Base Toolkit
  • `intel-opencl-icd`
  • `intel-level-zero-gpu`
  • `level-zero`

Python Packages

  • `ipex-llm[xpu]` (pre-release)
  • `torch` (XPU variant)
  • `intel_extension_for_pytorch` (XPU variant)
  • `transformers`
  • `deepspeed` (optional, for AutoTP inference)
  • `oneccl_bind_pt` (optional, for multi-GPU communication)
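Before running a workload it can help to confirm which of the packages above are visible to the interpreter. The following is a small sketch using only the standard library; the import names (for example `ipex_llm` for `ipex-llm`, and `oneccl_bindings_for_pytorch` for `oneccl_bind_pt`) are assumptions about how each distribution is imported, and `check_packages` is a hypothetical helper, not part of IPEX-LLM.

```python
from importlib.util import find_spec

# Import names are assumptions derived from the package list above.
REQUIRED = ["ipex_llm", "torch", "intel_extension_for_pytorch", "transformers"]
OPTIONAL = ["deepspeed", "oneccl_bindings_for_pytorch"]


def check_packages(names):
    """Return {module_name: bool} by locating each module without importing it."""
    return {name: find_spec(name) is not None for name in names}


if __name__ == "__main__":
    for label, names in (("required", REQUIRED), ("optional", OPTIONAL)):
        for name, present in check_packages(names).items():
            status = "ok" if present else "MISSING"
            print(f"{label:9s} {name:32s} {status}")
```

This only checks that a module can be located; it does not verify that the installed `torch` build is the XPU variant.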

Credentials

No API keys or tokens are required for local inference. The following runtime configuration may be needed:

  • `SYCL_CACHE_PERSISTENT`: Set to `1` for persistent SYCL compilation cache (faster startup).
  • `MASTER_PORT`: Communication port for distributed DeepSpeed inference (default: 29500).
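When the process is launched from Python rather than a shell, the same two settings can be applied programmatically. This is a minimal sketch; `configure_runtime` is a hypothetical helper, and it only fills in values that are not already set. It is assumed here that the variables should be in place before the GPU runtime or DeepSpeed is initialized.

```python
import os


def configure_runtime(env=None, master_port=29500):
    """Apply the runtime defaults listed above without overriding existing values."""
    env = os.environ if env is None else env
    # Persistent SYCL compilation cache for faster startup.
    env.setdefault("SYCL_CACHE_PERSISTENT", "1")
    # Communication port for distributed DeepSpeed inference.
    env.setdefault("MASTER_PORT", str(master_port))
    return {key: env[key] for key in ("SYCL_CACHE_PERSISTENT", "MASTER_PORT")}
```

Passing a plain dictionary instead of `os.environ` makes the helper easy to try without touching the real process environment.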

Quick Install

# Source Intel OneAPI environment
source /opt/intel/oneapi/setvars.sh

# Install IPEX-LLM with XPU support
pip install --pre --upgrade "ipex-llm[xpu]" --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/

# Install inference dependencies
pip install transformers

# For DeepSpeed AutoTP distributed inference
pip install deepspeed oneccl_bind_pt -f https://developer.intel.com/ipex-whl-stable

# Set runtime environment
export SYCL_CACHE_PERSISTENT=1

Common Errors

| Error Message | Cause | Solution |
|---|---|---|
| `RuntimeError: No XPU device found` | Intel GPU drivers not installed | Install Intel GPU drivers and Level Zero runtime |
| `Out of memory on XPU` | Model too large for single GPU | Use hybrid inference (CPU+GPU) or DeepSpeed AutoTP across multiple GPUs |
| `DeepSpeed AutoTP initialization failed` | DeepSpeed or OneCCL not configured | Install DeepSpeed and source OneCCL environment variables |
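The error entries above can be turned into a small lookup for use in launch scripts or log triage. This is only a sketch: `suggest_fix` and `KNOWN_ERRORS` are hypothetical names, the remedies are copied from this page, and matching is a plain case-insensitive substring test that may not cover every variant of the real messages.

```python
# Remedies copied from the Common Errors entries on this page.
KNOWN_ERRORS = {
    "No XPU device found": "Install Intel GPU drivers and Level Zero runtime",
    "Out of memory on XPU": (
        "Use hybrid inference (CPU+GPU) or DeepSpeed AutoTP across multiple GPUs"
    ),
    "DeepSpeed AutoTP initialization failed": (
        "Install DeepSpeed and source OneCCL environment variables"
    ),
}


def suggest_fix(message: str):
    """Return the documented remedy for a known error message, or None."""
    lowered = message.lower()
    for needle, remedy in KNOWN_ERRORS.items():
        if needle.lower() in lowered:
            return remedy
    return None
```

For example, wrapping a model load in `try`/`except RuntimeError as exc` and printing `suggest_fix(str(exc))` gives the operator the documented next step when one is known.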

Compatibility Notes

  • Intel XPU Only: This environment targets Intel Arc, Flex, and Data Center Max GPUs. NVIDIA CUDA GPUs are not supported.
  • Hybrid Inference: When GPU memory is insufficient, layers can be offloaded to CPU RAM transparently via IPEX-LLM hybrid mode.
  • Lookahead Decoding: Speculative decoding generates multiple candidate tokens per step for improved throughput, requiring additional GPU memory for the speculation buffer.
  • Warmup Required: The first `model.generate()` call triggers SYCL kernel compilation and acts as a warmup; take timing measurements from subsequent calls only.
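The warmup note can be applied with a small timing harness that discards the first call. The sketch below uses only the standard library; `timed_runs` is a hypothetical helper, and `generate_fn` stands for any zero-argument callable that wraps your `model.generate()` call.

```python
import time


def timed_runs(generate_fn, warmup=1, runs=3):
    """Run generate_fn warmup + runs times; return seconds for the measured runs only."""
    for _ in range(warmup):
        generate_fn()  # discarded: includes one-time kernel compilation
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        generate_fn()
        timings.append(time.perf_counter() - start)
    return timings
```

On an XPU device, the wrapped callable should probably synchronize the device (for instance with `torch.xpu.synchronize()`) before returning, so that asynchronously queued work is included in each measurement.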
