
Environment: Intel IPEX-LLM XPU Serving Environment

From Leeroopedia


Knowledge Sources

  • Domains: Infrastructure, LLM_Serving
  • Last Updated: 2026-02-09 04:00 GMT

Overview

Intel XPU GPU environment for serving LLMs via FastAPI/REST endpoints with IPEX-LLM, supporting both lightweight serving and DeepSpeed AutoTP-based tensor parallel serving.

Description

This environment provides an Intel XPU-accelerated context for serving LLMs through HTTP REST endpoints using FastAPI and Uvicorn. It supports two serving modes: lightweight single-GPU serving with IPEX-LLM optimizations, and multi-GPU tensor parallel serving using DeepSpeed AutoTP. The stack uses `ipex-llm[xpu,serving]` as the core acceleration library, with FastAPI providing the HTTP layer and Uvicorn as the ASGI server. DeepSpeed integration is optional and required only for tensor parallel deployments across multiple Intel GPUs.

Usage

Use this environment for any FastAPI LLM Serving or DeepSpeed AutoTP Serving workflow that requires Intel XPU acceleration. It is the mandatory prerequisite for running lightweight REST-based LLM inference endpoints and distributed tensor parallel serving with DeepSpeed on Intel GPUs.

System Requirements

Category      Requirement                  Notes
OS            Ubuntu 22.04 LTS             Intel OneAPI Base Toolkit required
Hardware      Intel GPU (Arc/Flex/Max)     XPU device; multiple GPUs needed for tensor parallel
GPU Driver    Intel GPU drivers            Level Zero runtime required
Distributed   DeepSpeed (optional)         Required for AutoTP multi-GPU serving

Dependencies

System Packages

  • Intel OneAPI Base Toolkit
  • `intel-opencl-icd`
  • `intel-level-zero-gpu`
  • `level-zero`

Python Packages

  • `ipex-llm[xpu,serving]` (pre-release)
  • `torch` (XPU variant)
  • `intel_extension_for_pytorch` (XPU variant)
  • `transformers`
  • `fastapi`
  • `uvicorn`
  • `deepspeed` (optional, for AutoTP tensor parallel serving)
  • `oneccl_bind_pt` (optional, for multi-GPU communication)

Credentials

No API keys or tokens are required for local serving. The following runtime configuration may be needed:

  • `SYCL_CACHE_PERSISTENT`: Set to `1` for persistent SYCL compilation cache (faster startup).
  • `MASTER_PORT`: Communication port for distributed DeepSpeed serving (default: 29500).

Quick Install

# Source Intel OneAPI environment
source /opt/intel/oneapi/setvars.sh

# Install IPEX-LLM with XPU and serving support
pip install --pre --upgrade "ipex-llm[xpu,serving]" --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/

# Install serving dependencies
pip install fastapi uvicorn transformers

# For DeepSpeed AutoTP tensor parallel serving
pip install deepspeed oneccl_bind_pt -f https://developer.intel.com/ipex-whl-stable

# Set runtime environment
export SYCL_CACHE_PERSISTENT=1
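After installation, a quick way to confirm the XPU build of PyTorch sees an Intel GPU is a small availability check. This assumes the XPU variant of `torch` from the install step above; the `hasattr` guard lets the check degrade gracefully on builds without the `torch.xpu` backend.

```python
# Sanity check: does this PyTorch build expose a usable XPU device?
import torch

def xpu_ready() -> bool:
    # torch.xpu exists only in XPU-enabled (or recent) PyTorch builds,
    # so guard the attribute before querying availability.
    return hasattr(torch, "xpu") and torch.xpu.is_available()

print("XPU available:", xpu_ready())
```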

Common Errors

Error Message                                      Cause                               Solution
`RuntimeError: No XPU device found`                Intel GPU drivers not installed     Install Intel GPU drivers and the Level Zero runtime
`ModuleNotFoundError: No module named 'fastapi'`   FastAPI not installed               `pip install fastapi uvicorn`
`DeepSpeed AutoTP initialization failed`           DeepSpeed or OneCCL not configured  Install DeepSpeed and source the OneCCL environment variables

Compatibility Notes

  • Intel XPU Only: This environment targets Intel Arc, Flex, and Data Center Max GPUs. NVIDIA CUDA GPUs are not supported.
  • Lightweight vs AutoTP: Lightweight serving runs on a single GPU; AutoTP requires multiple GPUs and DeepSpeed for tensor parallelism.
  • FastAPI + Uvicorn: The serving layer uses standard Python ASGI tooling, allowing integration with any FastAPI-compatible middleware.
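For the AutoTP mode, a multi-GPU launch typically sources the OneAPI and OneCCL environments and starts one rank per GPU. The following is a hedged sketch only: the script name `serving.py` and the rank count are placeholders, not fixed names from this environment.

```shell
# Hypothetical DeepSpeed AutoTP launch across 2 Intel GPUs.
# serving.py and -np 2 are placeholders for the actual server script and GPU count.
source /opt/intel/oneapi/setvars.sh
export SYCL_CACHE_PERSISTENT=1
export MASTER_PORT=29500   # distributed communication port (default shown above)
mpirun -np 2 python serving.py
```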
