Environment: Intel IPEX-LLM XPU Serving Environment
| Knowledge Sources | Details |
|---|---|
| Domains | Infrastructure, LLM_Serving |
| Last Updated | 2026-02-09 04:00 GMT |
Overview
Intel XPU GPU environment for serving LLMs via FastAPI/REST endpoints with IPEX-LLM, supporting both lightweight serving and DeepSpeed AutoTP-based tensor parallel serving.
Description
This environment provides an Intel XPU-accelerated context for serving LLMs through HTTP REST endpoints using FastAPI and Uvicorn. It supports two serving modes: lightweight single-GPU serving with IPEX-LLM optimizations, and multi-GPU tensor parallel serving using DeepSpeed AutoTP. The stack uses `ipex-llm[xpu,serving]` as the core acceleration library, with FastAPI providing the HTTP layer and Uvicorn as the ASGI server. DeepSpeed integration is optional and required only for tensor parallel deployments across multiple Intel GPUs.
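Once a server is up, the REST layer can be exercised with a plain HTTP client. The sketch below is a minimal stdlib-only client, assuming a hypothetical `/generate` endpoint that accepts a JSON body with `prompt` and `max_new_tokens` fields; the actual route name and schema depend on how the FastAPI app is defined.

```python
import json
import urllib.request


def build_generate_request(prompt, max_new_tokens=64,
                           url="http://localhost:8000/generate"):
    """Build a POST request for a hypothetical /generate endpoint.

    The route name and JSON schema here are illustrative assumptions,
    not a fixed IPEX-LLM API -- adapt them to your FastAPI app.
    """
    payload = json.dumps(
        {"prompt": prompt, "max_new_tokens": max_new_tokens}
    ).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    req = build_generate_request("What is tensor parallelism?", max_new_tokens=32)
    # Requires a running server; uncomment to actually send the request.
    # with urllib.request.urlopen(req) as resp:
    #     print(json.loads(resp.read()))
```

Because FastAPI validates request bodies, a mismatched schema returns a 422 response rather than failing silently, which makes this kind of thin client easy to debug.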
Usage
Use this environment for any FastAPI-based LLM serving or DeepSpeed AutoTP serving workflow that requires Intel XPU acceleration. It is the mandatory prerequisite for running lightweight REST-based LLM inference endpoints and for distributed tensor parallel serving with DeepSpeed on Intel GPUs.
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| OS | Ubuntu 22.04 LTS | Intel OneAPI base toolkit required |
| Hardware | Intel GPU (Arc/Flex/Max) | XPU device; multiple GPUs needed for tensor parallel |
| GPU Driver | Intel GPU drivers | Level Zero runtime required |
| Distributed | DeepSpeed (optional) | Required for AutoTP multi-GPU serving |
Dependencies
System Packages
- Intel OneAPI Base Toolkit
- `intel-opencl-icd`
- `intel-level-zero-gpu`
- `level-zero`
Python Packages
- `ipex-llm[xpu,serving]` (pre-release)
- `torch` (XPU variant)
- `intel_extension_for_pytorch` (XPU variant)
- `transformers`
- `fastapi`
- `uvicorn`
- `deepspeed` (optional, for AutoTP tensor parallel serving)
- `oneccl_bind_pt` (optional, for multi-GPU communication)
Credentials
No API keys or tokens are required for local serving. The following runtime configuration may be needed:
- `SYCL_CACHE_PERSISTENT`: Set to `1` for persistent SYCL compilation cache (faster startup).
- `MASTER_PORT`: Communication port for distributed DeepSpeed serving (default: 29500).
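Both variables can be exported before launching the server; the values shown are the defaults noted above.

```shell
# Persist the SYCL kernel compilation cache across runs (faster startup)
export SYCL_CACHE_PERSISTENT=1

# Rendezvous port for distributed DeepSpeed AutoTP serving
export MASTER_PORT=29500
```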
Quick Install
```shell
# Source Intel OneAPI environment
source /opt/intel/oneapi/setvars.sh

# Install IPEX-LLM with XPU and serving support
# (extras are quoted so the brackets survive shells like zsh)
pip install --pre --upgrade "ipex-llm[xpu,serving]" --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/

# Install serving dependencies
pip install fastapi uvicorn transformers

# For DeepSpeed AutoTP tensor parallel serving
pip install deepspeed oneccl_bind_pt -f https://developer.intel.com/ipex-whl-stable

# Set runtime environment
export SYCL_CACHE_PERSISTENT=1
```
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `RuntimeError: No XPU device found` | Intel GPU drivers not installed | Install Intel GPU drivers and Level Zero runtime |
| `ModuleNotFoundError: No module named 'fastapi'` | FastAPI not installed | `pip install fastapi uvicorn` |
| `DeepSpeed AutoTP initialization failed` | DeepSpeed or OneCCL not configured | Install DeepSpeed and source OneCCL environment variables |
Compatibility Notes
- Intel XPU Only: This environment targets Intel Arc, Flex, and Data Center Max GPUs. NVIDIA CUDA GPUs are not supported.
- Lightweight vs AutoTP: Lightweight serving runs on a single GPU; AutoTP requires multiple GPUs and DeepSpeed for tensor parallelism.
- FastAPI + Uvicorn: The serving layer uses standard Python ASGI tooling, allowing integration with any FastAPI-compatible middleware.
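The lightweight-versus-AutoTP distinction above can be captured in a small dispatch helper. This is illustrative decision logic, not part of IPEX-LLM; the mode names and the `gpu_count` probe are assumptions.

```python
def choose_serving_mode(gpu_count: int) -> str:
    """Pick a serving mode from the number of visible Intel XPUs.

    Illustrative only: "lightweight" means single-GPU IPEX-LLM serving,
    "autotp" means DeepSpeed AutoTP tensor parallel serving.
    """
    if gpu_count < 1:
        raise RuntimeError("No XPU device found -- install Intel GPU drivers")
    return "lightweight" if gpu_count == 1 else "autotp"


# In a real deployment, gpu_count would come from torch.xpu.device_count()
# after importing intel_extension_for_pytorch.
```

Keeping this choice explicit at startup avoids silently launching DeepSpeed on a single-GPU host, where AutoTP adds overhead without benefit.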