Environment: Intel IPEX-LLM XPU Serving Environment
| Knowledge Sources | Details |
|---|---|
| Domains | Infrastructure, LLM_Serving |
| Last Updated | 2026-02-09 04:00 GMT |
Overview
Intel XPU GPU environment for serving LLMs via FastAPI/REST endpoints with IPEX-LLM, supporting both lightweight serving and DeepSpeed AutoTP-based tensor parallel serving.
Description
This environment provides an Intel XPU-accelerated context for serving LLMs through HTTP REST endpoints using FastAPI and Uvicorn. It supports two serving modes: lightweight single-GPU serving with IPEX-LLM optimizations, and multi-GPU tensor parallel serving using DeepSpeed AutoTP. The stack uses `ipex-llm[xpu,serving]` as the core acceleration library, with FastAPI providing the HTTP layer and Uvicorn as the ASGI server. DeepSpeed integration is optional and required only for tensor parallel deployments across multiple Intel GPUs.
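Once a server is up, the REST layer can be exercised with a plain HTTP client. The sketch below is a minimal stdlib-only client, assuming a hypothetical `/generate` endpoint that accepts a JSON body with `prompt` and `max_new_tokens` fields; the actual route name and schema depend on how the FastAPI app is defined.

```python
import json
import urllib.request


def build_generate_request(prompt, max_new_tokens=64,
                           url="http://localhost:8000/generate"):
    """Build a POST request for a hypothetical /generate endpoint.

    The route name and JSON schema here are illustrative assumptions,
    not a fixed IPEX-LLM API -- adapt them to your FastAPI app.
    """
    payload = json.dumps(
        {"prompt": prompt, "max_new_tokens": max_new_tokens}
    ).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    req = build_generate_request("What is tensor parallelism?", max_new_tokens=32)
    # Requires a running server; uncomment to actually send the request.
    # with urllib.request.urlopen(req) as resp:
    #     print(json.loads(resp.read()))
```

Because FastAPI validates request bodies, a mismatched schema returns a 422 response rather than failing silently, which makes this kind of thin client easy to debug.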
Usage
Use this environment for any FastAPI-based LLM serving or DeepSpeed AutoTP serving workflow that requires Intel XPU acceleration. It is the mandatory prerequisite for running lightweight REST-based LLM inference endpoints and for distributed tensor parallel serving with DeepSpeed on Intel GPUs.
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| OS | Ubuntu 22.04 LTS | Intel OneAPI base toolkit required |
| Hardware | Intel GPU (Arc/Flex/Max) | XPU device; multiple GPUs needed for tensor parallel |
| GPU Driver | Intel GPU drivers | Level Zero runtime required |
| Distributed | DeepSpeed (optional) | Required for AutoTP multi-GPU serving |
Dependencies
System Packages
- Intel OneAPI Base Toolkit
- `intel-opencl-icd`
- `intel-level-zero-gpu`
- `level-zero`
Python Packages
- `ipex-llm[xpu,serving]` (pre-release)
- `torch` (XPU variant)
- `intel_extension_for_pytorch` (XPU variant)
- `transformers`
- `fastapi`
- `uvicorn`
- `deepspeed` (optional, for AutoTP tensor parallel serving)
- `oneccl_bind_pt` (optional, for multi-GPU communication)
Credentials
No API keys or tokens are required for local serving. The following runtime configuration may be needed:
- `SYCL_CACHE_PERSISTENT`: Set to `1` for persistent SYCL compilation cache (faster startup).
- `MASTER_PORT`: Communication port for distributed DeepSpeed serving (default: 29500).
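Both variables can be exported before launching the server; the values shown are the defaults noted above.

```shell
# Persist the SYCL kernel compilation cache across runs (faster startup)
export SYCL_CACHE_PERSISTENT=1

# Rendezvous port for distributed DeepSpeed AutoTP serving
export MASTER_PORT=29500
```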
Quick Install
```shell
# Source Intel OneAPI environment
source /opt/intel/oneapi/setvars.sh

# Install IPEX-LLM with XPU and serving support
# (extras are quoted so the brackets survive shells like zsh)
pip install --pre --upgrade "ipex-llm[xpu,serving]" --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/

# Install serving dependencies
pip install fastapi uvicorn transformers

# For DeepSpeed AutoTP tensor parallel serving
pip install deepspeed oneccl_bind_pt -f https://developer.intel.com/ipex-whl-stable

# Set runtime environment
export SYCL_CACHE_PERSISTENT=1
```
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `RuntimeError: No XPU device found` | Intel GPU drivers not installed | Install Intel GPU drivers and Level Zero runtime |
| `ModuleNotFoundError: No module named 'fastapi'` | FastAPI not installed | `pip install fastapi uvicorn` |
| `DeepSpeed AutoTP initialization failed` | DeepSpeed or OneCCL not configured | Install DeepSpeed and source OneCCL environment variables |
Compatibility Notes
- Intel XPU Only: This environment targets Intel Arc, Flex, and Data Center Max GPUs. NVIDIA CUDA GPUs are not supported.
- Lightweight vs AutoTP: Lightweight serving runs on a single GPU; AutoTP requires multiple GPUs and DeepSpeed for tensor parallelism.
- FastAPI + Uvicorn: The serving layer uses standard Python ASGI tooling, allowing integration with any FastAPI-compatible middleware.
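The lightweight-versus-AutoTP distinction above can be captured in a small dispatch helper. This is illustrative decision logic, not part of IPEX-LLM; the mode names and the `gpu_count` probe are assumptions.

```python
def choose_serving_mode(gpu_count: int) -> str:
    """Pick a serving mode from the number of visible Intel XPUs.

    Illustrative only: "lightweight" means single-GPU IPEX-LLM serving,
    "autotp" means DeepSpeed AutoTP tensor parallel serving.
    """
    if gpu_count < 1:
        raise RuntimeError("No XPU device found -- install Intel GPU drivers")
    return "lightweight" if gpu_count == 1 else "autotp"


# In a real deployment, gpu_count would come from torch.xpu.device_count()
# after importing intel_extension_for_pytorch.
```

Keeping this choice explicit at startup avoids silently launching DeepSpeed on a single-GPU host, where AutoTP adds overhead without benefit.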