Environment: Intel IPEX-LLM vLLM XPU Serving Environment

Knowledge Sources	IPEX-LLM vLLM
Domains	Infrastructure, LLM_Serving
Last Updated	2026-02-09 12:00 GMT

Overview

Ubuntu 22.04 environment with Intel XPU, vLLM engine, and IPEX-LLM for low-bit quantized LLM serving via offline batch inference or OpenAI-compatible API.

Description

This environment provides an Intel XPU-accelerated context for LLM inference serving with the vLLM engine. It uses `ipex_llm.vllm.xpu.engine.IPEXLLMClass` as a drop-in replacement for vLLM's `LLM` class, enabling FP8/FP6/FP4/INT4 quantized inference on Intel GPUs. The stack supports both offline batch generation (via `llm.generate()`) and online serving through an OpenAI-compatible HTTP server. It requires the Intel oneAPI Base Toolkit with the Level Zero runtime and an Intel Arc, Flex, or Data Center Max GPU.

Usage

Use this environment for any vLLM Serving workflow including Offline Batch Inference and Online API Serving. It is the mandatory prerequisite for running the IPEX-LLM vLLM engine class and the OpenAI-compatible API server on Intel XPU hardware.
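For the Online API Serving path, a client talks to the server over HTTP using the OpenAI completions schema. The sketch below builds such a request with the Python standard library only. The base URL `http://localhost:8000`, the `/v1/completions` route, and the model name `YOUR_MODEL` are assumptions; they must match however the server was actually launched.

```python
import json
import urllib.request


def build_completion_request(prompt, model="YOUR_MODEL",
                             base_url="http://localhost:8000",
                             max_tokens=64, temperature=0.0):
    """Build (but do not send) an OpenAI-style completion request.

    The default host, port, and model name are placeholders.
    """
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_completion(request, timeout=60):
    """Send the request and return the text of the first completion."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["text"]
```

Keeping request construction separate from sending makes the payload easy to inspect before a server is running; `send_completion(build_completion_request("Hello"))` would perform the actual call.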

System Requirements

Category	Requirement	Notes
OS	Ubuntu 22.04 LTS	Intel oneAPI Base Toolkit 2025.0.1+ required
Hardware	Intel GPU (Arc/Flex/Max)	XPU device required; `device="xpu"` in LLM constructor
GPU Driver	Intel GPU drivers	`intel-opencl-icd`, `intel-level-zero-gpu` required
Distributed	Intel oneCCL	Required for tensor parallel serving across multiple GPUs

Dependencies

System Packages

Intel oneAPI Base Toolkit 2025.0.1+
`intel-opencl-icd`
`intel-level-zero-gpu`
`level-zero`, `level-zero-dev`
`g++-12`, `gcc-12`, `libnuma-dev`

Python Packages

`ipex-llm[xpu_2.6]` >= 2.3.0b0
`torch` == 2.6.0+xpu
`intel_extension_for_pytorch` == 2.6.10+xpu
`vllm` == 0.8.3 (custom patched for Intel XPU)
`transformers` == 4.53.2
`oneccl_bind_pt` == 2.6.0+xpu (for multi-GPU tensor parallel)
`transformers_stream_generator`
`einops`
`tiktoken`
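Because several of these packages are pinned to exact versions, a quick check of the active Python environment can catch drift before the engine is loaded. The following is a minimal sketch, not part of IPEX-LLM: the helper names are hypothetical, only the `==` pins from the list above are checked, and the comparison ignores any local build tag after `+` on the assumption that the patched vLLM build may report one.

```python
from importlib import metadata

# Exact pins copied from the list above. ipex-llm (a lower bound) and
# oneccl_bind_pt (multi-GPU only) are left out of this sketch.
PINNED = {
    "torch": "2.6.0+xpu",
    "intel_extension_for_pytorch": "2.6.10+xpu",
    "vllm": "0.8.3",
    "transformers": "4.53.2",
}


def installed_versions(names):
    """Return {name: version or None} for the given distribution names."""
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


def find_mismatches(installed, pinned=PINNED):
    """Compare base versions (the text before '+') and report differences.

    Returns {name: (expected, actual)} for missing or differing packages.
    """
    problems = {}
    for name, expected in pinned.items():
        actual = installed.get(name)
        if actual is None or actual.split("+")[0] != expected.split("+")[0]:
            problems[name] = (expected, actual)
    return problems
```

Calling `find_mismatches(installed_versions(PINNED))` returns an empty dict when every pinned package is present at the expected base version.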

Credentials

The following environment variables must be set:

`IPEX_LLM_FORCE_BATCH_FORWARD`: Set to `1` for batch forward optimization on XPU.
`VLLM_RPC_TIMEOUT`: Set to `100000` for extended RPC timeout in distributed serving.
`SYCL_CACHE_PERSISTENT`: Set to `1` for persistent SYCL compilation cache.
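A launcher script can verify these variables before starting the engine. This is a minimal sketch with a hypothetical helper name; it treats the values listed above as the only acceptable ones, so a deliberately larger `VLLM_RPC_TIMEOUT` would also be reported and the check should be relaxed if that is intended.

```python
import os

# Values taken from the list above.
EXPECTED_ENV = {
    "IPEX_LLM_FORCE_BATCH_FORWARD": "1",
    "VLLM_RPC_TIMEOUT": "100000",
    "SYCL_CACHE_PERSISTENT": "1",
}


def check_serving_env(env=None):
    """Return {name: (expected, actual)} for unset or differing variables."""
    env = os.environ if env is None else env
    return {
        name: (expected, env.get(name))
        for name, expected in EXPECTED_ENV.items()
        if env.get(name) != expected
    }


if __name__ == "__main__":
    # Print one line per variable that is missing or has another value.
    for name, (expected, actual) in check_serving_env().items():
        print(f"{name}: expected {expected!r}, found {actual!r}")
```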

Quick Install

# Source Intel oneAPI environment
source /opt/intel/oneapi/setvars.sh

# Install IPEX-LLM with XPU 2.6 support
pip install --pre --upgrade 'ipex-llm[xpu_2.6]>=2.3.0b0' --extra-index-url https://download.pytorch.org/whl/xpu

# Install Intel Extension for PyTorch
pip install intel-extension-for-pytorch==2.6.10+xpu --extra-index-url=https://pytorch-extension.intel.com/release-whl/stable/xpu/cn/

# Install serving dependencies
pip install transformers==4.53.2 transformers_stream_generator einops tiktoken

# Set runtime environment
export IPEX_LLM_FORCE_BATCH_FORWARD=1
export VLLM_RPC_TIMEOUT=100000
export SYCL_CACHE_PERSISTENT=1

Code Evidence

IPEX-LLM vLLM engine import from `offline_inference.py:34-35`:

from vllm import SamplingParams
from ipex_llm.vllm.xpu.engine import IPEXLLMClass as LLM

XPU device and low-bit configuration from `offline_inference.py:48-55`:

llm = LLM(model="YOUR_MODEL",
          device="xpu",
          dtype="float16",
          enforce_eager=True,
          load_in_low_bit="fp8",
          tensor_parallel_size=1,
          max_model_len=2000,
          max_num_batched_tokens=2000)
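Once the engine is constructed, offline batch inference goes through `llm.generate()`. The sketch below wraps that call in a small helper; it assumes the returned objects follow vLLM's `RequestOutput` shape (`prompt` and `outputs[0].text`). The helper names, the example prompt, and the sampling values are illustrative and not taken from `offline_inference.py`.

```python
def run_batch(llm, prompts, sampling_params):
    """Generate one completion per prompt and return (prompt, text) pairs.

    `llm` is any object exposing vLLM's generate(prompts, sampling_params).
    """
    results = llm.generate(prompts, sampling_params)
    return [(r.prompt, r.outputs[0].text) for r in results]


def demo_on_xpu():
    """Illustrative end-to-end run; requires the XPU environment above."""
    from vllm import SamplingParams
    from ipex_llm.vllm.xpu.engine import IPEXLLMClass as LLM

    llm = LLM(model="YOUR_MODEL", device="xpu", dtype="float16",
              enforce_eager=True, load_in_low_bit="fp8",
              tensor_parallel_size=1, max_model_len=2000,
              max_num_batched_tokens=2000)
    # Sampling values are placeholders, not recommendations.
    params = SamplingParams(temperature=0.8, top_p=0.95)
    for prompt, text in run_batch(llm, ["Hello, my name is"], params):
        print(f"{prompt!r} -> {text!r}")
```

`demo_on_xpu()` is only defined here, not called, so the module can be imported on a machine without an Intel GPU.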

Docker environment variables from `docker/llm/serving/xpu/docker/Dockerfile:102`:

ENV TZ=Asia/Shanghai PYTHONUNBUFFERED=1 SYCL_CACHE_PERSISTENT=1
ENV IPEX_LLM_FORCE_BATCH_FORWARD=1 VLLM_RPC_TIMEOUT=100000

Common Errors

Error Message	Cause	Solution
`ImportError: vllm not found`	vLLM not installed or incompatible version	Install IPEX-LLM with `[xpu_2.6]` extra which includes patched vLLM
`enforce_eager must be True`	Graph mode not supported on XPU	Set `enforce_eager=True` in LLM constructor
`VLLM RPC timeout`	Distributed serving timeout	Increase `VLLM_RPC_TIMEOUT` environment variable
`SYCL kernel compilation slow`	First-run JIT compilation	Set `SYCL_CACHE_PERSISTENT=1` for caching compiled kernels

Compatibility Notes

Intel XPU Only: The `IPEXLLMClass` engine wraps vLLM specifically for Intel XPU. It is not compatible with CUDA devices.
Eager Mode Required: `enforce_eager=True` must be set because CUDA graph capture is not supported on XPU.
Quantization Formats: Supports `fp8`, `fp6`, `fp4`, `sym_int4`, and `asym_int4` via the `load_in_low_bit` parameter.
vLLM Version: Uses a custom-patched vLLM 0.8.3 modified for Intel multi-Arc GPU support. Standard vLLM builds will not work.
Tensor Parallel: Multi-GPU serving requires oneCCL and torch-ccl (`oneccl_bind_pt`, built from source for XPU 2.6).
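The constraints above can be enforced before the engine is built. The following is a minimal sketch of a hypothetical helper (not part of IPEX-LLM or vLLM) that assembles constructor keyword arguments, rejects low-bit formats outside the list given here, and forces the XPU-specific settings. The library itself may accept other values; the list simply mirrors this page.

```python
# Formats named in the compatibility notes above.
SUPPORTED_LOW_BIT = ("fp8", "fp6", "fp4", "sym_int4", "asym_int4")


def build_llm_kwargs(model, load_in_low_bit="fp8", tensor_parallel_size=1,
                     **extra):
    """Return constructor kwargs that respect the notes above."""
    if load_in_low_bit not in SUPPORTED_LOW_BIT:
        raise ValueError(
            f"load_in_low_bit={load_in_low_bit!r} is not one of "
            f"{SUPPORTED_LOW_BIT}")
    if tensor_parallel_size < 1:
        raise ValueError("tensor_parallel_size must be >= 1")
    kwargs = dict(extra)
    # Applied last so callers cannot override the XPU-specific settings.
    kwargs.update(model=model, device="xpu", enforce_eager=True,
                  load_in_low_bit=load_in_low_bit,
                  tensor_parallel_size=tensor_parallel_size)
    return kwargs
```

The result is intended to be unpacked into the engine constructor, for example `LLM(**build_llm_kwargs("YOUR_MODEL", dtype="float16"))`.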
