
Environment:Intel Ipex llm vLLM XPU Serving Environment

From Leeroopedia


Knowledge Sources
Domains Infrastructure, LLM_Serving
Last Updated 2026-02-09 12:00 GMT

Overview

Ubuntu 22.04 environment with Intel XPU, vLLM engine, and IPEX-LLM for low-bit quantized LLM serving via offline batch inference or OpenAI-compatible API.

Description

This environment provides an Intel XPU-accelerated context for LLM inference serving using the vLLM engine. It uses `ipex_llm.vllm.xpu.engine.IPEXLLMClass` as a drop-in replacement for vLLM's `LLM` class, enabling FP8/FP6/FP4/INT4 quantized inference on Intel GPUs. The serving stack supports both offline batch generation (via `llm.generate()`) and online API serving via an OpenAI-compatible HTTP server. The environment requires the Intel OneAPI base toolkit with Level Zero runtime and an Intel Arc, Flex, or Data Center Max GPU.

Usage

Use this environment for any vLLM Serving workflow including Offline Batch Inference and Online API Serving. It is the mandatory prerequisite for running the IPEX-LLM vLLM engine class and the OpenAI-compatible API server on Intel XPU hardware.
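
For the Online API Serving workflow, a client only needs to speak the OpenAI-style HTTP protocol. The following is a minimal sketch using the standard library, not the project's own client. The server address, port, and served model name (`BASE_URL`, `MODEL_NAME`) are assumptions and must match however the server was actually started.

```python
import json
import urllib.request

# Assumptions (not from this page): server address, port, and served model name.
BASE_URL = "http://localhost:8000"
MODEL_NAME = "YOUR_MODEL"


def build_completion_request(prompt, max_tokens=64, temperature=0.0):
    """Build a POST request for the OpenAI-style /v1/completions endpoint."""
    payload = {
        "model": MODEL_NAME,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        url=f"{BASE_URL}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, timeout=60):
    """Send the request and return the text of the first completion."""
    req = build_completion_request(prompt)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["text"]
```

Calling `complete("What is an XPU?")` requires a running server. `build_completion_request` can be inspected on its own to confirm the payload before any traffic is sent.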

System Requirements

| Category | Requirement | Notes |
| --- | --- | --- |
| OS | Ubuntu 22.04 LTS | Intel OneAPI base toolkit 2025.0.1+ required |
| Hardware | Intel GPU (Arc/Flex/Max) | XPU device required; `device="xpu"` in LLM constructor |
| GPU Driver | Intel GPU drivers | `intel-opencl-icd`, `intel-level-zero-gpu` required |
| Distributed | Intel OneCCL | Required for tensor parallel serving across multiple GPUs |
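
Before loading a model, it can help to confirm that PyTorch actually sees an XPU device. The sketch below assumes an XPU-enabled PyTorch build that exposes the `torch.xpu` module; the helper name `xpu_status` is hypothetical. It degrades gracefully when PyTorch or XPU support is absent.

```python
def xpu_status():
    """Report whether PyTorch is installed and can see an Intel XPU device."""
    try:
        import torch
    except ImportError:
        return {"torch_installed": False, "xpu_available": False, "device_count": 0}

    # torch.xpu may be missing on builds without XPU support, so guard the lookup.
    xpu = getattr(torch, "xpu", None)
    available = bool(xpu is not None and xpu.is_available())
    count = xpu.device_count() if available else 0
    return {"torch_installed": True, "xpu_available": available, "device_count": count}


if __name__ == "__main__":
    print(xpu_status())
```

A `device_count` of 2 or more is what tensor parallel serving would need. The check says nothing about driver versions or OneCCL.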

Dependencies

System Packages

  • Intel OneAPI Base Toolkit 2025.0.1+
  • `intel-opencl-icd`
  • `intel-level-zero-gpu`
  • `level-zero`, `level-zero-dev`
  • `g++-12`, `gcc-12`, `libnuma-dev`

Python Packages

  • `ipex-llm[xpu_2.6]` >= 2.3.0b0
  • `torch` == 2.6.0+xpu
  • `intel_extension_for_pytorch` == 2.6.10+xpu
  • `vllm` == 0.8.3 (custom patched for Intel XPU)
  • `transformers` == 4.53.2
  • `oneccl_bind_pt` == 2.6.0+xpu (for multi-GPU tensor parallel)
  • `transformers_stream_generator`
  • `einops`
  • `tiktoken`
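
Because several of these packages are pinned to exact versions, a quick comparison against the installed distributions can catch drift. This is a sketch under assumptions: the distribution names are taken to match the names in the list above, a local build suffix such as `+something` on the patched vLLM wheel is tolerated, and `check_pins` is a hypothetical helper.

```python
from importlib import metadata

# Pins copied from the Python Packages list; distribution names are assumed
# to match the names shown there.
PINNED = {
    "torch": "2.6.0+xpu",
    "intel_extension_for_pytorch": "2.6.10+xpu",
    "vllm": "0.8.3",
    "transformers": "4.53.2",
    "oneccl_bind_pt": "2.6.0+xpu",
}


def installed_version(name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def check_pins(pins=PINNED, lookup=installed_version):
    """Return {package: (expected, found)} for every package that does not match."""
    problems = {}
    for name, expected in pins.items():
        found = lookup(name)
        # Accept an exact match, or the pinned version plus a local build suffix.
        ok = found is not None and (found == expected or found.startswith(expected + "+"))
        if not ok:
            problems[name] = (expected, found)
    return problems


if __name__ == "__main__":
    for pkg, (want, got) in check_pins().items():
        print(f"{pkg}: expected {want}, found {got}")
```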

Credentials

The following environment variables must be set:

  • `IPEX_LLM_FORCE_BATCH_FORWARD`: Set to `1` for batch forward optimization on XPU.
  • `VLLM_RPC_TIMEOUT`: Set to `100000` for extended RPC timeout in distributed serving.
  • `SYCL_CACHE_PERSISTENT`: Set to `1` for persistent SYCL compilation cache.
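
When the engine is launched from a Python script rather than a shell, the same variables can be filled in programmatically. The helper below is a sketch, not part of IPEX-LLM: `apply_runtime_env` is a hypothetical name, and it is assumed that the variables should be set before the engine is imported. Values already present are left untouched.

```python
import os

# Values taken from the list above.
REQUIRED_ENV = {
    "IPEX_LLM_FORCE_BATCH_FORWARD": "1",
    "VLLM_RPC_TIMEOUT": "100000",
    "SYCL_CACHE_PERSISTENT": "1",
}


def apply_runtime_env(env=os.environ):
    """Fill in missing variables without overriding values that are already set."""
    added = []
    for name, value in REQUIRED_ENV.items():
        if name not in env:
            env[name] = value
            added.append(name)
    return added


if __name__ == "__main__":
    print("added:", apply_runtime_env())
```

Not overriding existing values means that, for example, a deliberately larger `VLLM_RPC_TIMEOUT` survives.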

Quick Install

# Source Intel OneAPI environment
source /opt/intel/oneapi/setvars.sh

# Install IPEX-LLM with XPU 2.6 support
pip install --pre --upgrade 'ipex-llm[xpu_2.6]>=2.3.0b0' --extra-index-url https://download.pytorch.org/whl/xpu

# Install Intel Extension for PyTorch
pip install intel-extension-for-pytorch==2.6.10+xpu --extra-index-url=https://pytorch-extension.intel.com/release-whl/stable/xpu/cn/

# Install serving dependencies
pip install transformers==4.53.2 transformers_stream_generator einops tiktoken

# Set runtime environment
export IPEX_LLM_FORCE_BATCH_FORWARD=1
export VLLM_RPC_TIMEOUT=100000
export SYCL_CACHE_PERSISTENT=1

Code Evidence

IPEX-LLM vLLM engine import from `offline_inference.py:34-35`:

from vllm import SamplingParams
from ipex_llm.vllm.xpu.engine import IPEXLLMClass as LLM

XPU device and low-bit configuration from `offline_inference.py:48-55`:

llm = LLM(model="YOUR_MODEL",
          device="xpu",
          dtype="float16",
          enforce_eager=True,
          load_in_low_bit="fp8",
          tensor_parallel_size=1,
          max_model_len=2000,
          max_num_batched_tokens=2000)
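
The constructor call above can be extended into a complete offline batch run along the following lines. This is a sketch rather than the contents of `offline_inference.py`: the helper names and the sampling values are illustrative assumptions, and `run_offline_batch` only works inside the XPU environment described on this page.

```python
def engine_kwargs(model, low_bit="fp8", tensor_parallel_size=1, max_len=2000):
    """Collect the constructor arguments shown above into one dict."""
    return {
        "model": model,
        "device": "xpu",
        "dtype": "float16",
        "enforce_eager": True,
        "load_in_low_bit": low_bit,
        "tensor_parallel_size": tensor_parallel_size,
        "max_model_len": max_len,
        "max_num_batched_tokens": max_len,
    }


def run_offline_batch(prompts, model="YOUR_MODEL"):
    """Generate one completion per prompt and return (prompt, text) pairs."""
    # Imported lazily so engine_kwargs stays usable without the serving stack.
    from vllm import SamplingParams
    from ipex_llm.vllm.xpu.engine import IPEXLLMClass as LLM

    llm = LLM(**engine_kwargs(model))
    params = SamplingParams(temperature=0.8, top_p=0.95)  # illustrative values
    outputs = llm.generate(prompts, params)
    return [(out.prompt, out.outputs[0].text) for out in outputs]
```

Keeping the arguments in a plain dict makes it easy to switch `load_in_low_bit` or `tensor_parallel_size` between runs without touching the generation code.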

Docker environment variables from `docker/llm/serving/xpu/docker/Dockerfile:102`:

ENV TZ=Asia/Shanghai PYTHONUNBUFFERED=1 SYCL_CACHE_PERSISTENT=1
ENV IPEX_LLM_FORCE_BATCH_FORWARD=1 VLLM_RPC_TIMEOUT=100000

Common Errors

| Error Message | Cause | Solution |
| --- | --- | --- |
| `ImportError: vllm not found` | vLLM not installed or incompatible version | Install IPEX-LLM with `[xpu_2.6]` extra which includes patched vLLM |
| `enforce_eager must be True` | Graph mode not supported on XPU | Set `enforce_eager=True` in LLM constructor |
| `VLLM RPC timeout` | Distributed serving timeout | Increase `VLLM_RPC_TIMEOUT` environment variable |
| `SYCL kernel compilation slow` | First-run JIT compilation | Set `SYCL_CACHE_PERSISTENT=1` for caching compiled kernels |

Compatibility Notes

  • Intel XPU Only: The `IPEXLLMClass` engine wraps vLLM specifically for Intel XPU. It is not compatible with CUDA devices.
  • Eager Mode Required: `enforce_eager=True` must be set as CUDA graph capture is not supported on XPU.
  • Quantization Formats: Supports `fp8`, `fp6`, `fp4`, `sym_int4`, `asym_int4` via `load_in_low_bit` parameter.
  • vLLM Version: Uses a custom-patched vLLM 0.8.3 specifically modified for Intel multi-Arc GPU support. Standard vLLM builds will not work.
  • Tensor Parallel: Multi-GPU serving requires OneCCL and torch-ccl (the source of the `oneccl_bind_pt` package), built from source for XPU 2.6.
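
The constraints in these notes can be checked before the engine is constructed, which avoids a slow model load that then fails. The validator below is a hypothetical helper, not part of IPEX-LLM. It knows only the quantization formats named on this page, so the engine may accept values it rejects.

```python
# Formats named in the Compatibility Notes; possibly not exhaustive.
SUPPORTED_LOW_BIT = {"fp8", "fp6", "fp4", "sym_int4", "asym_int4"}


def validate_engine_args(load_in_low_bit, enforce_eager, device="xpu"):
    """Return a list of human-readable problems; an empty list means no issue found."""
    problems = []
    if device != "xpu":
        problems.append(f"device={device!r}: this engine targets Intel XPU only")
    if not enforce_eager:
        problems.append("enforce_eager must be True on XPU")
    if load_in_low_bit not in SUPPORTED_LOW_BIT:
        problems.append(
            f"load_in_low_bit={load_in_low_bit!r} is not one of {sorted(SUPPORTED_LOW_BIT)}"
        )
    return problems


if __name__ == "__main__":
    for problem in validate_engine_args("fp8", enforce_eager=False):
        print(problem)
```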
