Workflow:Intel Ipex llm vLLM Serving

Knowledge Sources	IPEX-LLM vLLM Quickstart, vLLM Project
Domains	LLMs, Serving, Inference
Last Updated	2026-02-09 04:00 GMT

Overview

End-to-end process for serving large language models (LLMs) on Intel XPU hardware using vLLM with the IPEX-LLM backend for high-throughput, low-latency inference.

Description

This workflow sets up a vLLM-based model serving pipeline accelerated by IPEX-LLM on Intel GPUs. It uses the IPEXLLMClass engine wrapper, which combines IPEX-LLM's low-bit quantization (FP8/FP6/FP4/INT4) with vLLM's continuous batching and PagedAttention for efficient multi-request serving. The workflow supports both offline batch inference and online OpenAI-compatible API serving, and uses tensor parallelism to distribute large models across multiple Intel GPUs. It covers Docker-based deployment, model loading with quantization, sampling parameter configuration, and throughput benchmarking.

Usage

Use this workflow when you need to deploy an LLM as a production inference service on Intel GPU infrastructure that handles concurrent requests with high throughput. It suits both single-GPU serving of smaller models and multi-GPU tensor-parallel serving of larger models (13B+ parameters), and it works with the Intel Arc, Flex, and Max GPU families.

Execution Steps

Step 1: Environment Setup

Prepare the serving environment by pulling the IPEX-LLM vLLM Docker image or by installing ipex-llm[xpu] together with the vLLM dependencies. Configure the Intel GPU runtime environment (oneAPI toolkit, Level Zero driver). For Docker deployments, mount the model directories into the container and configure GPU device passthrough.

Key considerations:

Docker images are available for both CPU and XPU (GPU) serving
The XPU Docker image requires --device /dev/dri passthrough for GPU access
Install ipex-llm[serving] or ipex-llm[xpu], depending on the hardware target
Source the oneAPI setvars.sh script for non-Docker installations
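
As a rough illustration of the Docker considerations above, the following sketch assembles a `docker run` command line with GPU passthrough and a model mount. The image tag, container name, and paths are assumptions chosen for illustration, not values confirmed by this workflow; check the IPEX-LLM documentation for the image that matches your hardware.

```python
import shlex


def build_docker_command(image, model_dir,
                         container_model_dir="/llm/models",
                         container_name="ipex-llm-vllm"):
    """Return the argv list for starting an XPU serving container.

    container_model_dir and container_name are hypothetical defaults.
    """
    return [
        "docker", "run", "-d",
        "--device", "/dev/dri",                       # GPU passthrough (from the list above)
        "-v", f"{model_dir}:{container_model_dir}",  # mount the host model directory
        "--name", container_name,
        image,
    ]


# Assumed image name and host path; substitute your own.
cmd = build_docker_command("intelanalytics/ipex-llm-serving-xpu:latest",
                           "/path/to/models")
print(shlex.join(cmd))
```

Building the command as a list keeps each argument separate, so it can be passed directly to `subprocess.run(cmd, check=True)` without shell quoting issues.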

Step 2: Model Selection and Quantization Configuration

Select the target model from the Hugging Face Hub or a local path and configure the quantization level. IPEX-LLM supports multiple low-bit formats (fp8, fp6, fp4, sym_int4) that can be specified via the load_in_low_bit parameter. Choose the quantization level based on the tradeoff between model quality and memory/speed requirements.

Key considerations:

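FP8 typically offers the best balance of output quality and compression for serving
SYM_INT4 gives the highest compression, which suits memory-constrained environments
enforce_eager=True is required on XPU because CUDA graphs are not supported there
max_model_len caps the context length per request and max_num_batched_tokens caps the tokens processed per scheduling step; together they bound memory allocation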
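
To make the memory side of this tradeoff concrete, here is a back-of-the-envelope sketch that estimates weight storage for each format. It assumes the nominal bit width per weight and ignores quantization scales, the KV cache, and activations, so real usage will be higher; the 7B parameter count is only an example.

```python
# Nominal bits per weight for each format (an approximation: per-group
# scales and other quantization metadata are not counted).
BITS_PER_WEIGHT = {"fp16": 16, "fp8": 8, "fp6": 6, "fp4": 4, "sym_int4": 4}


def estimate_weight_gib(num_params, low_bit):
    """Estimate weight memory in GiB for a model with num_params parameters."""
    bits = BITS_PER_WEIGHT[low_bit]
    return num_params * bits / 8 / 2**30


for fmt in BITS_PER_WEIGHT:
    print(f"{fmt:>8}: {estimate_weight_gib(7e9, fmt):5.1f} GiB")
```

Whatever memory remains after the weights are loaded is what the KV cache can use, which is why a lower-bit format can also raise the number of requests served concurrently.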

Step 3: Offline Batch Inference

For batch processing scenarios, create an LLM engine instance with IPEXLLMClass, configure SamplingParams (temperature, top_p, max_tokens), and submit a batch of prompts for parallel generation. Collect and process the RequestOutput objects containing generated text.

Key considerations:

IPEXLLMClass wraps vLLM's LLM class with IPEX-LLM optimization
SamplingParams controls generation behavior (temperature, top_p, top_k, etc.)
dtype should be set to "float16" for XPU compatibility
tensor_parallel_size enables multi-GPU distribution
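
The offline path can be sketched as follows. The import path `ipex_llm.vllm.xpu.engine`, the `device="xpu"` argument, and the numeric limits are assumptions based on common IPEX-LLM usage rather than details stated in this workflow; verify them against the version you install. Imports are deferred into `run_batch` so the file can be loaded on a machine without ipex-llm or vLLM.

```python
def build_engine_kwargs(model, low_bit="fp8", tensor_parallel_size=1):
    """Collect the engine arguments discussed in Steps 2 and 3."""
    return {
        "model": model,                  # Hub ID or local path
        "device": "xpu",                 # assumption: accepted by the pinned vLLM version
        "dtype": "float16",
        "enforce_eager": True,           # CUDA graphs are unavailable on XPU
        "load_in_low_bit": low_bit,
        "tensor_parallel_size": tensor_parallel_size,
        "max_model_len": 2000,           # illustrative limits, not recommendations
        "max_num_batched_tokens": 2000,
    }


def run_batch(model, prompts):
    """Generate completions for a list of prompts; requires Intel GPU hardware."""
    from vllm import SamplingParams
    from ipex_llm.vllm.xpu.engine import IPEXLLMClass  # assumed import path

    llm = IPEXLLMClass(**build_engine_kwargs(model))
    params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)
    outputs = llm.generate(prompts, params)
    # Each RequestOutput carries the prompt and a list of completions.
    return [(out.prompt, out.outputs[0].text) for out in outputs]


print(build_engine_kwargs("/llm/models/my-model", tensor_parallel_size=2))
```

On a configured machine, `run_batch("/llm/models/my-model", ["Hello, my name is"])` would return one `(prompt, text)` pair per prompt; the model path here is hypothetical.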

Step 4: Online API Serving

Launch the vLLM OpenAI-compatible API server using the IPEX-LLM serving entrypoint. The server exposes /v1/completions and /v1/chat/completions endpoints compatible with OpenAI client libraries. Configure host, port, model name, and serving parameters.

Key considerations:

Start the server with start-vllm-service.sh or with python -m and an entrypoint module under ipex_llm.vllm.xpu.entrypoints
Exposes OpenAI-compatible REST API endpoints
Supports streaming responses for real-time applications
GPU memory utilization can be tuned via the gpu_memory_utilization parameter
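
Because the endpoints follow the OpenAI wire format, a client needs nothing beyond an HTTP library. The sketch below builds a /v1/completions request and parses one line of a streamed response using only the standard library. The base URL (port 8000) and the model name are assumptions; use whatever host, port, and served model name you configured.

```python
import json
import urllib.request


def build_completion_request(base_url, model, prompt, max_tokens=64, stream=False):
    """Build a POST request for the OpenAI-compatible completions endpoint."""
    payload = {"model": model, "prompt": prompt,
               "max_tokens": max_tokens, "stream": stream}
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_sse_line(line):
    """Decode one server-sent-event line; return None for non-data lines and [DONE]."""
    line = line.strip()
    if not line.startswith("data:"):
        return None
    body = line[len("data:"):].strip()
    if body == "[DONE]":
        return None
    return json.loads(body)


def complete(base_url, model, prompt):
    """Send a non-streaming request; needs a running server."""
    req = build_completion_request(base_url, model, prompt)
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["text"]


# Hypothetical server address and model name, shown without contacting a server.
req = build_completion_request("http://localhost:8000", "my-model", "San Francisco is a")
print(req.full_url, req.data.decode("utf-8"))
```

The official OpenAI client libraries work the same way when pointed at the server's base URL; the raw HTTP version is shown only to make the request shape visible.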

Step 5: Performance Benchmarking

Run throughput and latency benchmarks to validate serving performance. The benchmark suite measures tokens per second, time to first token (TTFT), and request completion latency under various concurrency levels and prompt lengths.

Key considerations:

benchmark_vllm_throughput.py measures sustained throughput
benchmark_vllm_latency.py measures per-request latency
vllm_online_benchmark.py tests real-time serving under load
Compare results across quantization levels to find the optimal configuration
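
The metrics named in this step can be computed from per-request timestamps. The sketch below is not the implementation used by the benchmark scripts listed above; it is a minimal illustration of how TTFT, completion latency, and aggregate throughput relate, using a hypothetical `RequestTiming` record.

```python
from dataclasses import dataclass
import statistics


@dataclass
class RequestTiming:
    start: float         # when the request was sent (seconds)
    first_token: float   # when the first token arrived
    end: float           # when the last token arrived
    output_tokens: int   # number of generated tokens


def summarize(timings):
    """Aggregate TTFT, latency, and throughput over a set of requests."""
    ttfts = [t.first_token - t.start for t in timings]
    latencies = [t.end - t.start for t in timings]
    # Throughput uses wall-clock time, so overlapping requests count once.
    wall = max(t.end for t in timings) - min(t.start for t in timings)
    total_tokens = sum(t.output_tokens for t in timings)
    return {
        "mean_ttft_s": statistics.mean(ttfts),
        "mean_latency_s": statistics.mean(latencies),
        "max_latency_s": max(latencies),
        "throughput_tok_per_s": total_tokens / wall,
    }


sample = [RequestTiming(0.0, 0.5, 2.0, 100), RequestTiming(0.0, 1.5, 4.0, 100)]
print(summarize(sample))
```

Running the same summary for each quantization level and concurrency setting gives a like-for-like table for choosing a configuration.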

GitHub URL

Workflow Repository