Workflow:Intel Ipex llm vLLM Serving
| Knowledge Sources | |
|---|---|
| Domains | LLMs, Serving, Inference |
| Last Updated | 2026-02-09 04:00 GMT |
Overview
End-to-end process for serving Large Language Models on Intel XPU hardware using vLLM with the IPEX-LLM backend for high-throughput, low-latency inference.
Description
This workflow sets up a vLLM-based model serving pipeline accelerated by IPEX-LLM on Intel GPUs. It uses the IPEXLLMClass engine wrapper that integrates IPEX-LLM's low-bit quantization (FP8/FP6/FP4/INT4) with vLLM's continuous batching and PagedAttention for efficient multi-request serving. The workflow supports both offline batch inference and online OpenAI-compatible API serving, with tensor parallelism for distributing large models across multiple Intel GPUs. It covers Docker-based deployment, model loading with quantization, sampling parameter configuration, and throughput benchmarking.
Usage
Execute this workflow when you need to deploy an LLM as a production inference service on Intel GPU infrastructure that handles concurrent requests with high throughput. It is suitable for both single-GPU serving of smaller models and multi-GPU tensor-parallel serving of larger models (13B+), and it works with the Intel Arc, Flex, and Max GPU families.
Execution Steps
Step 1: Environment Setup
Prepare the serving environment by pulling the IPEX-LLM vLLM Docker image or by installing ipex-llm[xpu] together with the vLLM dependencies. Configure the Intel GPU runtime environment (oneAPI toolkit, Level Zero driver). For Docker deployments, mount the model directories and configure GPU device passthrough.
Key considerations:
- Docker images are available for both CPU and XPU (GPU) serving
- XPU Docker requires --device /dev/dri passthrough for GPU access
- Install ipex-llm[serving] or ipex-llm[xpu] depending on hardware target
- Source oneAPI setvars.sh for non-Docker installations
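As a rough sketch of the Docker path described above, the helper below assembles a `docker run` argument list with GPU passthrough and a model mount. The image name, container name, and mount paths are placeholders chosen for illustration, not values taken from this workflow; only the `--device /dev/dri` passthrough comes from the considerations listed above.

```python
import shlex


def build_docker_run(image, model_dir,
                     container_name="ipex-llm-vllm",   # hypothetical name
                     mount_point="/llm/models"):       # hypothetical path
    """Assemble a `docker run` command for XPU serving (illustrative only)."""
    return [
        "docker", "run", "-itd",
        "--name", container_name,
        "--device", "/dev/dri",               # GPU passthrough for XPU access
        "-v", f"{model_dir}:{mount_point}",   # mount the local model directory
        image,
    ]


# Placeholder image and path: substitute the image you actually pulled.
cmd = build_docker_run("<ipex-llm-vllm-xpu-image>", "/path/to/models")
print(shlex.join(cmd))
```

Building the command as a list rather than a single string keeps each argument separate, so it can be passed directly to `subprocess.run` without shell quoting issues.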
Step 2: Model Selection and Quantization Configuration
Select the target model from HuggingFace Hub or local path and configure the quantization level. IPEX-LLM supports multiple low-bit formats (fp8, fp6, fp4, sym_int4) that can be specified via the load_in_low_bit parameter. Choose the quantization level based on the tradeoff between model quality and memory/speed requirements.
Key considerations:
- FP8 typically offers the best balance between model quality and compression for serving
- SYM_INT4 provides maximum compression for memory-constrained environments
- enforce_eager=True is required on XPU, because CUDA graphs are not supported there
- max_model_len and max_num_batched_tokens control memory allocation
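To make the memory side of this tradeoff concrete, here is a back-of-the-envelope estimate of weight storage per format. It assumes the nominal bit width of each format (8, 6, or 4 bits per weight) and ignores quantization scales, the KV cache, and activations, so real usage will be higher. The function name and the 7B parameter count are illustrative.

```python
# Nominal bits per weight; per-group scales and other metadata are ignored (assumption).
BITS_PER_WEIGHT = {"fp16": 16, "fp8": 8, "fp6": 6, "fp4": 4, "sym_int4": 4}


def estimate_weight_gib(num_params, low_bit):
    """Approximate weight memory in GiB for a model with `num_params` parameters."""
    bits = BITS_PER_WEIGHT[low_bit]
    return num_params * bits / 8 / 2**30


# Example: a hypothetical 7B-parameter model at each format.
for fmt in BITS_PER_WEIGHT:
    print(f"{fmt:>8}: {estimate_weight_gib(7e9, fmt):.1f} GiB")
```

Whatever the weights do not consume is what remains for the KV cache, which is why max_model_len and max_num_batched_tokens need to be chosen together with the quantization level.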
Step 3: Offline Batch Inference
For batch processing scenarios, create an LLM engine instance with IPEXLLMClass, configure SamplingParams (temperature, top_p, max_tokens), and submit a batch of prompts for parallel generation. Collect and process the RequestOutput objects containing generated text.
Key considerations:
- IPEXLLMClass wraps vLLM's LLM class with IPEX-LLM optimization
- SamplingParams controls generation behavior (temperature, top_p, top_k, etc.)
- dtype should be set to "float16" for XPU compatibility
- tensor_parallel_size enables multi-GPU distribution
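A minimal sketch of the offline path might look like the following. It assumes IPEXLLMClass is importable from `ipex_llm.vllm.xpu.engine` and accepts a `device="xpu"` argument, which may differ between IPEX-LLM versions; the helper names, sampling values, and model path are placeholders. The heavy imports are deferred so the module can be loaded on a machine without an Intel GPU.

```python
def build_engine_kwargs(model, low_bit="fp8", tensor_parallel_size=1):
    """Collect the engine arguments discussed above in one place."""
    return {
        "model": model,
        "device": "xpu",                  # assumption: XPU device selector
        "dtype": "float16",               # recommended for XPU compatibility
        "enforce_eager": True,            # CUDA graphs are not supported on XPU
        "load_in_low_bit": low_bit,       # fp8, fp6, fp4 or sym_int4
        "tensor_parallel_size": tensor_parallel_size,
    }


def run_batch(prompts, model):
    """Generate completions for a batch of prompts (requires vLLM + IPEX-LLM)."""
    # Deferred imports: only needed when inference is actually run.
    from vllm import SamplingParams
    from ipex_llm.vllm.xpu.engine import IPEXLLMClass as LLM  # assumed import path

    llm = LLM(**build_engine_kwargs(model))
    params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)
    outputs = llm.generate(prompts, params)
    # Each RequestOutput carries the prompt and one or more completions.
    return [(out.prompt, out.outputs[0].text) for out in outputs]


# Usage on a configured XPU host (model path is a placeholder):
#   results = run_batch(["What is AI?"], "/path/to/model")
```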
Step 4: Online API Serving
Launch the vLLM OpenAI-compatible API server using the IPEX-LLM serving entrypoint. The server exposes /v1/completions and /v1/chat/completions endpoints compatible with OpenAI client libraries. Configure host, port, model name, and serving parameters.
Key considerations:
- Launch with start-vllm-service.sh or python -m ipex_llm.vllm.xpu.entrypoints.openai.api_server
- Exposes OpenAI-compatible REST API endpoints
- Supports streaming responses for real-time applications
- GPU memory utilization can be tuned via the gpu_memory_utilization parameter
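Because the endpoints follow the OpenAI REST conventions, a client needs nothing beyond an HTTP library. The sketch below builds a request for /v1/completions using only the standard library; the base URL, port, and served model name are assumptions and must match however the server was actually started.

```python
import json
import urllib.request


def build_completion_request(base_url, model, prompt, max_tokens=64, stream=False):
    """Build a POST request for the OpenAI-compatible /v1/completions endpoint."""
    payload = {
        "model": model,          # must match the served model name
        "prompt": prompt,
        "max_tokens": max_tokens,
        "stream": stream,        # True requests a streamed response
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder host, port, and model name.
req = build_completion_request("http://localhost:8000", "my-served-model", "What is AI?")

# To send it against a running server:
#   with urllib.request.urlopen(req) as resp:
#       print(json.load(resp)["choices"][0]["text"])
```

The same server can also be reached with the official OpenAI client libraries by pointing their base URL at the server's /v1 path.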
Step 5: Performance Benchmarking
Run throughput and latency benchmarks to validate serving performance. The benchmark suite measures tokens per second, time to first token (TTFT), and request completion latency under various concurrency levels and prompt lengths.
Key considerations:
- benchmark_vllm_throughput.py measures sustained throughput
- benchmark_vllm_latency.py measures per-request latency
- vllm_online_benchmark.py tests real-time serving under load
- Compare results across quantization levels to find the optimal configuration
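For reference, the metrics named above can be defined from per-request timestamps as in the sketch below. This is not how the listed benchmark scripts are implemented; it only illustrates one common way to compute TTFT, completion latency, and output throughput. The `RequestTiming` type and the sample numbers are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class RequestTiming:
    start: float         # time the request was sent (seconds)
    first_token: float   # time the first output token arrived
    end: float           # time the last output token arrived
    output_tokens: int   # number of generated tokens


def summarize(timings):
    """Aggregate TTFT, latency, and throughput over a set of requests."""
    n = len(timings)
    # Wall-clock span from the earliest send to the latest completion.
    wall = max(t.end for t in timings) - min(t.start for t in timings)
    return {
        "mean_ttft_s": sum(t.first_token - t.start for t in timings) / n,
        "mean_latency_s": sum(t.end - t.start for t in timings) / n,
        "output_tokens_per_s": sum(t.output_tokens for t in timings) / wall,
    }


# Two concurrent requests with made-up timings.
sample = [RequestTiming(0.0, 0.5, 2.0, 100), RequestTiming(0.0, 1.0, 4.0, 100)]
stats = summarize(sample)
print(stats)
```

Throughput is computed over the shared wall-clock window rather than per request, so it reflects the benefit of batching concurrent requests.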