Workflow:Intel Ipex llm vLLM Serving

From Leeroopedia


Knowledge Sources
Domains: LLMs, Serving, Inference
Last Updated: 2026-02-09 04:00 GMT

Overview

End-to-end process for serving large language models (LLMs) on Intel XPU hardware using vLLM with the IPEX-LLM backend, aimed at high-throughput, low-latency inference.

Description

This workflow sets up a vLLM-based model serving pipeline accelerated by IPEX-LLM on Intel GPUs. It uses the IPEXLLMClass engine wrapper that integrates IPEX-LLM's low-bit quantization (FP8/FP6/FP4/INT4) with vLLM's continuous batching and PagedAttention for efficient multi-request serving. The workflow supports both offline batch inference and online OpenAI-compatible API serving, with tensor parallelism for distributing large models across multiple Intel GPUs. It covers Docker-based deployment, model loading with quantization, sampling parameter configuration, and throughput benchmarking.

Usage

Use this workflow when you need to deploy an LLM as a production inference service on Intel GPU infrastructure that handles concurrent requests at high throughput. It suits both single-GPU serving of smaller models and multi-GPU tensor-parallel serving of larger models (13B+), and it works with the Intel Arc, Flex, and Max GPU families.

Execution Steps

Step 1: Environment Setup

Prepare the serving environment by pulling the IPEX-LLM vLLM Docker image or installing ipex-llm[xpu] with vllm dependencies. Configure Intel GPU runtime variables (oneAPI toolkit, Level Zero driver). For Docker deployments, mount model directories and configure GPU device passthrough.

Key considerations:

  • Docker images available for both CPU and XPU (GPU) serving
  • XPU Docker requires --device /dev/dri passthrough for GPU access
  • Install ipex-llm[serving] or ipex-llm[xpu] depending on hardware target
  • Source oneAPI setvars.sh for non-Docker installations
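
As a rough sketch of the Docker path, the snippet below assembles a docker run command with the /dev/dri passthrough and a model directory mount. The image name, container name, host path, and container mount point are hypothetical placeholders, not values taken from this workflow; substitute the image you actually pull.

```python
import shlex


def build_docker_run(image, host_model_dir,
                     container_model_dir="/llm/models",
                     container_name="ipex-llm-vllm"):
    """Return the argv list for a detached container with Intel GPU access.

    All default values here are illustrative assumptions.
    """
    return [
        "docker", "run", "-d",
        "--device", "/dev/dri",                           # GPU passthrough for XPU serving
        "-v", f"{host_model_dir}:{container_model_dir}",  # mount model weights
        "--name", container_name,
        image,
    ]


# Hypothetical image name; replace with the IPEX-LLM vLLM image you use.
cmd = build_docker_run("your-registry/ipex-llm-vllm-xpu:latest", "/path/to/models")
print(shlex.join(cmd))
```

Building the command as a list keeps each argument separate, so it can be passed directly to subprocess.run without shell quoting issues.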

Step 2: Model Selection and Quantization Configuration

Select the target model from HuggingFace Hub or local path and configure the quantization level. IPEX-LLM supports multiple low-bit formats (fp8, fp6, fp4, sym_int4) that can be specified via the load_in_low_bit parameter. Choose the quantization level based on the tradeoff between model quality and memory/speed requirements.

Key considerations:

  • FP8 generally offers a good balance between output quality and compression for serving
  • SYM_INT4 provides maximum compression for memory-constrained environments
  • enforce_eager=True is required on XPU because CUDA graphs are not supported there
  • max_model_len and max_num_batched_tokens control memory allocation
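
To make the memory side of the tradeoff concrete, here is a back-of-the-envelope estimate of weight storage per format. It assumes the nominal bit width implied by each format name and ignores quantization scales, the KV cache, and activations, so real usage will be higher. The helper name and the 7B parameter count are illustrative only.

```python
# Nominal bits per weight, inferred from the format names (an assumption;
# actual on-device layouts add per-group scales and other overhead).
BITS_PER_WEIGHT = {"fp16": 16, "fp8": 8, "fp6": 6, "fp4": 4, "sym_int4": 4}


def estimate_weight_gib(num_params, low_bit):
    """Approximate weight footprint in GiB for a given low-bit format."""
    if low_bit not in BITS_PER_WEIGHT:
        raise ValueError(f"unknown format: {low_bit!r}")
    total_bytes = num_params * BITS_PER_WEIGHT[low_bit] / 8
    return total_bytes / 2**30


# Example: a hypothetical 7-billion-parameter model.
for fmt in ("fp16", "fp8", "fp6", "sym_int4"):
    print(f"{fmt:>8}: {estimate_weight_gib(7e9, fmt):.2f} GiB")
```

Whatever memory remains after the weights are loaded is what max_model_len and max_num_batched_tokens have to fit into, which is why a lower-bit format can allow longer contexts or larger batches on the same GPU.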

Step 3: Offline Batch Inference

For batch processing scenarios, create an LLM engine instance with IPEXLLMClass, configure SamplingParams (temperature, top_p, max_tokens), and submit a batch of prompts for parallel generation. Collect and process the RequestOutput objects containing generated text.

Key considerations:

  • IPEXLLMClass wraps vLLM's LLM class with IPEX-LLM optimization
  • SamplingParams controls generation behavior (temperature, top_p, top_k, etc.)
  • dtype should be set to "float16" for XPU compatibility
  • tensor_parallel_size enables multi-GPU distribution
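
The offline flow might look roughly like the sketch below. The import path ipex_llm.vllm.xpu.engine, the device="xpu" argument, the default lengths, and the model path are assumptions not stated on this page, so check them against your installed IPEX-LLM version. The heavy imports sit inside the function so the configuration helper can be used on a machine without a GPU.

```python
SUPPORTED_LOW_BIT = ("fp8", "fp6", "fp4", "sym_int4")


def build_engine_kwargs(model, low_bit="fp8", tensor_parallel_size=1,
                        max_model_len=2048, max_num_batched_tokens=2048):
    """Collect engine arguments named in this workflow into one dict."""
    if low_bit not in SUPPORTED_LOW_BIT:
        raise ValueError(f"load_in_low_bit must be one of {SUPPORTED_LOW_BIT}")
    return {
        "model": model,
        "device": "xpu",             # assumption: XPU device selector
        "dtype": "float16",          # float16 for XPU compatibility
        "enforce_eager": True,       # required on XPU
        "load_in_low_bit": low_bit,
        "tensor_parallel_size": tensor_parallel_size,
        "max_model_len": max_model_len,
        "max_num_batched_tokens": max_num_batched_tokens,
    }


def run_offline_batch(prompts, engine_kwargs):
    """Generate completions for a batch of prompts; needs vLLM and IPEX-LLM."""
    from vllm import SamplingParams
    from ipex_llm.vllm.xpu.engine import IPEXLLMClass  # assumed import path

    sampling = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=128)
    llm = IPEXLLMClass(**engine_kwargs)
    outputs = llm.generate(prompts, sampling)
    # Each RequestOutput carries the prompt and a list of completions.
    return [(out.prompt, out.outputs[0].text) for out in outputs]


engine_kwargs = build_engine_kwargs("/path/to/your-model", low_bit="fp8")
print(engine_kwargs)
# On an Intel GPU host:
# results = run_offline_batch(["What is AI?"], engine_kwargs)
```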

Step 4: Online API Serving

Launch the vLLM OpenAI-compatible API server using the IPEX-LLM serving entrypoint. The server exposes /v1/completions and /v1/chat/completions endpoints compatible with OpenAI client libraries. Configure host, port, model name, and serving parameters.

Key considerations:

  • Uses start-vllm-service.sh or python -m ipex_llm.vllm.xpu.entrypoints
  • Exposes OpenAI-compatible REST API endpoints
  • Supports streaming responses for real-time applications
  • GPU memory utilization can be tuned via gpu_memory_utilization parameter
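
Once the server is running, any HTTP client can call it. The sketch below uses only the Python standard library to build a /v1/completions request; the base URL (localhost, port 8000) and the model name are assumptions and must match whatever the server was started with.

```python
import json
import urllib.request


def build_completion_request(base_url, model, prompt,
                             max_tokens=64, temperature=0.0):
    """Build a POST request for the OpenAI-compatible completions endpoint."""
    body = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=60):
    """Send the request and decode the JSON response (requires a live server)."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Assumed host, port, and served model name.
req = build_completion_request("http://localhost:8000", "your-model", "San Francisco is a")
print(req.full_url)
# With the server running:
# print(send(req)["choices"][0]["text"])
```

Because the endpoints follow the OpenAI schema, the official OpenAI client libraries should also work when pointed at the same base URL.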

Step 5: Performance Benchmarking

Run throughput and latency benchmarks to validate serving performance. The benchmark suite measures tokens per second, time to first token (TTFT), and request completion latency under various concurrency levels and prompt lengths.

Key considerations:

  • benchmark_vllm_throughput.py measures sustained throughput
  • benchmark_vllm_latency.py measures per-request latency
  • vllm_online_benchmark.py tests real-time serving under load
  • Compare results across quantization levels to find optimal configuration
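
For reference, the three metrics named above can be derived from per-request timestamps as in the sketch below. This is an illustrative calculation, not the logic of the benchmark scripts listed here; the RequestTiming record and its field names are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean, median


@dataclass
class RequestTiming:
    start: float         # time the request was sent (seconds)
    first_token: float   # time the first token arrived
    end: float           # time the final token arrived
    output_tokens: int   # number of generated tokens


def summarize(timings):
    """Aggregate throughput, TTFT, and completion latency over a run."""
    if not timings:
        raise ValueError("no timings recorded")
    # Wall-clock span of the whole run, so concurrent requests overlap.
    wall = max(t.end for t in timings) - min(t.start for t in timings)
    return {
        "tokens_per_second": sum(t.output_tokens for t in timings) / wall,
        "mean_ttft_s": mean(t.first_token - t.start for t in timings),
        "median_latency_s": median(t.end - t.start for t in timings),
    }


# Two concurrent requests with made-up timings.
sample = [
    RequestTiming(start=0.0, first_token=0.5, end=2.0, output_tokens=100),
    RequestTiming(start=0.0, first_token=1.0, end=4.0, output_tokens=200),
]
print(summarize(sample))
```

Running the same summary at each quantization level and concurrency setting gives directly comparable numbers for picking a configuration.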

GitHub URL

Workflow Repository