Implementation:Intel Ipex llm vLLM API Server

Knowledge Sources	IPEX-LLM vLLM Documentation
Domains	NLP, Serving, API
Last Updated	2026-02-09 00:00 GMT

Overview

External tool for launching an OpenAI-compatible API server using IPEX-LLM's vLLM entrypoint on Intel XPU.

Description

This is an External Tool Doc for the ipex_llm.vllm.xpu.entrypoints.openai.api_server module. It is launched as a Python module from the command line with parameters specifying the model, device, quantization, parallelism, and serving configuration. The server exposes OpenAI-compatible /v1/completions and /v1/chat/completions endpoints.

Usage

Use to deploy an LLM as a production HTTP API server on Intel XPU. Launch via the start-vllm-service.sh script or directly with the Python module command.

Code Reference

Source Location

Repository: IPEX-LLM
File: docker/llm/serving/xpu/docker/start-vllm-service.sh
Lines: 40-56

Signature

python -m ipex_llm.vllm.xpu.entrypoints.openai.api_server \
    --served-model-name MODEL_NAME \
    --port 8000 \
    --model MODEL_PATH \
    --device xpu \
    --dtype float16 \
    --enforce-eager \
    --load-in-low-bit fp8 \
    --tensor-parallel-size 1 \
    --max-model-len 4096 \
    --max-num-seqs 256 \
    --gpu-memory-utilization 0.95

Import

# Launched as a module, not imported
python -m ipex_llm.vllm.xpu.entrypoints.openai.api_server

I/O Contract

Inputs

Name	Type	Required	Description
--model	str	Yes	HuggingFace model ID or local path
--port	int	No	Listening port (default 8000)
--load-in-low-bit	str	No	Quantization format (default "fp8")
--tensor-parallel-size	int	No	Number of GPUs for TP (default 1)
--max-model-len	int	No	Max sequence length
--max-num-seqs	int	No	Max concurrent sequences (default 256)
--gpu-memory-utilization	float	No	GPU memory fraction (default 0.95)

Outputs

Name	Type	Description
HTTP API	REST API	OpenAI-compatible API at http://host:PORT/v1/

Usage Examples

# Set environment variables
export USE_XETLA=OFF
export SYCL_CACHE_PERSISTENT=1

# Launch vLLM API server
python -m ipex_llm.vllm.xpu.entrypoints.openai.api_server \
    --served-model-name "llama2-7b" \
    --port 8000 \
    --model /models/Llama-2-7b-chat-hf \
    --device xpu \
    --dtype float16 \
    --enforce-eager \
    --load-in-low-bit fp8 \
    --max-model-len 4096

# Client usage (from another terminal)
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "llama2-7b", "prompt": "Hello, world!", "max_tokens": 100}'

Related Pages

Implements Principle

Principle:Intel_Ipex_llm_Online_API_Serving

Requires Environment

Environment:Intel_Ipex_llm_vLLM_XPU_Serving_Environment

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment