Implementation:Intel Ipex llm vLLM API Server
| Knowledge Sources | |
|---|---|
| Domains | NLP, Serving, API |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
External tool for launching an OpenAI-compatible API server using IPEX-LLM's vLLM entrypoint on Intel XPU.
Description
This is an External Tool Doc for the ipex_llm.vllm.xpu.entrypoints.openai.api_server module. It is launched as a Python module from the command line with parameters specifying the model, device, quantization, parallelism, and serving configuration. The server exposes OpenAI-compatible /v1/completions and /v1/chat/completions endpoints.
Usage
Use to deploy an LLM as a production HTTP API server on Intel XPU. Launch via the start-vllm-service.sh script or directly with the Python module command.
Code Reference
Source Location
- Repository: IPEX-LLM
- File: docker/llm/serving/xpu/docker/start-vllm-service.sh
- Lines: 40-56
Signature
python -m ipex_llm.vllm.xpu.entrypoints.openai.api_server \
--served-model-name MODEL_NAME \
--port 8000 \
--model MODEL_PATH \
--device xpu \
--dtype float16 \
--enforce-eager \
--load-in-low-bit fp8 \
--tensor-parallel-size 1 \
--max-model-len 4096 \
--max-num-seqs 256 \
--gpu-memory-utilization 0.95
Import
# Launched as a module, not imported
python -m ipex_llm.vllm.xpu.entrypoints.openai.api_server
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| --model | str | Yes | HuggingFace model ID or local path |
| --port | int | No | Listening port (default 8000) |
| --load-in-low-bit | str | No | Quantization format (default "fp8") |
| --tensor-parallel-size | int | No | Number of GPUs for TP (default 1) |
| --max-model-len | int | No | Max sequence length |
| --max-num-seqs | int | No | Max concurrent sequences (default 256) |
| --gpu-memory-utilization | float | No | GPU memory fraction (default 0.95) |
Outputs
| Name | Type | Description |
|---|---|---|
| HTTP API | REST API | OpenAI-compatible API at http://host:PORT/v1/ |
Usage Examples
# Set environment variables
export USE_XETLA=OFF
export SYCL_CACHE_PERSISTENT=1
# Launch vLLM API server
python -m ipex_llm.vllm.xpu.entrypoints.openai.api_server \
--served-model-name "llama2-7b" \
--port 8000 \
--model /models/Llama-2-7b-chat-hf \
--device xpu \
--dtype float16 \
--enforce-eager \
--load-in-low-bit fp8 \
--max-model-len 4096
# Client usage (from another terminal)
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{"model": "llama2-7b", "prompt": "Hello, world!", "max_tokens": 100}'