Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Implementation:Intel Ipex llm vLLM API Server

From Leeroopedia


Knowledge Sources
Domains NLP, Serving, API
Last Updated 2026-02-09 00:00 GMT

Overview

External tool for launching an OpenAI-compatible API server using IPEX-LLM's vLLM entrypoint on Intel XPU.

Description

This is an External Tool Doc for the ipex_llm.vllm.xpu.entrypoints.openai.api_server module. It is launched as a Python module from the command line with parameters specifying the model, device, quantization, parallelism, and serving configuration. The server exposes OpenAI-compatible /v1/completions and /v1/chat/completions endpoints.

Usage

Use to deploy an LLM as a production HTTP API server on Intel XPU. Launch via the start-vllm-service.sh script or directly with the Python module command.

Code Reference

Source Location

  • Repository: IPEX-LLM
  • File: docker/llm/serving/xpu/docker/start-vllm-service.sh
  • Lines: 40-56

Signature

python -m ipex_llm.vllm.xpu.entrypoints.openai.api_server \
    --served-model-name MODEL_NAME \
    --port 8000 \
    --model MODEL_PATH \
    --device xpu \
    --dtype float16 \
    --enforce-eager \
    --load-in-low-bit fp8 \
    --tensor-parallel-size 1 \
    --max-model-len 4096 \
    --max-num-seqs 256 \
    --gpu-memory-utilization 0.95

Import

# Launched as a module, not imported
python -m ipex_llm.vllm.xpu.entrypoints.openai.api_server

I/O Contract

Inputs

Name Type Required Description
--model str Yes HuggingFace model ID or local path
--port int No Listening port (default 8000)
--load-in-low-bit str No Quantization format (default "fp8")
--tensor-parallel-size int No Number of GPUs for TP (default 1)
--max-model-len int No Max sequence length
--max-num-seqs int No Max concurrent sequences (default 256)
--gpu-memory-utilization float No GPU memory fraction (default 0.95)

Outputs

Name Type Description
HTTP API REST API OpenAI-compatible API at http://host:PORT/v1/

Usage Examples

# Set environment variables
export USE_XETLA=OFF
export SYCL_CACHE_PERSISTENT=1

# Launch vLLM API server
python -m ipex_llm.vllm.xpu.entrypoints.openai.api_server \
    --served-model-name "llama2-7b" \
    --port 8000 \
    --model /models/Llama-2-7b-chat-hf \
    --device xpu \
    --dtype float16 \
    --enforce-eager \
    --load-in-low-bit fp8 \
    --max-model-len 4096

# Client usage (from another terminal)
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "llama2-7b", "prompt": "Hello, world!", "max_tokens": 100}'

Related Pages

Implements Principle

Requires Environment

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment