Implementation:Intel Ipex llm IPEXLLMClass Init

Knowledge Sources	IPEX-LLM vLLM Documentation
Domains	NLP, Model_Quantization, Serving
Last Updated	2026-02-09 00:00 GMT

Overview

Concrete tool for initializing the IPEX-LLM vLLM engine with low-bit quantization on Intel XPU.

Description

The IPEXLLMClass (aliased as LLM) is IPEX-LLM's wrapper around the vLLM LLM engine that adds Intel XPU support and low-bit quantization. It initializes the model on XPU devices, configures tensor parallelism via Ray, and sets up the vLLM engine for continuous batching inference.

Usage

Use this to initialize a vLLM engine for offline batch inference or as the backend for online API serving on Intel XPU.
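
A minimal sketch of the offline batch path, assuming the wrapper exposes vLLM's standard `generate` method and accepts vLLM's `SamplingParams`, as the upstream `LLM` class does. The helper names `run_offline_batch` and `summarize` are illustrative and not part of IPEX-LLM; the model ID and sampling values are placeholders.

```python
def run_offline_batch(prompts, model="meta-llama/Llama-2-7b-chat-hf"):
    """Load the engine on XPU and generate one completion per prompt."""
    # Imported lazily so this module still loads on machines
    # that do not have vllm / ipex_llm installed.
    from vllm import SamplingParams
    from ipex_llm.vllm.xpu.engine import IPEXLLMClass as LLM

    llm = LLM(
        model=model,
        device="xpu",
        dtype="float16",
        enforce_eager=True,
        load_in_low_bit="fp8",
        tensor_parallel_size=1,
        max_model_len=2000,
        max_num_batched_tokens=2000,
    )
    # Placeholder sampling settings; tune for the workload.
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)
    return llm.generate(prompts, sampling_params)


def summarize(outputs):
    """Pair each prompt with the text of its first completion."""
    return [(out.prompt, out.outputs[0].text) for out in outputs]
```

On a machine with the XPU serving environment installed, `summarize(run_offline_batch(["Hello, my name is"]))` would return a list of `(prompt, completion)` tuples.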

Code Reference

Source Location

Repository: IPEX-LLM
File: python/llm/example/GPU/vLLM-Serving/offline_inference.py
Lines: 48-55

Signature

from ipex_llm.vllm.xpu.engine import IPEXLLMClass as LLM

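# Schematic call signature. The values shown follow the example script referenced
# under Source Location and are not guaranteed to be the constructor's built-in defaults.
llm = LLM(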
    model: str,                         # HF model ID or local path
    device: str = "xpu",                # Device type
    dtype: str = "float16",             # Model dtype
    enforce_eager: bool = True,         # Disable CUDA graph (required for XPU)
    load_in_low_bit: str = "fp8",       # Quantization format
    tensor_parallel_size: int = 1,      # Number of GPUs for TP
    max_model_len: int = 2000,          # Max sequence length
    max_num_batched_tokens: int = 2000, # Max tokens in a batch
    trust_remote_code: bool = True,     # Allow custom model code
    block_size: int = 8,                # KV cache block size
) -> LLM

Import

from ipex_llm.vllm.xpu.engine import IPEXLLMClass as LLM

I/O Contract

Inputs

Name	Type	Required	Description
model	str	Yes	HuggingFace model ID or local path
device	str	No	Device type, "xpu" for Intel GPU (default "xpu")
dtype	str	No	Model dtype (default "float16")
enforce_eager	bool	No	Disable CUDA graph; required for XPU (default True)
load_in_low_bit	str	No	Quantization format: "fp8", "sym_int4", "fp6", etc. (default "fp8")
tensor_parallel_size	int	No	Number of GPUs for tensor parallelism (default 1)
max_model_len	int	No	Maximum sequence length the model can handle (default 2000)
max_num_batched_tokens	int	No	Maximum tokens in a single batch (default 2000)
trust_remote_code	bool	No	Allow custom model code from the model repository (default True)
block_size	int	No	KV cache block size (default 8)
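
The sizing parameters above interact, and a quick pre-flight check can catch a bad combination before the model is loaded. The sketch below is illustrative only: `blocks_per_sequence` and `check_sizing` are hypothetical helpers, not IPEX-LLM functions. The rule that `max_num_batched_tokens` should not be smaller than `max_model_len` reflects upstream vLLM behavior when chunked prefill is disabled, and is assumed here to carry over to the IPEX-LLM wrapper.

```python
def blocks_per_sequence(max_model_len: int, block_size: int) -> int:
    """KV-cache blocks needed to hold one sequence of maximum length."""
    # Integer ceiling division: a partially filled block still occupies a block.
    return (max_model_len + block_size - 1) // block_size


def check_sizing(max_model_len: int, max_num_batched_tokens: int, block_size: int) -> int:
    """Validate the sizing arguments and return the blocks per full-length sequence."""
    if max_model_len <= 0 or max_num_batched_tokens <= 0 or block_size <= 0:
        raise ValueError("all sizing arguments must be positive")
    # Assumption: without chunked prefill, a full-length prompt must fit in one batch.
    if max_num_batched_tokens < max_model_len:
        raise ValueError(
            f"max_num_batched_tokens ({max_num_batched_tokens}) is smaller than "
            f"max_model_len ({max_model_len})"
        )
    return blocks_per_sequence(max_model_len, block_size)
```

With the values used on this page, `check_sizing(2000, 2000, 8)` returns 250, meaning one maximum-length sequence occupies 250 KV-cache blocks of 8 tokens each.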

Outputs

Name	Type	Description
llm	LLM	Initialized vLLM engine with model loaded and quantized on XPU

Usage Examples

from ipex_llm.vllm.xpu.engine import IPEXLLMClass as LLM

# Initialize engine with FP8 quantization
llm = LLM(
    model="meta-llama/Llama-2-7b-chat-hf",
    device="xpu",
    dtype="float16",
    enforce_eager=True,
    load_in_low_bit="fp8",
    tensor_parallel_size=1,
    max_model_len=2000,
    max_num_batched_tokens=2000,
)

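# Alternative: multi-GPU tensor parallelism with SYM_INT4.
# Run this in place of the call above rather than alongside it.
llm_multi = LLM(
    model="meta-llama/Llama-2-13b-chat-hf",
    device="xpu",
    dtype="float16",
    enforce_eager=True,                  # required for XPU, as in the first example
    load_in_low_bit="sym_int4",
    tensor_parallel_size=2,              # shard the model across 2 XPUs
    max_model_len=4096,
    max_num_batched_tokens=4096,
    distributed_executor_backend="ray",  # Ray coordinates the tensor-parallel workers
)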
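
The two calls above differ only in a few arguments, so one way to keep single-GPU and multi-GPU configurations consistent is to build the keyword arguments in one place. This is a sketch under stated assumptions: `engine_kwargs` is a hypothetical helper, not an IPEX-LLM API, and it assumes the Ray backend is only needed when `tensor_parallel_size` is greater than 1.

```python
def engine_kwargs(model, low_bit="fp8", tp_size=1, max_len=2000):
    """Build keyword arguments for the engine constructor (illustrative helper)."""
    if tp_size < 1:
        raise ValueError("tp_size must be at least 1")
    kwargs = {
        "model": model,
        "device": "xpu",
        "dtype": "float16",
        "enforce_eager": True,
        "load_in_low_bit": low_bit,
        "tensor_parallel_size": tp_size,
        "max_model_len": max_len,
        # Keep the batch token budget equal to the sequence limit, as in the examples.
        "max_num_batched_tokens": max_len,
    }
    if tp_size > 1:
        # Assumption: Ray is requested only for multi-GPU tensor parallelism.
        kwargs["distributed_executor_backend"] = "ray"
    return kwargs
```

The result can then be unpacked into the constructor, for example `LLM(**engine_kwargs("meta-llama/Llama-2-13b-chat-hf", low_bit="sym_int4", tp_size=2, max_len=4096))`.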

Related Pages

Implements Principle

Principle:Intel_Ipex_llm_Low_Bit_Quantization_For_Serving

Requires Environment

Environment:Intel_Ipex_llm_vLLM_XPU_Serving_Environment
