Implementation:Intel Ipex llm IPEXLLMClass Init
| Knowledge Sources | |
|---|---|
| Domains | NLP, Model_Quantization, Serving |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
Concrete tool for initializing the IPEX-LLM vLLM engine with low-bit quantization on Intel XPU.
Description
The IPEXLLMClass (aliased as LLM) is IPEX-LLM's wrapper around the vLLM LLM engine that adds Intel XPU support and low-bit quantization. It initializes the model on XPU devices, configures tensor parallelism via Ray, and sets up the vLLM engine for continuous batching inference.
Usage
Use this to initialize a vLLM engine for offline batch inference or as the backend for online API serving on Intel XPU.
Code Reference
Source Location
- Repository: IPEX-LLM
- File: python/llm/example/GPU/vLLM-Serving/offline_inference.py
- Lines: 48-55
Signature
from ipex_llm.vllm.xpu.engine import IPEXLLMClass as LLM

llm = LLM(
    model: str,                          # HF model ID or local path
    device: str = "xpu",                 # Device type
    dtype: str = "float16",              # Model dtype
    enforce_eager: bool = True,          # Disable CUDA graph (required for XPU)
    load_in_low_bit: str = "fp8",        # Quantization format
    tensor_parallel_size: int = 1,       # Number of GPUs for TP
    max_model_len: int = 2000,           # Max sequence length
    max_num_batched_tokens: int = 2000,  # Max tokens in a batch
    trust_remote_code: bool = True,      # Allow custom model code
    block_size: int = 8,                 # KV cache block size
) -> LLM
Import
from ipex_llm.vllm.xpu.engine import IPEXLLMClass as LLM
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| model | str | Yes | HuggingFace model ID or local path |
| device | str | No | Device type, "xpu" for Intel GPU (default "xpu") |
| dtype | str | No | Model dtype (default "float16") |
| enforce_eager | bool | No | Run in eager mode with CUDA graphs disabled; required for XPU (default True) |
| load_in_low_bit | str | No | Quantization format: "fp8", "sym_int4", "fp6", etc. (default "fp8") |
| tensor_parallel_size | int | No | Number of GPUs for tensor parallelism (default 1) |
| max_model_len | int | No | Maximum sequence length the model can handle (default 2000) |
| max_num_batched_tokens | int | No | Maximum tokens in a single batch (default 2000) |
| trust_remote_code | bool | No | Allow custom model code from the model repository (default True) |
| block_size | int | No | KV cache block size (default 8) |
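The inputs above can be gathered and sanity-checked before the engine is constructed. The following is a minimal sketch, not part of IPEX-LLM: `EngineConfig` and `to_kwargs` are hypothetical names introduced for illustration. The check that `max_num_batched_tokens` is at least `max_model_len` is an assumption based on how the examples on this page keep the two values equal, and the automatic choice of the `"ray"` backend for more than one GPU simply mirrors the multi-GPU example.

```python
from dataclasses import dataclass, asdict


@dataclass
class EngineConfig:
    """Hypothetical container mirroring the Inputs table (not an IPEX-LLM class)."""
    model: str
    device: str = "xpu"
    load_in_low_bit: str = "fp8"
    tensor_parallel_size: int = 1
    max_model_len: int = 2000
    max_num_batched_tokens: int = 2000
    block_size: int = 8

    def to_kwargs(self) -> dict:
        """Validate the fields and return keyword arguments for LLM(...)."""
        if not self.model:
            raise ValueError("model must be a non-empty model ID or local path")
        if self.tensor_parallel_size < 1:
            raise ValueError("tensor_parallel_size must be >= 1")
        # Assumption: a batch must be able to hold one full-length sequence.
        if self.max_num_batched_tokens < self.max_model_len:
            raise ValueError("max_num_batched_tokens must be >= max_model_len")
        kwargs = asdict(self)
        # Assumption: multi-GPU runs use Ray, as in the multi-GPU example.
        if self.tensor_parallel_size > 1:
            kwargs["distributed_executor_backend"] = "ray"
        return kwargs
```

With IPEX-LLM installed, the result could be passed on as `LLM(**EngineConfig(model="meta-llama/Llama-2-7b-chat-hf").to_kwargs())`.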
Outputs
| Name | Type | Description |
|---|---|---|
| llm | LLM | Initialized vLLM engine with model loaded and quantized on XPU |
Usage Examples
from ipex_llm.vllm.xpu.engine import IPEXLLMClass as LLM

# Initialize engine with FP8 quantization
llm = LLM(
    model="meta-llama/Llama-2-7b-chat-hf",
    device="xpu",
    dtype="float16",
    enforce_eager=True,
    load_in_low_bit="fp8",
    tensor_parallel_size=1,
    max_model_len=2000,
    max_num_batched_tokens=2000,
)

# For multi-GPU with SYM_INT4
llm_multi = LLM(
    model="meta-llama/Llama-2-13b-chat-hf",
    device="xpu",
    load_in_low_bit="sym_int4",
    tensor_parallel_size=2,
    max_model_len=4096,
    max_num_batched_tokens=4096,
    distributed_executor_backend="ray",
)
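Once an engine is initialized, offline batch inference typically goes through the standard vLLM `generate` call. The sketch below assumes the IPEX-LLM wrapper exposes the same `generate(prompts, sampling_params)` interface and result objects as the upstream vLLM `LLM` class. `collect_texts` and `generate_offline` are hypothetical helper names, and the sampling values are arbitrary illustrations.

```python
def collect_texts(outputs):
    """Pair each prompt with the text of its first completion."""
    return [(o.prompt, o.outputs[0].text) for o in outputs]


def generate_offline(llm, prompts, temperature=0.8, top_p=0.95, max_tokens=64):
    """Run one batch of prompts through an already initialized engine."""
    # Imported lazily so this module loads even where vLLM is not installed.
    from vllm import SamplingParams

    params = SamplingParams(
        temperature=temperature, top_p=top_p, max_tokens=max_tokens
    )
    # Assumption: the wrapper keeps vLLM's generate() signature.
    outputs = llm.generate(prompts, params)
    return collect_texts(outputs)
```

A call such as `generate_offline(llm, ["Hello, my name is"])` would then return a list of `(prompt, completion)` pairs for the engine created above.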
Related Pages
Implements Principle
Requires Environment