Implementation:Intel Ipex llm IPEXLLMClass Init

From Leeroopedia
Revision as of 15:12, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/Intel_Ipex_llm_IPEXLLMClass_Init.md)


Knowledge Sources
Domains NLP, Model_Quantization, Serving
Last Updated 2026-02-09 00:00 GMT

Overview

Concrete tool for initializing the IPEX-LLM vLLM engine with low-bit quantization on Intel XPU.

Description

IPEXLLMClass (imported under the alias LLM) is IPEX-LLM's wrapper around the vLLM LLM engine. It adds Intel XPU support and low-bit quantization: it loads the model onto XPU devices, configures tensor parallelism via Ray, and sets up the vLLM engine for continuous-batching inference.
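
To give a rough sense of what low-bit quantization buys, the sketch below estimates weight storage from the nominal bit width of each format. It is back-of-the-envelope arithmetic only: approx_weight_gib and NOMINAL_BITS are hypothetical names, the 7-billion parameter count is a round-number assumption, and the estimate ignores quantization scales, activations, and the KV cache.

```python
# Nominal bits per weight, taken from the format names (an assumption;
# real formats add per-group scale overhead on top of this).
NOMINAL_BITS = {"float16": 16, "fp8": 8, "fp6": 6, "sym_int4": 4}


def approx_weight_gib(num_params: int, fmt: str) -> float:
    """Approximate weight storage in GiB for num_params weights stored in fmt."""
    bits = NOMINAL_BITS[fmt]
    return num_params * bits / 8 / 2**30


# Illustrative model size: a round 7 billion parameters.
for fmt in NOMINAL_BITS:
    print(f"{fmt:>8}: {approx_weight_gib(7_000_000_000, fmt):.1f} GiB")
```

By this estimate, fp8 halves the weight footprint relative to float16 and sym_int4 quarters it, which is what lets larger models fit on a single XPU.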

Usage

Use this to initialize a vLLM engine on Intel XPU, either for offline batch inference or as the backend for online API serving.

Code Reference

Source Location

  • Repository: IPEX-LLM
  • File: python/llm/example/GPU/vLLM-Serving/offline_inference.py
  • Lines: 48-55

Signature

from ipex_llm.vllm.xpu.engine import IPEXLLMClass as LLM

llm = LLM(
    model: str,                         # HF model ID or local path
    device: str = "xpu",                # Device type
    dtype: str = "float16",             # Model dtype
    enforce_eager: bool = True,         # Disable CUDA graph (required for XPU)
    load_in_low_bit: str = "fp8",       # Quantization format
    tensor_parallel_size: int = 1,      # Number of GPUs for TP
    max_model_len: int = 2000,          # Max sequence length
    max_num_batched_tokens: int = 2000, # Max tokens in a batch
    trust_remote_code: bool = True,     # Allow custom model code
    block_size: int = 8,                # KV cache block size
) -> LLM
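
To make the block_size parameter concrete: vLLM's paged KV cache stores keys and values in fixed-size blocks of block_size tokens, so a sequence of n tokens occupies ceil(n / block_size) blocks. The sketch below works through that arithmetic with the values shown in the signature. kv_blocks_needed is a hypothetical helper, not part of IPEX-LLM or vLLM.

```python
def kv_blocks_needed(num_tokens: int, block_size: int = 8) -> int:
    """Number of KV cache blocks one sequence of num_tokens occupies."""
    if num_tokens < 0 or block_size <= 0:
        raise ValueError("num_tokens must be >= 0 and block_size must be > 0")
    # Integer ceiling division: a partially filled block still uses a whole block.
    return (num_tokens + block_size - 1) // block_size


# A sequence at the max_model_len shown above, and a short 13-token prompt.
full_sequence_blocks = kv_blocks_needed(2000, block_size=8)
short_prompt_blocks = kv_blocks_needed(13, block_size=8)
print(full_sequence_blocks, short_prompt_blocks)
```

A smaller block size wastes fewer slots in the last, partially filled block of each sequence, at the cost of more blocks to track.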

Import

from ipex_llm.vllm.xpu.engine import IPEXLLMClass as LLM

I/O Contract

Inputs

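Name | Type | Required | Description
model | str | Yes | HuggingFace model ID or local path
device | str | No | Device type; "xpu" selects an Intel GPU (default "xpu")
dtype | str | No | Model dtype (default "float16")
enforce_eager | bool | No | Disables CUDA graph capture; required for XPU (default True)
load_in_low_bit | str | No | Quantization format: "fp8", "sym_int4", "fp6", etc. (default "fp8")
tensor_parallel_size | int | No | Number of GPUs for tensor parallelism (default 1)
max_model_len | int | No | Maximum sequence length the model can handle (default 2000)
max_num_batched_tokens | int | No | Maximum number of tokens in a single batch (default 2000)
trust_remote_code | bool | No | Allows custom model code to run (default True)
block_size | int | No | KV cache block size in tokens (default 8)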
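
The inputs above can be collected into a keyword dictionary and sanity-checked before the expensive model load. This is a minimal sketch, assuming the defaults listed on this page. build_engine_kwargs is a hypothetical helper, not an IPEX-LLM function. The check that max_num_batched_tokens is at least max_model_len mirrors a constraint upstream vLLM applies when chunked prefill is disabled, and may not hold for every version.

```python
from typing import Any, Dict


def build_engine_kwargs(
    model: str,
    *,
    load_in_low_bit: str = "fp8",
    tensor_parallel_size: int = 1,
    max_model_len: int = 2000,
    max_num_batched_tokens: int = 2000,
    block_size: int = 8,
) -> Dict[str, Any]:
    """Validate the documented inputs and return kwargs for LLM(**kwargs)."""
    if not model:
        raise ValueError("model must be a non-empty model ID or local path")
    if tensor_parallel_size < 1:
        raise ValueError("tensor_parallel_size must be >= 1")
    if max_num_batched_tokens < max_model_len:
        # Assumed constraint: a batch must be able to hold one full-length sequence.
        raise ValueError("max_num_batched_tokens must be >= max_model_len")
    kwargs: Dict[str, Any] = {
        "model": model,
        "device": "xpu",
        "dtype": "float16",
        "enforce_eager": True,
        "load_in_low_bit": load_in_low_bit,
        "tensor_parallel_size": tensor_parallel_size,
        "max_model_len": max_model_len,
        "max_num_batched_tokens": max_num_batched_tokens,
        "block_size": block_size,
    }
    if tensor_parallel_size > 1:
        # Multi-GPU runs use the Ray executor, as in this page's multi-GPU example.
        kwargs["distributed_executor_backend"] = "ray"
    return kwargs


single = build_engine_kwargs("meta-llama/Llama-2-7b-chat-hf")
multi = build_engine_kwargs(
    "meta-llama/Llama-2-13b-chat-hf",
    load_in_low_bit="sym_int4",
    tensor_parallel_size=2,
    max_model_len=4096,
    max_num_batched_tokens=4096,
)
print(single)
print(multi)
# On a machine with ipex_llm and an Intel XPU: llm = LLM(**single)
```

Validating first keeps configuration mistakes from surfacing only after the weights have been downloaded and quantized.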

Outputs

Name | Type | Description
llm | LLM | Initialized vLLM engine with the model loaded and quantized on XPU

Usage Examples

from ipex_llm.vllm.xpu.engine import IPEXLLMClass as LLM

# Initialize engine with FP8 quantization
llm = LLM(
    model="meta-llama/Llama-2-7b-chat-hf",
    device="xpu",
    dtype="float16",
    enforce_eager=True,
    load_in_low_bit="fp8",
    tensor_parallel_size=1,
    max_model_len=2000,
    max_num_batched_tokens=2000,
)

# Alternative configuration: SYM_INT4 weights sharded across 2 XPUs, with Ray as the executor backend
llm_multi = LLM(
    model="meta-llama/Llama-2-13b-chat-hf",
    device="xpu",
    load_in_low_bit="sym_int4",
    tensor_parallel_size=2,
    max_model_len=4096,
    max_num_batched_tokens=4096,
    distributed_executor_backend="ray",
)
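
Once constructed, the engine is typically driven through a batched generate call. The sketch below assumes the wrapper keeps upstream vLLM's generate(prompts, sampling_params) interface and result shape (each result exposing .prompt and .outputs[0].text). collect_completions is a hypothetical helper, and EchoEngine is a stand-in that exists only so the snippet runs without an XPU.

```python
from types import SimpleNamespace
from typing import Any, List, Sequence, Tuple


def collect_completions(
    llm: Any, prompts: Sequence[str], sampling_params: Any
) -> List[Tuple[str, str]]:
    """Run one batched generate() call; return (prompt, first completion) pairs."""
    outputs = llm.generate(list(prompts), sampling_params)
    return [(out.prompt, out.outputs[0].text) for out in outputs]


class EchoEngine:
    """Stand-in with the assumed generate() shape; swap in the real LLM object."""

    def generate(self, prompts, sampling_params):
        return [
            SimpleNamespace(prompt=p, outputs=[SimpleNamespace(text=" " + p.upper())])
            for p in prompts
        ]


pairs = collect_completions(
    EchoEngine(),
    ["Hello, my name is", "The capital of France is"],
    sampling_params=None,
)
for prompt, text in pairs:
    print(f"Prompt: {prompt!r}, Generated text: {text!r}")

# With the real engine (needs ipex_llm, vllm, and an Intel XPU):
#   from vllm import SamplingParams
#   pairs = collect_completions(llm, prompts, SamplingParams(temperature=0.8, top_p=0.95))
```

Passing all prompts in one call lets the engine's continuous batching schedule them together instead of running them one at a time.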

Related Pages

Implements Principle

Requires Environment
