Implementation:Intel Ipex llm vLLM Generate

Knowledge Sources	IPEX-LLM vLLM Documentation
Domains	NLP, Inference
Last Updated	2026-02-09 00:00 GMT

Overview

Wrapper for vLLM's SamplingParams and LLM.generate() for batch text generation on Intel XPU.

Description

This page documents the standard vLLM SamplingParams and llm.generate() API as used through the IPEX-LLM XPU engine. SamplingParams configures generation behavior, such as temperature and top_p. The generate() method accepts a list of prompts and returns a list of RequestOutput objects, one per prompt. The underlying engine uses PagedAttention optimized for Intel XPU.

External Reference

vLLM Documentation

Usage

After initializing the IPEXLLMClass engine, use this API to generate text completions from a batch of prompts.

Code Reference

Source Location

Repository: IPEX-LLM
File: python/llm/example/GPU/vLLM-Serving/offline_inference.py
Lines: 34-63

Signature

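from vllm import SamplingParams

# Values below are the ones used in the referenced example;
# vLLM's own defaults are 1.0 for both parameters.
sampling_params = SamplingParams(
    temperature=0.8,  # float
    top_p=0.95,       # float
)

# prompts: List[str], sampling_params: SamplingParams -> List[RequestOutput]
outputs = llm.generate(prompts, sampling_params)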

Import

from vllm import SamplingParams

I/O Contract

Inputs

Name	Type	Required	Description
prompts	List[str]	Yes	Batch of input prompt strings
temperature	float	No	Sampling temperature (0.8 in the example; vLLM's own default is 1.0; 0 selects greedy decoding)
top_p	float	No	Nucleus sampling threshold (0.95 in the example; vLLM's own default is 1.0)
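
To make the two sampling parameters concrete, here is a minimal pure-Python sketch of temperature scaling followed by nucleus (top-p) filtering over a toy list of logits. It is not vLLM's implementation, which operates on tensors on the device. The function name nucleus_filter and the logit values are hypothetical and chosen only for illustration.

```python
import math


def nucleus_filter(logits, temperature=0.8, top_p=0.95):
    """Return {token_index: probability} after temperature scaling and top-p filtering.

    Illustrative only: a toy model of what the two parameters control.
    """
    if temperature <= 0:
        # Treat temperature 0 as greedy decoding: keep only the most likely token.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return {best: 1.0}

    # Lower temperatures sharpen the distribution, higher ones flatten it.
    scaled = [x / temperature for x in logits]

    # Numerically stable softmax.
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the smallest set of most likely tokens whose mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Renormalize so the surviving probabilities sum to 1.
    kept_total = sum(probs[i] for i in kept)
    return {i: probs[i] / kept_total for i in kept}


toy_logits = [2.0, 1.0, 0.0, -5.0]  # hypothetical scores for four tokens
print(nucleus_filter(toy_logits, temperature=0.8, top_p=0.95))
print(nucleus_filter(toy_logits, temperature=0))  # greedy
```

A sampler would then draw the next token from the returned distribution. Lowering top_p shrinks the candidate set, while lowering temperature concentrates probability on the top candidates.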

Outputs

Name	Type	Description
outputs	List[RequestOutput]	One entry per prompt; each exposes .prompt (the input string) and .outputs[0].text (the generated completion)
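
As a sketch of how the returned objects might be consumed, the helper below pairs each prompt with the text of its first completion. It relies only on the two attributes named in the table above (.prompt and .outputs[0].text). The helper name collect_completions is hypothetical, and the SimpleNamespace objects are stand-ins so the snippet runs without an XPU or a loaded model.

```python
from types import SimpleNamespace


def collect_completions(outputs):
    """Pair each prompt with the text of its first completion."""
    results = []
    for request_output in outputs:
        completions = request_output.outputs
        # Guard against an empty completion list rather than raising IndexError.
        text = completions[0].text if completions else ""
        results.append((request_output.prompt, text))
    return results


# Stand-ins that mimic only the attributes used above (not real RequestOutput objects).
fake_outputs = [
    SimpleNamespace(
        prompt="The capital of France is",
        outputs=[SimpleNamespace(text=" Paris.")],
    ),
    SimpleNamespace(prompt="Hello, my name is", outputs=[]),
]

pairs = collect_completions(fake_outputs)
for prompt, text in pairs:
    print(f"Prompt: {prompt!r}, Generated text: {text!r}")
```

With a real engine, the list returned by llm.generate(prompts, sampling_params) would be passed in place of fake_outputs.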

Usage Examples

from vllm import SamplingParams
from ipex_llm.vllm.xpu.engine import IPEXLLMClass as LLM

# Initialize engine
llm = LLM(model="meta-llama/Llama-2-7b-chat-hf",
          device="xpu", load_in_low_bit="fp8",
          max_model_len=2000, max_num_batched_tokens=2000)

# Configure sampling
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# Batch generation
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

Related Pages

Implements Principle

Principle:Intel_Ipex_llm_Offline_Batch_Inference

Requires Environment

Environment:Intel_Ipex_llm_vLLM_XPU_Serving_Environment
