Implementation:Intel Ipex llm vLLM Generate

From Leeroopedia


Knowledge Sources
Domains NLP, Inference
Last Updated 2026-02-09 00:00 GMT

Overview

Wrapper for vLLM's SamplingParams and LLM.generate() for batch text generation on Intel XPU.

Description

This page documents the standard vLLM SamplingParams and LLM.generate() API as used through the IPEX-LLM XPU engine. SamplingParams configures generation behavior, such as temperature and top_p. The generate() method accepts a list of prompts and returns one RequestOutput object per prompt. The underlying engine uses PagedAttention optimized for Intel XPU.

External Reference

Usage

Use after initializing the IPEXLLMClass engine to generate text completions from prompts.

Code Reference

Source Location

  • Repository: IPEX-LLM
  • File: python/llm/example/GPU/vLLM-Serving/offline_inference.py
  • Lines: 34-63

Signature

from vllm import SamplingParams

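# The values shown are the ones the example script passes;
# vLLM's own defaults for temperature and top_p are both 1.0.
sampling_params = SamplingParams(
    temperature: float = 0.8,
    top_p: float = 0.95,
)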

outputs = llm.generate(
    prompts: List[str],
    sampling_params: SamplingParams
) -> List[RequestOutput]

Import

from vllm import SamplingParams

I/O Contract

Inputs

Name Type Required Description
prompts List[str] Yes Batch of input prompt strings
temperature float No Sampling temperature (0.8 in the example; vLLM's own default is 1.0; 0 selects greedy decoding)
top_p float No Nucleus sampling threshold in (0, 1] (0.95 in the example; vLLM's own default is 1.0)
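
To make the two sampling inputs concrete, the following is a minimal pure-Python sketch of how temperature scaling and nucleus (top_p) filtering select the next token. It is an illustration only and is not vLLM's or IPEX-LLM's sampler, which operates on batched tensors on the device. The function names softmax_with_temperature, top_p_filter, and sample_token are hypothetical and do not exist in either library.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities; a lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the smallest set of most-likely tokens whose cumulative
    probability reaches top_p, then renormalize over that set."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for token_id in order:
        kept.append(token_id)
        cumulative += probs[token_id]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


def sample_token(logits, temperature=0.8, top_p=0.95, rng=random):
    """Illustrative sampler: temperature 0 means greedy (argmax) decoding."""
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    probs = softmax_with_temperature(logits, temperature)
    candidates = top_p_filter(probs, top_p)
    token_ids = list(candidates)
    weights = [candidates[i] for i in token_ids]
    return rng.choices(token_ids, weights=weights, k=1)[0]
```

In this sketch, lowering top_p shrinks the candidate set, and setting temperature to 0 bypasses sampling entirely, which is why repeated runs with temperature 0 return the same token for the same logits.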

Outputs

Name Type Description
outputs List[RequestOutput] Each contains .prompt and .outputs[0].text
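
As a sketch of how the output contract can be consumed, the helper below gathers (prompt, generated text) pairs from objects shaped like RequestOutput. The helper name collect_completions is hypothetical, and the SimpleNamespace objects are stand-ins that mimic only the .prompt and .outputs[0].text attributes described above, so the snippet runs without vLLM or an XPU device.

```python
from types import SimpleNamespace


def collect_completions(outputs):
    """Return (prompt, first completion text) pairs from RequestOutput-like objects.

    Entries with an empty .outputs list are skipped rather than raising IndexError.
    """
    pairs = []
    for request_output in outputs:
        if not request_output.outputs:
            continue
        pairs.append((request_output.prompt, request_output.outputs[0].text))
    return pairs


# Stand-in objects (assumption: only the attributes named in the table are modeled).
fake_outputs = [
    SimpleNamespace(
        prompt="The capital of France is",
        outputs=[SimpleNamespace(text=" Paris.")],
    ),
    SimpleNamespace(prompt="Hello, my name is", outputs=[]),
]

pairs = collect_completions(fake_outputs)
for prompt, text in pairs:
    print(f"Prompt: {prompt!r}, Generated text: {text!r}")
```

With a real engine, the list returned by llm.generate() could be passed to the same helper in place of fake_outputs.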

Usage Examples

from vllm import SamplingParams
from ipex_llm.vllm.xpu.engine import IPEXLLMClass as LLM

# Initialize engine
llm = LLM(model="meta-llama/Llama-2-7b-chat-hf",
          device="xpu", load_in_low_bit="fp8",
          max_model_len=2000, max_num_batched_tokens=2000)

# Configure sampling
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# Batch generation
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

Related Pages

Implements Principle

Requires Environment
