Implementation:Intel Ipex llm Vllm Online Benchmark

Knowledge Sources	Intel IPEX-LLM
Domains	Benchmarking, Serving, vLLM
Last Updated	2026-02-09 04:00 GMT

Overview

Concrete tool for benchmarking vLLM serving endpoints with configurable concurrency and latency measurement provided by the IPEX-LLM Docker utilities.

Description

This benchmark tool sends concurrent HTTP requests to a vLLM-compatible OpenAI API endpoint and measures key latency metrics including first-token latency, next-token latency, and total generation time. It uses a thread pool executor for concurrent request management and supports both fixed prompts and dataset-driven request sampling via ShareGPT format.

Usage

Use this tool when evaluating the throughput and latency characteristics of a deployed vLLM serving endpoint on Intel XPU hardware. It is designed for text-only models and provides statistical aggregation (mean, P50, P90, P99) across configurable numbers of concurrent requests.

Code Reference

Source Location

Repository: Intel IPEX-LLM
File: docker/llm/serving/xpu/docker/vllm_online_benchmark.py
Lines: 1-466

Signature

def sample_requests(
    dataset_path: str,
    num_requests: int,
    model_path: str,
    seed: int = 42,
) -> List[Tuple[str, int, int]]:
    """Sample and tokenize dataset requests."""

def perform_request(session, url, payload, headers):
    """Execute single streaming HTTP request and measure latency."""

def benchmark(
    llm_urls,
    model,
    prompt,
    image_url,
    num_requests,
    max_concurrent_requests,
    max_tokens,
    is_warmup=False,
    dataset=None,
):
    """Main benchmarking orchestrator using thread pool."""

Import

# Standalone benchmark script; run via:
# python vllm_online_benchmark.py --model "model-name" --prompt "text" --num-requests 100

I/O Contract

Inputs

Name	Type	Required	Description
model	str	Yes	Model name for API requests
prompt	str	No	Fixed prompt text for benchmarking
dataset	str	No	Path to ShareGPT-format dataset for varied prompts
num-requests	int	No	Number of requests to send (default varies)
max-concurrent-requests	int	No	Maximum parallel requests
max-tokens	int	No	Maximum tokens to generate per request
llm-urls	str	No	Comma-separated API endpoint URLs

Outputs

Name	Type	Description
Latency statistics	Console output	Mean, P50, P90, P99 for first-token and next-token latency
Throughput metrics	Console output	Total time, requests per second, tokens per second

Usage Examples

Basic Benchmark

python vllm_online_benchmark.py \
    --model "Llama-2-7b-chat-hf" \
    --prompt "What is artificial intelligence?" \
    --num-requests 50 \
    --max-concurrent-requests 10 \
    --max-tokens 128

Dataset-driven Benchmark

python vllm_online_benchmark.py \
    --model "Llama-2-7b-chat-hf" \
    --dataset "./ShareGPT_V3.json" \
    --num-requests 200 \
    --max-concurrent-requests 20

Related Pages

Environment:Intel_Ipex_llm_vLLM_XPU_Serving_Environment

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment