Implementation:Intel Ipex llm Vllm Online Benchmark
| Knowledge Sources | |
|---|---|
| Domains | Benchmarking, Serving, vLLM |
| Last Updated | 2026-02-09 04:00 GMT |
Overview
A concrete tool, provided by the IPEX-LLM Docker utilities, for benchmarking vLLM serving endpoints with configurable concurrency and latency measurement.
Description
This benchmark tool sends concurrent HTTP requests to an OpenAI-compatible API endpoint served by vLLM and measures key latency metrics: first-token latency, next-token latency, and total generation time. It uses a thread pool executor to manage concurrent requests and supports both a fixed prompt and prompts sampled from a ShareGPT-format dataset.
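The three latency metrics named above can be derived from the arrival times of streamed response chunks. The following is a minimal sketch, not the script's own code: the function name `summarize_stream_timing` is hypothetical, and it assumes each streamed chunk carries roughly one token, so the average gap between chunks approximates next-token latency.

```python
from typing import Dict, Sequence


def summarize_stream_timing(start: float, chunk_times: Sequence[float]) -> Dict[str, float]:
    """Derive latency metrics from the arrival times of streamed chunks.

    Assumption: one chunk is roughly one token, so the mean gap between
    consecutive chunks stands in for next-token latency.
    """
    if not chunk_times:
        raise ValueError("no chunks were received")

    # Time from sending the request until the first chunk arrives.
    first_token_latency = chunk_times[0] - start
    # Time from sending the request until the last chunk arrives.
    total_time = chunk_times[-1] - start

    later_chunks = len(chunk_times) - 1
    # Mean gap between chunks after the first; 0.0 if only one chunk arrived.
    next_token_latency = (
        (chunk_times[-1] - chunk_times[0]) / later_chunks if later_chunks else 0.0
    )
    return {
        "first_token_latency": first_token_latency,
        "next_token_latency": next_token_latency,
        "total_time": total_time,
    }
```

In a real run, `start` and `chunk_times` would typically be taken from a monotonic clock such as `time.perf_counter()` while iterating over the streaming response.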
Usage
Use this tool when evaluating the throughput and latency characteristics of a deployed vLLM serving endpoint on Intel XPU hardware. It is designed for text-only models and provides statistical aggregation (mean, P50, P90, P99) across configurable numbers of concurrent requests.
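As an illustration of the statistical aggregation mentioned above, the sketch below computes the mean and the P50, P90, and P99 percentiles of a list of latencies using only the standard library. The helper names `percentile` and `aggregate` are hypothetical, and linear interpolation between ranks is an assumption; the script itself may use a different percentile method.

```python
import math
from statistics import mean
from typing import Dict, Sequence


def percentile(values: Sequence[float], pct: float) -> float:
    """Percentile with linear interpolation between the two closest ranks."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    # Fractional index into the sorted list for the requested percentile.
    rank = (len(ordered) - 1) * pct / 100.0
    lower = math.floor(rank)
    upper = math.ceil(rank)
    if lower == upper:
        return ordered[lower]
    weight = rank - lower
    return ordered[lower] * (1 - weight) + ordered[upper] * weight


def aggregate(latencies: Sequence[float]) -> Dict[str, float]:
    """Summarize per-request latencies as mean, P50, P90, and P99."""
    return {
        "mean": mean(latencies),
        "p50": percentile(latencies, 50),
        "p90": percentile(latencies, 90),
        "p99": percentile(latencies, 99),
    }
```

Tail percentiles such as P99 are only meaningful with enough samples, which is one reason to raise the number of requests when comparing configurations.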
Code Reference
Source Location
- Repository: Intel IPEX-LLM
- File: docker/llm/serving/xpu/docker/vllm_online_benchmark.py
- Lines: 1-466
Signature
def sample_requests(
    dataset_path: str,
    num_requests: int,
    model_path: str,
    seed: int = 42,
) -> List[Tuple[str, int, int]]:
    """Sample and tokenize dataset requests."""

def perform_request(session, url, payload, headers):
    """Execute single streaming HTTP request and measure latency."""

def benchmark(
    llm_urls,
    model,
    prompt,
    image_url,
    num_requests,
    max_concurrent_requests,
    max_tokens,
    is_warmup=False,
    dataset=None,
):
    """Main benchmarking orchestrator using thread pool."""
Import
# Standalone benchmark script; run via:
# python vllm_online_benchmark.py --model "model-name" --prompt "text" --num-requests 100
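The thread-pool orchestration performed by `benchmark` can be sketched roughly as follows. This is an illustrative outline only: `run_concurrently` and `fake_request` are hypothetical names, and `fake_request` sleeps instead of issuing an HTTP call so the example runs without a live endpoint.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, List


def run_concurrently(
    send_one: Callable[[int], float],
    num_requests: int,
    max_concurrent_requests: int,
) -> List[float]:
    """Run num_requests calls with at most max_concurrent_requests in flight."""
    results: List[float] = []
    # max_workers caps how many requests execute at the same time.
    with ThreadPoolExecutor(max_workers=max_concurrent_requests) as executor:
        futures = [executor.submit(send_one, i) for i in range(num_requests)]
        # Collect each result as soon as its request finishes.
        for future in as_completed(futures):
            results.append(future.result())
    return results


def fake_request(index: int) -> float:
    """Stand-in for a real HTTP request; returns its elapsed time in seconds."""
    start = time.perf_counter()
    time.sleep(0.01)  # placeholder for network and generation time
    return time.perf_counter() - start


if __name__ == "__main__":
    latencies = run_concurrently(fake_request, num_requests=8, max_concurrent_requests=4)
    print(f"completed {len(latencies)} requests")
```

Threads suit this workload because each worker spends most of its time waiting on network I/O rather than computing.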
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| model | str | Yes | Model name for API requests |
| prompt | str | No | Fixed prompt text for benchmarking |
| dataset | str | No | Path to ShareGPT-format dataset for varied prompts |
| num-requests | int | No | Number of requests to send (default varies) |
| max-concurrent-requests | int | No | Maximum parallel requests |
| max-tokens | int | No | Maximum tokens to generate per request |
| llm-urls | str | No | Comma-separated API endpoint URLs |
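For the `dataset` input, a ShareGPT-format file is a JSON list whose entries each hold a `conversations` list of turns with `from` and `value` fields. The sketch below shows one way to sample prompts from already-loaded records with a fixed seed. It is not the script's `sample_requests`: the name `sample_sharegpt` is hypothetical, and whitespace word counts stand in for the tokenizer-based lengths that the real function derives from `model_path`.

```python
import random
from typing import Any, Dict, List, Tuple


def sample_sharegpt(
    records: List[Dict[str, Any]],
    num_requests: int,
    seed: int = 42,
) -> List[Tuple[str, int, int]]:
    """Return (prompt, prompt_len, output_len) tuples sampled from the records."""
    # Keep only conversations with at least a prompt turn and a reply turn.
    pairs = [
        (r["conversations"][0]["value"], r["conversations"][1]["value"])
        for r in records
        if len(r.get("conversations", [])) >= 2
    ]
    # A dedicated Random instance makes the sample reproducible for a given seed.
    rng = random.Random(seed)
    chosen = rng.sample(pairs, min(num_requests, len(pairs)))
    # Assumption: word counts approximate token counts for illustration only.
    return [(prompt, len(prompt.split()), len(reply.split())) for prompt, reply in chosen]
```

The records would normally come from `json.load` on the dataset file; fixing the seed keeps the sampled prompts identical across runs so results stay comparable.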
Outputs
| Name | Type | Description |
|---|---|---|
| Latency statistics | Console output | Mean, P50, P90, P99 for first-token and next-token latency |
| Throughput metrics | Console output | Total time, requests per second, tokens per second |
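The throughput figures in the table follow directly from the request count, the generated token count, and the wall-clock time of the run. A minimal sketch, assuming a hypothetical helper named `throughput` and that the token total counts generated tokens across all requests:

```python
from typing import Dict


def throughput(num_requests: int, total_tokens: int, elapsed_seconds: float) -> Dict[str, float]:
    """Compute requests per second and tokens per second for one benchmark run."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return {
        "requests_per_second": num_requests / elapsed_seconds,
        "tokens_per_second": total_tokens / elapsed_seconds,
    }
```

For example, 50 requests that generate 6400 tokens in 20 seconds correspond to 2.5 requests per second and 320 tokens per second.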
Usage Examples
Basic Benchmark
python vllm_online_benchmark.py \
    --model "Llama-2-7b-chat-hf" \
    --prompt "What is artificial intelligence?" \
    --num-requests 50 \
    --max-concurrent-requests 10 \
    --max-tokens 128
Dataset-driven Benchmark
python vllm_online_benchmark.py \
    --model "Llama-2-7b-chat-hf" \
    --dataset "./ShareGPT_V3.json" \
    --num-requests 200 \
    --max-concurrent-requests 20