Implementation:Intel Ipex llm Vllm Online Benchmark

From Leeroopedia


Knowledge Sources
Domains: Benchmarking, Serving, vLLM
Last Updated: 2026-02-09 04:00 GMT

Overview

A concrete tool, provided by the IPEX-LLM Docker utilities, for benchmarking vLLM serving endpoints with configurable concurrency and latency measurement.

Description

This benchmark tool sends concurrent HTTP requests to an OpenAI-compatible API endpoint served by vLLM and measures three latency metrics: first-token latency, next-token latency, and total generation time. It manages concurrent requests with a thread pool executor and supports both a fixed prompt and request sampling from a ShareGPT-format dataset.
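The measurement approach described above can be sketched as follows. This is a minimal illustration, not the script's actual code: `measure_stream`, `run_concurrently`, and the injectable `clock` argument are hypothetical names, and the streamed response is modeled as a plain iterable of chunks rather than a real HTTP stream.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List, Optional, Tuple

# (first_token_latency, next_token_latencies, total_time), all in seconds
Timing = Tuple[Optional[float], List[float], float]


def measure_stream(
    chunks: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> Timing:
    """Time a stream of generated chunks (hypothetical helper)."""
    start = clock()
    first_token_latency: Optional[float] = None
    next_token_latencies: List[float] = []
    last = start
    for _ in chunks:
        now = clock()
        if first_token_latency is None:
            # Time from sending the request to receiving the first chunk.
            first_token_latency = now - start
        else:
            # Gap between consecutive chunks.
            next_token_latencies.append(now - last)
        last = now
    total_time = clock() - start
    return first_token_latency, next_token_latencies, total_time


def run_concurrently(
    send: Callable[[], Timing],
    num_requests: int,
    max_concurrent_requests: int,
) -> List[Timing]:
    """Call send() num_requests times, at most max_concurrent_requests at once."""
    with ThreadPoolExecutor(max_workers=max_concurrent_requests) as pool:
        return list(pool.map(lambda _: send(), range(num_requests)))
```

In a real run, `send` would open a streaming HTTP request and pass the response chunks to `measure_stream`. Keeping the clock injectable makes the timing logic testable without a live server.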

Usage

Use this tool when evaluating the throughput and latency characteristics of a deployed vLLM serving endpoint on Intel XPU hardware. It is designed for text-only models and provides statistical aggregation (mean, P50, P90, P99) across configurable numbers of concurrent requests.
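The statistical aggregation mentioned above can be illustrated with a short standard-library sketch. The helper names `percentile` and `summarize` are hypothetical, and linear interpolation between ranks is an assumption; the script may compute percentiles differently (for example, with NumPy).

```python
import math
from statistics import mean
from typing import Dict, Sequence


def percentile(values: Sequence[float], q: float) -> float:
    """Return the q-th percentile (0 to 100) using linear interpolation."""
    if not values:
        raise ValueError("percentile() requires at least one value")
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100.0
    lower = math.floor(rank)
    upper = math.ceil(rank)
    if lower == upper:
        return ordered[lower]
    weight = rank - lower
    return ordered[lower] * (1 - weight) + ordered[upper] * weight


def summarize(latencies: Sequence[float]) -> Dict[str, float]:
    """Aggregate latencies into mean, P50, P90, and P99."""
    return {
        "mean": mean(latencies),
        "p50": percentile(latencies, 50),
        "p90": percentile(latencies, 90),
        "p99": percentile(latencies, 99),
    }
```

With few requests, P99 is effectively the slowest sample, so tail percentiles are only informative when the number of requests is large enough.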

Code Reference

Source Location

Signature

from typing import List, Tuple

def sample_requests(
    dataset_path: str,
    num_requests: int,
    model_path: str,
    seed: int = 42,
) -> List[Tuple[str, int, int]]:
    """Sample and tokenize dataset requests."""

def perform_request(session, url, payload, headers):
    """Execute single streaming HTTP request and measure latency."""

def benchmark(
    llm_urls,
    model,
    prompt,
    image_url,
    num_requests,
    max_concurrent_requests,
    max_tokens,
    is_warmup=False,
    dataset=None,
):
    """Main benchmarking orchestrator using thread pool."""

Import

# Standalone benchmark script; run via:
# python vllm_online_benchmark.py --model "model-name" --prompt "text" --num-requests 100

I/O Contract

Inputs

Name                     Type  Required  Description
model                    str   Yes       Model name for API requests
prompt                   str   No        Fixed prompt text for benchmarking
dataset                  str   No        Path to ShareGPT-format dataset for varied prompts
num-requests             int   No        Number of requests to send (default varies)
max-concurrent-requests  int   No        Maximum parallel requests
max-tokens               int   No        Maximum tokens to generate per request
llm-urls                 str   No        Comma-separated API endpoint URLs
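The inputs above map naturally onto a command-line parser. The following is a sketch under stated assumptions, not the script's real parser: the flag names come from the table, but `build_parser` and every default value shown are placeholders chosen for illustration.

```python
import argparse
from typing import List


def split_urls(value: str) -> List[str]:
    """Split a comma-separated URL string, dropping empty entries."""
    return [url.strip() for url in value.split(",") if url.strip()]


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented inputs."""
    parser = argparse.ArgumentParser(description="vLLM online benchmark (sketch)")
    parser.add_argument("--model", required=True, help="Model name for API requests")
    parser.add_argument("--prompt", default=None, help="Fixed prompt text")
    parser.add_argument("--dataset", default=None, help="Path to ShareGPT-format dataset")
    # All defaults below are assumed placeholders, not the script's values.
    parser.add_argument("--num-requests", type=int, default=100)
    parser.add_argument("--max-concurrent-requests", type=int, default=1)
    parser.add_argument("--max-tokens", type=int, default=128)
    parser.add_argument(
        "--llm-urls",
        type=split_urls,
        default=["http://localhost:8000"],
        help="Comma-separated API endpoint URLs",
    )
    return parser
```

Note that `argparse` converts hyphenated flags to underscored attributes, so `--num-requests` is read back as `args.num_requests`.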

Outputs

Name                Type            Description
Latency statistics  Console output  Mean, P50, P90, P99 for first-token and next-token latency
Throughput metrics  Console output  Total time, requests per second, tokens per second
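The throughput metrics follow directly from the total wall-clock time. A minimal sketch, assuming a hypothetical `throughput` helper and that token counts are summed over all completed requests:

```python
from typing import Dict


def throughput(num_requests: int, total_tokens: int, total_time_s: float) -> Dict[str, float]:
    """Compute requests per second and tokens per second for one run."""
    if total_time_s <= 0:
        raise ValueError("total_time_s must be positive")
    return {
        "requests_per_second": num_requests / total_time_s,
        "tokens_per_second": total_tokens / total_time_s,
    }
```

For example, 50 requests producing 6400 tokens in 20 seconds give 2.5 requests per second and 320 tokens per second.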

Usage Examples

Basic Benchmark

python vllm_online_benchmark.py \
    --model "Llama-2-7b-chat-hf" \
    --prompt "What is artificial intelligence?" \
    --num-requests 50 \
    --max-concurrent-requests 10 \
    --max-tokens 128
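For orientation, a request built from these arguments might look like the sketch below. The page does not show the payload the script actually sends, so this is an assumption: `build_completion_payload` is a hypothetical helper, and the fields are the standard `model`, `prompt`, `max_tokens`, and `stream` fields of the OpenAI completions API.

```python
import json
from typing import Any, Dict


def build_completion_payload(model: str, prompt: str, max_tokens: int) -> Dict[str, Any]:
    """Build an OpenAI-style streaming completion request body (assumed shape)."""
    return {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        # Streaming is needed to observe first-token and next-token latency.
        "stream": True,
    }


payload = build_completion_payload(
    "Llama-2-7b-chat-hf", "What is artificial intelligence?", 128
)
body = json.dumps(payload)  # JSON body for the HTTP POST
```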

Dataset-driven Benchmark

python vllm_online_benchmark.py \
    --model "Llama-2-7b-chat-hf" \
    --dataset "./ShareGPT_V3.json" \
    --num-requests 200 \
    --max-concurrent-requests 20
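Dataset-driven sampling can be sketched as follows. This is an illustration rather than the script's `sample_requests`: `sample_sharegpt` is a hypothetical name, it takes already-loaded records instead of a path, and it counts whitespace-separated words as a stand-in for the tokenizer-based lengths the real function presumably computes from `model_path`.

```python
import random
from typing import Any, Dict, List, Tuple


def sample_sharegpt(
    records: List[Dict[str, Any]],
    num_requests: int,
    seed: int = 42,
) -> List[Tuple[str, int, int]]:
    """Return (prompt, prompt_len, output_len) tuples sampled from ShareGPT records."""
    pairs: List[Tuple[str, int, int]] = []
    for record in records:
        turns = record.get("conversations", [])
        # Keep only conversations with a prompt and at least one reply.
        if len(turns) < 2:
            continue
        prompt = turns[0]["value"]
        completion = turns[1]["value"]
        # Word counts approximate token counts here (assumption).
        pairs.append((prompt, len(prompt.split()), len(completion.split())))
    rng = random.Random(seed)  # fixed seed keeps runs reproducible
    return rng.sample(pairs, min(num_requests, len(pairs)))
```

The records would typically come from `json.load` on the dataset file, where each entry holds a `conversations` list of `{"from": ..., "value": ...}` turns.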
