Principle:Intel Ipex llm Serving Benchmark Methodology
| Knowledge Sources | |
|---|---|
| Domains | Benchmarking, Serving, Performance_Testing |
| Last Updated | 2026-02-09 04:00 GMT |
Overview
Methodology for measuring LLM serving performance by sending concurrent streaming requests and capturing first-token, next-token, and throughput metrics.
Description
This benchmarking methodology evaluates LLM serving endpoints by sending configurable numbers of concurrent HTTP requests with streaming responses and measuring key latency metrics. The approach captures first-token latency (time to first generated token), next-token latency (inter-token interval), and total throughput. It supports both text-only and multimodal (text+image) request formats and provides statistical aggregation (mean, P50, P90, P99) across all requests.
Usage
Use this methodology when evaluating the real-world performance of deployed vLLM or other OpenAI-compatible serving endpoints. It simulates realistic client behavior with concurrent streaming requests rather than simple sequential benchmarking.
Theoretical Basis
Key metrics for LLM serving:
- First-token latency (TTFT): Time from request to first token
- Next-token latency (TPOT): Average inter-token time after first token
- Throughput: Total tokens generated per second across all requests
Pseudo-code Logic:
# Abstract benchmark methodology
with ThreadPoolExecutor(max_workers=concurrency) as pool:
futures = [pool.submit(send_streaming_request, prompt) for prompt in prompts]
for future in futures:
first_token_time, token_times, total_time = future.result()
record_metrics(first_token_time, token_times, total_time)
print_statistics(mean, p50, p90, p99)