Principle:Intel Ipex llm Serving Benchmark Methodology

Knowledge Sources	Intel IPEX-LLM
Domains	Benchmarking, Serving, Performance_Testing
Last Updated	2026-02-09 04:00 GMT

Overview

Methodology for measuring LLM serving performance by sending concurrent streaming requests and capturing first-token, next-token, and throughput metrics.

Description

This benchmarking methodology evaluates LLM serving endpoints by sending configurable numbers of concurrent HTTP requests with streaming responses and measuring key latency metrics. The approach captures first-token latency (time to first generated token), next-token latency (inter-token interval), and total throughput. It supports both text-only and multimodal (text+image) request formats and provides statistical aggregation (mean, P50, P90, P99) across all requests.

Usage

Use this methodology when evaluating the real-world performance of deployed vLLM or other OpenAI-compatible serving endpoints. It simulates realistic client behavior with concurrent streaming requests rather than simple sequential benchmarking.

Theoretical Basis

Key metrics for LLM serving:

First-token latency (TTFT): Time from request to first token
Next-token latency (TPOT): Average inter-token time after first token
Throughput: Total tokens generated per second across all requests

Pseudo-code Logic:

# Abstract benchmark methodology
with ThreadPoolExecutor(max_workers=concurrency) as pool:
    futures = [pool.submit(send_streaming_request, prompt) for prompt in prompts]
    for future in futures:
        first_token_time, token_times, total_time = future.result()
        record_metrics(first_token_time, token_times, total_time)
print_statistics(mean, p50, p90, p99)

Related Pages

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment