Principle:Triton inference server Server LLM Benchmarking

Metadata

Field	Value
Type	Principle
Principle_type	External Tool Doc
Workflow	LLM_Deployment_With_TRT_LLM
Repo	Triton_inference_server_Server
Source	docs/getting_started/llm.md:L305-348
Domains	Performance, NLP, Benchmarking
Knowledge_Sources	TRT-LLM Docs (https://nvidia.github.io/TensorRT-LLM/), Triton Server repo (https://github.com/triton-inference-server/server)
implemented_by	Implementation:Triton_inference_server_Server_GenAI_Perf
2026-02-13 17:00 GMT

Overview

The process of measuring LLM serving performance with metrics specific to generative AI workloads, such as time to first token and inter-token latency.

Description

LLM benchmarking extends traditional inference benchmarking with metrics specific to auto-regressive generation: the streaming latency metrics Time To First Token (TTFT) and Inter-Token Latency (ITL), plus output token throughput. TTFT and ITL capture the user-perceived quality of a streamed response, while output token throughput captures the capacity of the serving system.

Traditional inference benchmarking focuses on request latency and request throughput. On their own these are insufficient for LLM workloads; the points below explain why and what each LLM-specific metric adds:

Auto-regressive generation produces output incrementally, token by token, so end-to-end latency does not capture the user experience of streaming
Time To First Token (TTFT) measures the responsiveness of the system — how quickly the user sees the first word of the response
Inter-Token Latency (ITL) measures the smoothness of streaming — consistent ITL produces a fluid reading experience, while variable ITL causes stuttering
Output token throughput measures system capacity — how many tokens per second the system can produce across all concurrent requests

LLM benchmarking also requires:

Synthetic workload generation — Creating representative prompt distributions with controlled input/output token lengths
Concurrency sweeps — Testing at various concurrency levels to characterize how the system scales under load
Tokenizer awareness — Measuring in tokens (not bytes or characters) to align with the model's actual processing units
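
The first and third requirements above can be sketched as follows. This is a minimal illustration, not how any particular benchmarking tool does it: `make_synthetic_prompts`, `count_tokens`, and `VOCAB` are hypothetical names, and the whitespace split is only a stand-in for the model's real tokenizer, which a real benchmark would use to count tokens.

```python
import random


def make_synthetic_prompts(vocab, num_prompts, mean_tokens, stddev_tokens, seed=0):
    """Build prompts whose token counts follow a normal distribution.

    A seeded generator makes the workload identical across runs.
    """
    rng = random.Random(seed)
    prompts = []
    for _ in range(num_prompts):
        # Clamp to at least one token so no prompt is empty.
        n = max(1, round(rng.gauss(mean_tokens, stddev_tokens)))
        prompts.append(" ".join(rng.choice(vocab) for _ in range(n)))
    return prompts


def count_tokens(text):
    # ASSUMPTION: whitespace split stands in for the model's tokenizer.
    return len(text.split())


VOCAB = ["alpha", "beta", "gamma", "delta", "epsilon"]  # hypothetical vocabulary
prompts = make_synthetic_prompts(
    VOCAB, num_prompts=100, mean_tokens=200, stddev_tokens=10, seed=42
)
lengths = [count_tokens(p) for p in prompts]
print(f"prompts={len(prompts)} mean_tokens={sum(lengths) / len(lengths):.1f}")
```

Measuring `lengths` with the same tokenizer the model uses is what keeps the controlled input length meaningful; counting characters or bytes would drift from what the model actually processes.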

Usage

This principle is applied after the server is running and accepting requests. Benchmarking is typically performed iteratively to characterize the system's performance envelope across different configurations.

Workflow context:

Depends on: Principle:Triton_inference_server_Server_TRT_LLM_Server_Launch
Final step in the LLM deployment workflow

Theoretical Basis

LLM-specific metrics:

Metric	Abbreviation	Description	Unit
Time To First Token	TTFT	Latency from request submission to first token received	milliseconds (ms)
Inter-Token Latency	ITL	Average latency between consecutive output tokens	milliseconds (ms)
Output Token Throughput	—	Total tokens generated per second across all requests	tokens/sec
Request Throughput	—	Number of completed requests per second	requests/sec
Request Latency	—	End-to-end time from request submission to final token	milliseconds (ms)
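
To make the definitions in the table concrete, the sketch below derives each metric from per-token arrival timestamps. The `RequestTrace` record and the sample timestamps are hypothetical, chosen only for illustration; a real tool records these values while consuming the streamed response.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RequestTrace:
    """Timestamps (seconds) for one streamed request. Hypothetical record format."""
    sent_at: float
    token_times: List[float]  # arrival time of each output token


def ttft_ms(trace):
    """Time from request submission to the first output token."""
    return (trace.token_times[0] - trace.sent_at) * 1000.0


def itl_ms(trace):
    """Mean gap between consecutive output tokens (needs at least two tokens)."""
    gaps = [b - a for a, b in zip(trace.token_times, trace.token_times[1:])]
    return sum(gaps) / len(gaps) * 1000.0


def request_latency_ms(trace):
    """Time from request submission to the final output token."""
    return (trace.token_times[-1] - trace.sent_at) * 1000.0


def summarize(traces):
    """Aggregate throughput over the window spanning all requests."""
    start = min(t.sent_at for t in traces)
    end = max(t.token_times[-1] for t in traces)
    window = end - start
    total_tokens = sum(len(t.token_times) for t in traces)
    return {
        "output_token_throughput": total_tokens / window,  # tokens/sec
        "request_throughput": len(traces) / window,        # requests/sec
    }


# Made-up timestamps for two concurrent requests.
traces = [
    RequestTrace(sent_at=0.0, token_times=[0.10, 0.12, 0.14, 0.16]),
    RequestTrace(sent_at=0.0, token_times=[0.20, 0.23, 0.26, 0.29, 0.32]),
]
for t in traces:
    print(round(ttft_ms(t), 3), round(itl_ms(t), 3), round(request_latency_ms(t), 3))
print(summarize(traces))
```

Note that the two throughput figures are computed across all requests in the window, whereas TTFT, ITL, and request latency are per-request values that are then summarized statistically.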

These metrics are measured at various concurrency levels to characterize scaling behavior:

Low concurrency (1-4) — Tests single-request latency and basic responsiveness
Medium concurrency (8-32) — Tests batching efficiency and KV cache management
High concurrency (64+) — Tests system saturation, queue management, and throughput limits
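
A sweep over these levels can be sketched as a loop that caps the number of in-flight requests at each level. The example below is an assumption-laden illustration: `fake_generate` is a hypothetical stand-in for a real streaming request, and the harness sends a fixed batch of prompts per level rather than holding load for a measurement interval as a dedicated benchmarking tool would.

```python
import asyncio
import time


async def fake_generate(prompt):
    """HYPOTHETICAL stand-in for a streaming request to the server."""
    await asyncio.sleep(0.001)
    return len(prompt.split())  # pretend one output token per input word


async def run_level(send, prompts, concurrency):
    """Send all prompts with at most `concurrency` requests in flight."""
    sem = asyncio.Semaphore(concurrency)

    async def one(prompt):
        async with sem:
            return await send(prompt)

    start = time.perf_counter()
    token_counts = await asyncio.gather(*(one(p) for p in prompts))
    elapsed = time.perf_counter() - start
    return {
        "concurrency": concurrency,
        "requests": len(token_counts),
        "output_tokens": sum(token_counts),
        "elapsed_s": elapsed,
    }


async def sweep(send, prompts, levels):
    # Levels run one after another so they do not interfere with each other.
    return [await run_level(send, prompts, c) for c in levels]


results = asyncio.run(sweep(fake_generate, ["a b c"] * 16, levels=[1, 4, 16]))
for r in results:
    print(r["concurrency"], r["requests"], r["output_tokens"])
```

Replacing `fake_generate` with a coroutine that calls the running server would turn this into a rough sweep; the timing numbers from the stand-in say nothing about real serving performance.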

Key relationships:

TTFT increases with concurrency as requests queue for GPU resources
ITL remains relatively stable until the system saturates, then increases sharply
Output throughput scales roughly linearly with concurrency up to a saturation point, then plateaus
The optimal operating point balances throughput against latency SLAs
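
One simple way to express the last relationship is to pick the sweep entry with the highest throughput among those that still meet the latency SLAs. The sketch below assumes p99 TTFT and p99 ITL as the SLA metrics; the function name, the dictionary keys, and all numbers are hypothetical and are not measurements from any system.

```python
def pick_operating_point(sweep_results, ttft_sla_ms, itl_sla_ms):
    """Return the highest-throughput entry that satisfies both SLAs, or None."""
    within_sla = [
        r for r in sweep_results
        if r["ttft_p99_ms"] <= ttft_sla_ms and r["itl_p99_ms"] <= itl_sla_ms
    ]
    return max(within_sla, key=lambda r: r["tokens_per_s"], default=None)


# ILLUSTRATIVE numbers only, shaped to follow the relationships listed above.
sweep_results = [
    {"concurrency": 1,  "ttft_p99_ms": 40,  "itl_p99_ms": 10, "tokens_per_s": 100},
    {"concurrency": 8,  "ttft_p99_ms": 80,  "itl_p99_ms": 11, "tokens_per_s": 720},
    {"concurrency": 32, "ttft_p99_ms": 250, "itl_p99_ms": 14, "tokens_per_s": 2300},
    {"concurrency": 64, "ttft_p99_ms": 900, "itl_p99_ms": 30, "tokens_per_s": 2500},
]

best = pick_operating_point(sweep_results, ttft_sla_ms=300, itl_sla_ms=20)
print(best["concurrency"] if best else "no level meets the SLAs")
```

In this made-up data the highest concurrency level has the highest throughput but violates both SLAs, so the selection falls back to the level just before saturation.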

Measurement methodology:

Use synthetic inputs with controlled token lengths to enable reproducible comparisons
Set a measurement interval long enough to capture steady-state behavior (e.g., 4000 ms)
Use a fixed random seed for reproducibility across runs
Report percentile statistics (p50, p90, p99) in addition to averages
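
The reporting step can be sketched as below. The nearest-rank percentile definition is one common choice, assumed here for simplicity (benchmarking tools may interpolate instead), and the latency samples are synthetic values drawn from a seeded generator, not real measurements.

```python
import math
import random
import statistics


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize_latencies(samples_ms):
    return {
        "avg": statistics.mean(samples_ms),
        "p50": percentile(samples_ms, 50),
        "p90": percentile(samples_ms, 90),
        "p99": percentile(samples_ms, 99),
    }


# SYNTHETIC samples: a fixed seed makes the "run" reproducible.
rng = random.Random(0)
latencies_ms = [rng.lognormvariate(3.0, 0.5) for _ in range(1000)]

report = summarize_latencies(latencies_ms)
print({name: round(value, 2) for name, value in report.items()})
```

Reporting p90 and p99 alongside the average matters because latency distributions tend to be skewed: a small number of slow requests can be invisible in the mean yet dominate what some users experience.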

Related Pages

Implementation:Triton_inference_server_Server_GenAI_Perf
Principle:Triton_inference_server_Server_TRT_LLM_Server_Launch — Server must be running
Principle:Triton_inference_server_Server_Generate_API — API used for benchmarking
