Principle:Triton inference server Server LLM Benchmarking
Metadata
| Field | Value |
|---|---|
| Type | Principle |
| Principle_type | External Tool Doc |
| Workflow | LLM_Deployment_With_TRT_LLM |
| Repo | Triton_inference_server_Server |
| Source | docs/getting_started/llm.md:L305-348 |
| Domains | Performance, NLP, Benchmarking |
| Knowledge_Sources | [TRT-LLM Docs](https://nvidia.github.io/TensorRT-LLM/), [Triton Server](https://github.com/triton-inference-server/server) (Repo) |
| implemented_by | Implementation:Triton_inference_server_Server_GenAI_Perf |
| 2026-02-13 17:00 GMT |
Overview
Process of measuring LLM serving performance using metrics specific to generative AI workloads.
Description
LLM benchmarking extends traditional inference benchmarking with metrics unique to auto-regressive generation: Time To First Token (TTFT), Inter-Token Latency (ITL), output token throughput, and streaming latency. These metrics capture the user-perceived quality of LLM serving systems.
Traditional inference benchmarking focuses on request latency and throughput. These are insufficient on their own for LLM workloads because auto-regressive generation produces output incrementally, token by token, so end-to-end latency does not capture the user experience of streaming. The LLM-specific metrics fill that gap:
- Time To First Token (TTFT) measures responsiveness: how quickly the user sees the first token of the response
- Inter-Token Latency (ITL) measures the smoothness of streaming: consistent ITL produces a fluid reading experience, while variable ITL causes stuttering
- Output token throughput measures system capacity: how many output tokens per second the system produces across all concurrent requests
LLM benchmarking also requires:
- Synthetic workload generation — Creating representative prompt distributions with controlled input/output token lengths
- Concurrency sweeps — Testing at various concurrency levels to characterize how the system scales under load
- Tokenizer awareness — Measuring in tokens (not bytes or characters) to align with the model's actual processing units
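As a rough illustration of synthetic workload generation, the sketch below builds prompts whose lengths are controlled in tokens rather than characters. The function names, the toy vocabulary, and the whitespace tokenizer are all assumptions made for this example; a real benchmark would count tokens with the served model's own tokenizer.

```python
import random


def count_tokens(text):
    # Stand-in tokenizer (assumption): one token per whitespace-separated word.
    # A real benchmark would use the tokenizer of the model being served.
    return len(text.split())


def make_synthetic_prompts(num_prompts, mean_input_tokens, stddev_input_tokens,
                           vocab, seed=0):
    """Build prompts whose token counts follow a normal distribution (minimum 1)."""
    rng = random.Random(seed)  # private, seeded RNG so the workload is repeatable
    prompts = []
    for _ in range(num_prompts):
        length = max(1, round(rng.gauss(mean_input_tokens, stddev_input_tokens)))
        prompts.append(" ".join(rng.choice(vocab) for _ in range(length)))
    return prompts


# Hypothetical vocabulary, used only to produce filler text.
VOCAB = ["alpha", "beta", "gamma", "delta", "epsilon"]

prompts = make_synthetic_prompts(4, mean_input_tokens=16, stddev_input_tokens=0,
                                 vocab=VOCAB, seed=42)
for p in prompts:
    print(count_tokens(p), p[:40])
```

Setting the standard deviation to zero pins every prompt to the same token count, which makes results easier to compare across configurations.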
Usage
This principle is applied after the server is running and accepting requests. Benchmarking is typically performed iteratively to characterize the system's performance envelope across different configurations.
Workflow context:
- Depends on: Principle:Triton_inference_server_Server_TRT_LLM_Server_Launch
- Final step in the LLM deployment workflow
Theoretical Basis
LLM-specific metrics:
| Metric | Abbreviation | Description | Unit |
|---|---|---|---|
| Time To First Token | TTFT | Latency from request submission to first token received | milliseconds (ms) |
| Inter-Token Latency | ITL | Average latency between consecutive output tokens | milliseconds (ms) |
| Output Token Throughput | — | Total output tokens generated per second across all concurrent requests | tokens/sec |
| Request Throughput | — | Number of completed requests per second | requests/sec |
| Request Latency | — | End-to-end time from request submission to final token | milliseconds (ms) |
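To make the definitions in the table concrete, here is a minimal sketch that derives each metric from per-request timestamps. The `RequestTrace` type and the helper names are hypothetical, and the sketch assumes every request produced at least one output token; it is not how any particular benchmarking tool is implemented.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RequestTrace:
    send_time: float          # seconds: when the request was submitted
    token_times: List[float]  # seconds: arrival time of each output token


def ttft_ms(trace):
    # Time To First Token: submission to first token received.
    return (trace.token_times[0] - trace.send_time) * 1000.0


def mean_itl_ms(trace):
    # Inter-Token Latency: average gap between consecutive output tokens.
    gaps = [b - a for a, b in zip(trace.token_times, trace.token_times[1:])]
    return sum(gaps) / len(gaps) * 1000.0 if gaps else 0.0


def request_latency_ms(trace):
    # End-to-end latency: submission to final token.
    return (trace.token_times[-1] - trace.send_time) * 1000.0


def _window_seconds(traces):
    # Wall-clock span from the first submission to the last token of any request.
    start = min(t.send_time for t in traces)
    end = max(t.token_times[-1] for t in traces)
    return end - start


def output_token_throughput(traces):
    # Output tokens per second across all requests in the window.
    return sum(len(t.token_times) for t in traces) / _window_seconds(traces)


def request_throughput(traces):
    # Completed requests per second in the window.
    return len(traces) / _window_seconds(traces)


# Two hand-written traces (illustrative timestamps, not measurements).
a = RequestTrace(send_time=0.0, token_times=[0.25, 0.30, 0.35, 0.40])
b = RequestTrace(send_time=0.0, token_times=[0.50, 0.75, 1.00])
print(ttft_ms(a), mean_itl_ms(a), request_latency_ms(a))
print(output_token_throughput([a, b]), request_throughput([a, b]))
```

Note that the two throughput figures are computed over the whole measurement window, while TTFT, ITL, and request latency are per-request values that are then aggregated.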
These metrics are measured at various concurrency levels to characterize scaling behavior:
- Low concurrency (1-4) — Tests single-request latency and basic responsiveness
- Medium concurrency (8-32) — Tests batching efficiency and KV cache management
- High concurrency (64+) — Tests system saturation, queue management, and throughput limits
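A concurrency sweep can be sketched as a loop that caps the number of in-flight requests at each level. In the example below, `fake_request` is a hypothetical stand-in for a real streaming call to the server, and the levels and request counts are arbitrary; only the structure of the sweep is the point.

```python
import asyncio
import time


async def fake_request(delay_s=0.001):
    # Hypothetical stand-in for a streaming request to the server.
    await asyncio.sleep(delay_s)
    return 8  # pretend the response contained 8 output tokens


async def run_level(concurrency, num_requests):
    sem = asyncio.Semaphore(concurrency)  # caps the number of in-flight requests
    in_flight = 0
    peak = 0

    async def worker():
        nonlocal in_flight, peak
        async with sem:
            in_flight += 1
            peak = max(peak, in_flight)
            try:
                return await fake_request()
            finally:
                in_flight -= 1

    start = time.perf_counter()
    tokens = await asyncio.gather(*(worker() for _ in range(num_requests)))
    elapsed = time.perf_counter() - start
    return {
        "concurrency": concurrency,
        "peak_in_flight": peak,
        "output_tokens": sum(tokens),
        "elapsed_s": elapsed,
    }


def sweep(levels=(1, 4, 16), num_requests=32):
    # Run each concurrency level in turn and collect one result row per level.
    return [asyncio.run(run_level(c, num_requests)) for c in levels]


if __name__ == "__main__":
    for row in sweep():
        print(row)
```

Running each level as a separate measurement keeps the results for low, medium, and high concurrency independent of one another.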
Key relationships:
- TTFT increases with concurrency as requests queue for GPU resources
- ITL remains relatively stable until the system saturates, then increases sharply
- Output throughput scales roughly linearly with concurrency while spare batching capacity remains, then plateaus once the system saturates
- The optimal operating point balances throughput against latency SLAs
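One simple way to express the last relationship is to pick the sweep result with the highest throughput among those that still meet the latency SLAs. The sketch below assumes a hypothetical result format and uses made-up numbers purely to show the selection logic; they are not measurements of any system.

```python
def pick_operating_point(sweep_results, max_ttft_ms, max_itl_ms):
    """Return the highest-throughput row that meets both latency limits, or None."""
    eligible = [
        r for r in sweep_results
        if r["ttft_p99_ms"] <= max_ttft_ms and r["itl_p99_ms"] <= max_itl_ms
    ]
    if not eligible:
        return None
    return max(eligible, key=lambda r: r["output_tokens_per_s"])


# Illustrative values only (assumption): invented to mimic the trends described
# above, with TTFT rising under load and throughput flattening near saturation.
ILLUSTRATIVE_SWEEP = [
    {"concurrency": 1,  "ttft_p99_ms": 40,  "itl_p99_ms": 10, "output_tokens_per_s": 100},
    {"concurrency": 8,  "ttft_p99_ms": 90,  "itl_p99_ms": 12, "output_tokens_per_s": 650},
    {"concurrency": 32, "ttft_p99_ms": 300, "itl_p99_ms": 20, "output_tokens_per_s": 1600},
    {"concurrency": 64, "ttft_p99_ms": 900, "itl_p99_ms": 45, "output_tokens_per_s": 1700},
]

best = pick_operating_point(ILLUSTRATIVE_SWEEP, max_ttft_ms=500, max_itl_ms=25)
print(best)
```

With these invented numbers the highest concurrency level offers slightly more throughput but violates both limits, so the selection falls back to the level below it.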
Measurement methodology:
- Use synthetic inputs with controlled token lengths to enable reproducible comparisons
- Set a measurement interval long enough to capture steady-state behavior (e.g., 4000 ms)
- Use a fixed random seed for reproducibility across runs
- Report percentile statistics (p50, p90, p99) in addition to averages
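The percentile reporting step can be sketched with a nearest-rank percentile over the collected latency samples. The helper names are hypothetical, and nearest-rank is only one of several common percentile definitions; a given benchmarking tool may interpolate differently.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    if not values:
        raise ValueError("percentile() needs at least one sample")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(values):
    # Report the average alongside p50, p90, and p99.
    return {
        "avg": sum(values) / len(values),
        "p50": percentile(values, 50),
        "p90": percentile(values, 90),
        "p99": percentile(values, 99),
    }


# Placeholder samples: the integers 1..100 stand in for latencies in milliseconds.
samples_ms = list(range(1, 101))
print(summarize(samples_ms))
```

Averages alone can hide tail behavior, which is why the higher percentiles are reported next to them.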
Related Pages
- Implementation:Triton_inference_server_Server_GenAI_Perf
- Principle:Triton_inference_server_Server_TRT_LLM_Server_Launch — Server must be running
- Principle:Triton_inference_server_Server_Generate_API — API used for benchmarking