Principle:Triton inference server Server LLM Benchmarking

From Leeroopedia

Metadata

Field Value
Type Principle
Principle_type External Tool Doc
Workflow LLM_Deployment_With_TRT_LLM
Repo Triton_inference_server_Server
Source docs/getting_started/llm.md:L305-348
Domains Performance, NLP, Benchmarking
Knowledge_Sources TRT-LLM Docs|https://nvidia.github.io/TensorRT-LLM/, source::Repo|Triton Server|https://github.com/triton-inference-server/server
implemented_by Implementation:Triton_inference_server_Server_GenAI_Perf
2026-02-13 17:00 GMT

Overview

Process of measuring LLM serving performance using metrics specific to generative AI workloads.

Description

LLM benchmarking extends traditional inference benchmarking with metrics unique to auto-regressive generation: Time To First Token (TTFT), Inter-Token Latency (ITL), output token throughput, and streaming latency. These metrics capture the user-perceived quality of LLM serving systems.

Traditional inference benchmarking focuses on request latency and throughput. These alone are insufficient for LLM workloads because auto-regressive generation produces output incrementally, token by token, so end-to-end latency does not capture the user experience of streaming. The LLM-specific metrics fill that gap:

  • Time To First Token (TTFT) measures responsiveness: how quickly the user sees the first word of the response
  • Inter-Token Latency (ITL) measures the smoothness of streaming: consistent ITL produces a fluid reading experience, while variable ITL causes stuttering
  • Output token throughput measures system capacity: how many tokens per second the system can produce across all concurrent requests

LLM benchmarking also requires:

  • Synthetic workload generation — Creating representative prompt distributions with controlled input/output token lengths
  • Concurrency sweeps — Testing at various concurrency levels to characterize how the system scales under load
  • Tokenizer awareness — Measuring in tokens (not bytes or characters) to align with the model's actual processing units
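
As a minimal sketch of synthetic workload generation, the snippet below draws per-request input and output token counts from normal distributions using a seeded generator. The function name `synthetic_lengths` and the mean and standard deviation values are assumptions chosen for illustration; they are not taken from the source or from any benchmarking tool. Turning these token counts into actual prompts would additionally require the model's tokenizer, which is not shown here.

```python
import random


def synthetic_lengths(num_requests, input_mean, input_stddev,
                      output_mean, output_stddev, seed=0):
    """Return (input_tokens, output_tokens) pairs for a synthetic workload.

    Hypothetical helper: lengths are drawn from normal distributions and
    clamped to at least one token. A fixed seed makes the workload repeatable.
    """
    rng = random.Random(seed)
    pairs = []
    for _ in range(num_requests):
        n_in = max(1, round(rng.gauss(input_mean, input_stddev)))
        n_out = max(1, round(rng.gauss(output_mean, output_stddev)))
        pairs.append((n_in, n_out))
    return pairs


# Illustrative values only: 100 requests, ~200 input and ~100 output tokens.
workload = synthetic_lengths(100, input_mean=200, input_stddev=20,
                             output_mean=100, output_stddev=10)
```

Setting both standard deviations to 0 yields fixed-length requests, which is convenient when comparing two server configurations on identical load.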

Usage

This principle is applied after the server is running and accepting requests. Benchmarking is typically performed iteratively to characterize the system's performance envelope across different configurations.

Theoretical Basis

LLM-specific metrics:

Metric | Abbreviation | Description | Unit
Time To First Token | TTFT | Latency from request submission to first token received | milliseconds (ms)
Inter-Token Latency | ITL | Average latency between consecutive output tokens | milliseconds (ms)
Output Token Throughput | (none) | Total tokens generated per second across all requests | tokens/sec
Request Throughput | (none) | Number of completed requests per second | requests/sec
Request Latency | (none) | End-to-end time from request submission to final token | milliseconds (ms)
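
To make the definitions above concrete, the following sketch derives each metric from per-request timestamps. The `RequestTrace` record and the function names are hypothetical and introduced only for this illustration; a real benchmarking tool collects these timestamps on the client side while streaming. The timestamps in the example are invented values, not measurements.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RequestTrace:
    """Hypothetical record of one streamed request (times in seconds)."""
    sent_at: float            # when the request was submitted
    token_times: List[float]  # arrival time of each output token


def ttft_ms(trace):
    """Time To First Token: submission to first token received."""
    return (trace.token_times[0] - trace.sent_at) * 1000


def itl_ms(trace):
    """Inter-Token Latency: mean gap between consecutive output tokens."""
    gaps = [b - a for a, b in zip(trace.token_times, trace.token_times[1:])]
    return 1000 * sum(gaps) / len(gaps) if gaps else 0.0


def request_latency_ms(trace):
    """End-to-end time from submission to the final token."""
    return (trace.token_times[-1] - trace.sent_at) * 1000


def _window_seconds(traces):
    """Wall-clock span from the first submission to the last token."""
    start = min(t.sent_at for t in traces)
    end = max(t.token_times[-1] for t in traces)
    return end - start


def output_token_throughput(traces):
    """Total output tokens per second across all requests."""
    return sum(len(t.token_times) for t in traces) / _window_seconds(traces)


def request_throughput(traces):
    """Completed requests per second."""
    return len(traces) / _window_seconds(traces)


# Invented timestamps for two concurrent requests.
traces = [
    RequestTrace(sent_at=0.0, token_times=[0.25, 0.5, 0.75, 1.0]),
    RequestTrace(sent_at=0.0, token_times=[0.5, 1.0, 1.5, 2.0]),
]
```

Note that throughput is computed over the shared wall-clock window rather than by summing per-request rates, which matches the "across all requests" wording in the table.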

These metrics are measured at various concurrency levels to characterize scaling behavior:

  • Low concurrency (1-4) — Tests single-request latency and basic responsiveness
  • Medium concurrency (8-32) — Tests batching efficiency and KV cache management
  • High concurrency (64+) — Tests system saturation, queue management, and throughput limits

Key relationships:

  • TTFT increases with concurrency as requests queue for GPU resources
  • ITL remains relatively stable until the system saturates, then increases sharply
  • Output throughput scales roughly linearly with concurrency up to a saturation point, then plateaus
  • The optimal operating point balances throughput against latency SLAs
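
One way to read the last relationship is as a constrained selection over the results of a concurrency sweep: among the concurrency levels that meet the latency SLAs, pick the one with the highest output throughput. The sketch below assumes this interpretation. The function `pick_operating_point`, the dictionary keys, the SLA thresholds, and every number in `sweep` are hypothetical and made up for illustration; they are not measurements from any system.

```python
def pick_operating_point(sweep, ttft_sla_ms, itl_sla_ms):
    """Return the sweep entry with the highest throughput that meets both SLAs.

    Hypothetical helper: returns None when no concurrency level is feasible.
    """
    feasible = [
        r for r in sweep
        if r["ttft_p99_ms"] <= ttft_sla_ms and r["itl_p99_ms"] <= itl_sla_ms
    ]
    if not feasible:
        return None
    return max(feasible, key=lambda r: r["tokens_per_sec"])


# Made-up sweep results shaped like the relationships described above:
# TTFT grows with concurrency, ITL stays flat until saturation, and
# throughput gains flatten at high concurrency.
sweep = [
    {"concurrency": 1,  "ttft_p99_ms": 40,  "itl_p99_ms": 12, "tokens_per_sec": 80},
    {"concurrency": 4,  "ttft_p99_ms": 60,  "itl_p99_ms": 13, "tokens_per_sec": 300},
    {"concurrency": 16, "ttft_p99_ms": 180, "itl_p99_ms": 18, "tokens_per_sec": 1000},
    {"concurrency": 64, "ttft_p99_ms": 900, "itl_p99_ms": 45, "tokens_per_sec": 1400},
]

best = pick_operating_point(sweep, ttft_sla_ms=500, itl_sla_ms=50)
```

With these illustrative numbers, concurrency 64 delivers the most tokens per second but violates the TTFT limit, so the selection falls back to concurrency 16.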

Measurement methodology:

  • Use synthetic inputs with controlled token lengths to enable reproducible comparisons
  • Set a measurement interval long enough to capture steady-state behavior (e.g., 4000 ms)
  • Use a fixed random seed for reproducibility across runs
  • Report percentile statistics (p50, p90, p99) in addition to averages
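
The percentile reporting step can be sketched with the standard library alone. The example below uses the nearest-rank definition of a percentile, which is one of several common conventions and an assumption here; benchmarking tools may interpolate differently. The helper names `percentile` and `summarize` are hypothetical.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(latencies_ms):
    """Report the average alongside p50, p90, and p99."""
    return {
        "avg": sum(latencies_ms) / len(latencies_ms),
        "p50": percentile(latencies_ms, 50),
        "p90": percentile(latencies_ms, 90),
        "p99": percentile(latencies_ms, 99),
    }


# Illustrative sample: 99 fast requests and one slow outlier.
sample = [10] * 99 + [1000]
stats = summarize(sample)
```

In this sample the single outlier roughly doubles the average while p50, p90, and p99 stay at 10 ms, which is why averages and percentiles are best reported together rather than one in place of the other.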
