Implementation:Triton inference server Server GenAI Perf
Metadata
| Field | Value |
|---|---|
| Type | Implementation |
| Workflow | LLM_Deployment_With_TRT_LLM |
| Repo | Triton_inference_server_Server |
| Source | docs/getting_started/llm.md:L305-348 |
| Domains | Performance, NLP, Benchmarking |
| Knowledge_Sources | TRT-LLM Docs (https://nvidia.github.io/TensorRT-LLM/), Triton Server repo (https://github.com/triton-inference-server/server) |
| External_dep | perf_analyzer package, nvcr.io/nvidia/tritonserver:24.07-py3-sdk container |
| implements | Principle:Triton_inference_server_Server_LLM_Benchmarking |
| 2026-02-13 17:00 GMT |
Overview
Concrete benchmarking CLI for measuring LLM serving performance on Triton with GenAI-specific metrics. The genai-perf tool extends the traditional perf_analyzer with LLM-aware workload generation and metric collection.
Description
genai-perf is a purpose-built benchmarking tool for generative AI workloads served by Triton Inference Server. It generates synthetic prompts with configurable token lengths, sends them to the server at specified concurrency levels, and collects LLM-specific metrics, including time to first token (TTFT), inter-token latency (ITL), and token throughput.
The tool is typically run from the Triton SDK container (nvcr.io/nvidia/tritonserver:24.07-py3-sdk), which includes the perf_analyzer binary and the GenAI-Perf Python package.
Key capabilities:
- Synthetic workload generation: creates prompts with a configurable mean input token length, using a specified tokenizer
- Streaming measurement: measures streaming response performance with per-token timing
- Concurrency control: tests at specified concurrency levels to characterize scaling
- Artifact generation: produces JSON and CSV result files for further analysis
Usage
Run genai-perf from a separate host or container while the Triton server is running. The client needs gRPC connectivity to the server (default port 8001).
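Because the benchmark depends on reaching the server's gRPC port, it can help to confirm that the port accepts TCP connections before starting a run. The following is a minimal sketch, not part of genai-perf: the `port_open` function and the `TRITON_HOST` / `TRITON_GRPC_PORT` variables are hypothetical names introduced here, and the check assumes `bash` and the coreutils `timeout` command are available. It only verifies that a TCP connection can be opened, not that the model is loaded or ready.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of genai-perf): succeed if a TCP connection
# to host:port can be opened within a short timeout.
port_open() {
  local host="$1" port="$2"
  # bash's /dev/tcp pseudo-device opens a TCP socket; `timeout` bounds the wait.
  timeout 3 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null
}

# Assumed defaults, matching the gRPC endpoint used on this page.
TRITON_HOST="${TRITON_HOST:-localhost}"
TRITON_GRPC_PORT="${TRITON_GRPC_PORT:-8001}"

if port_open "$TRITON_HOST" "$TRITON_GRPC_PORT"; then
  echo "reachable: ${TRITON_HOST}:${TRITON_GRPC_PORT}"
else
  echo "unreachable: ${TRITON_HOST}:${TRITON_GRPC_PORT}" >&2
fi
```

If the port is unreachable from inside a container, the container's network mode (for example `--network host`, as used in the SDK container example on this page) is one of the first things worth checking.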
Code Reference
Source Location
| Item | Value |
|---|---|
| File | docs/getting_started/llm.md |
| Lines | L305-348 |
| Repo | https://github.com/triton-inference-server/server |
| Tool | genai-perf (from perf_analyzer / Triton SDK) |
Signature
genai-perf \
-m ensemble \
--service-kind triton \
--backend tensorrtllm \
--random-seed 123 \
--synthetic-input-tokens-mean $INPUT_LEN \
--streaming \
--output-tokens-mean $OUTPUT_LEN \
--concurrency $CONC \
--tokenizer microsoft/Phi-3-mini-4k-instruct \
--measurement-interval 4000 \
--url localhost:8001
Import / Verification
# Verify genai-perf is available
genai-perf --help
# Typically available in the Triton SDK container
docker run --rm nvcr.io/nvidia/tritonserver:24.07-py3-sdk genai-perf --help
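The two checks above can be combined into a single script that reports which way of running the tool is available on the current machine. This is a sketch under the assumption that either `genai-perf` or `docker` may be on `PATH`; the `have_cmd` helper is a hypothetical name introduced for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of genai-perf): succeed if a command is on PATH.
have_cmd() {
  command -v "$1" >/dev/null 2>&1
}

if have_cmd genai-perf; then
  echo "genai-perf found: $(command -v genai-perf)"
elif have_cmd docker; then
  echo "genai-perf not on PATH; docker is available, so the SDK container can be used"
else
  echo "neither genai-perf nor docker found on PATH" >&2
fi
```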
I/O Contract
Inputs
| Name | Type | Description |
|---|---|---|
| -m | String | Model name to benchmark (e.g., ensemble) |
| --service-kind | String | Service type: triton or openai |
| --backend | String | Backend type: tensorrtllm |
| --random-seed | Integer | Random seed for reproducible synthetic input generation |
| --synthetic-input-tokens-mean | Integer | Mean input prompt length in tokens |
| --streaming | Flag | Enable streaming measurement |
| --output-tokens-mean | Integer | Mean output length in tokens |
| --concurrency | Integer | Number of concurrent requests |
| --tokenizer | String | HuggingFace tokenizer name or path for token counting |
| --measurement-interval | Integer | Measurement window in milliseconds |
| --url | String | Triton gRPC endpoint URL |
Outputs
| Name | Type | Description |
|---|---|---|
| TTFT (ms) | Metric | Time To First Token — latency to first generated token (p50, p90, p99, avg) |
| ITL (ms) | Metric | Inter-Token Latency — average time between consecutive tokens (p50, p90, p99, avg) |
| Request Latency (ms) | Metric | End-to-end request latency (p50, p90, p99, avg) |
| Output Token Throughput (tokens/sec) | Metric | Total output tokens generated per second across all requests |
| Request Throughput (req/sec) | Metric | Completed requests per second |
| JSON results | File | Detailed results in artifacts/ directory as JSON |
| CSV results | File | Summary results in artifacts/ directory as CSV |
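After a run, the result files can be located by searching the artifacts/ directory. The sketch below is a hypothetical helper, not a genai-perf feature: `list_result_files` is a name introduced here, and it deliberately matches on file extension only, because the exact subdirectory layout and file names are not stated on this page.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of genai-perf): list JSON and CSV result
# files under an artifacts directory, sorted by path.
list_result_files() {
  local dir="${1:-artifacts}"
  # Fail early with a message if the directory does not exist yet.
  [ -d "$dir" ] || { echo "no such directory: $dir" >&2; return 1; }
  find "$dir" -type f \( -name '*.json' -o -name '*.csv' \) | sort
}

# Example call (run after a benchmark has produced output):
#   list_result_files artifacts
```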
Usage Examples
Basic benchmarking run
export INPUT_LEN=128
export OUTPUT_LEN=128
export CONC=1
genai-perf \
-m ensemble \
--service-kind triton \
--backend tensorrtllm \
--random-seed 123 \
--synthetic-input-tokens-mean $INPUT_LEN \
--streaming \
--output-tokens-mean $OUTPUT_LEN \
--concurrency $CONC \
--tokenizer microsoft/Phi-3-mini-4k-instruct \
--measurement-interval 4000 \
--url localhost:8001
Concurrency sweep
export INPUT_LEN=128
export OUTPUT_LEN=128
for CONC in 1 2 4 8 16 32; do
  echo "=== Concurrency: $CONC ==="
  genai-perf \
    -m ensemble \
    --service-kind triton \
    --backend tensorrtllm \
    --random-seed 123 \
    --synthetic-input-tokens-mean "$INPUT_LEN" \
    --streaming \
    --output-tokens-mean "$OUTPUT_LEN" \
    --concurrency "$CONC" \
    --tokenizer microsoft/Phi-3-mini-4k-instruct \
    --measurement-interval 4000 \
    --url localhost:8001
  echo ""
done
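The same flags recur in every example on this page, so the sweep can also be written around a small wrapper function. The following is a sketch under stated assumptions rather than a recommended interface: `run_benchmark`, `DRY_RUN`, and `TRITON_URL` are hypothetical names introduced here, while the genai-perf flags are exactly those shown above. It defaults to a dry run that prints each command, so the sweep can be reviewed before any load is sent to the server.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper (not part of genai-perf): run one benchmark at the
# given concurrency using the flags shown on this page.
# When DRY_RUN is non-empty, the command is printed instead of executed.
run_benchmark() {
  local conc="$1"
  ${DRY_RUN:+echo} genai-perf \
    -m ensemble \
    --service-kind triton \
    --backend tensorrtllm \
    --random-seed 123 \
    --synthetic-input-tokens-mean "${INPUT_LEN:-128}" \
    --streaming \
    --output-tokens-mean "${OUTPUT_LEN:-128}" \
    --concurrency "$conc" \
    --tokenizer microsoft/Phi-3-mini-4k-instruct \
    --measurement-interval 4000 \
    --url "${TRITON_URL:-localhost:8001}"
}

# Default to a dry run so the sweep can be reviewed first;
# invoke the script with DRY_RUN= (empty) to execute the benchmarks.
DRY_RUN="${DRY_RUN-1}"

for conc in 1 2 4 8 16 32; do
  echo "=== Concurrency: $conc ==="
  # Stop at the first failing run rather than continuing the sweep.
  run_benchmark "$conc" || { echo "run failed at concurrency $conc" >&2; break; }
done
```

Stopping at the first failure is a design choice for this sketch: a failed run at one concurrency level usually means later, heavier levels would fail too.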
Run from SDK container
docker run -it --rm \
--network host \
nvcr.io/nvidia/tritonserver:24.07-py3-sdk \
bash -c "genai-perf \
-m ensemble \
--service-kind triton \
--backend tensorrtllm \
--random-seed 123 \
--synthetic-input-tokens-mean 128 \
--streaming \
--output-tokens-mean 128 \
--concurrency 4 \
--tokenizer microsoft/Phi-3-mini-4k-instruct \
--measurement-interval 4000 \
--url localhost:8001"
Example output
LLM Metrics
┌─────────────────────────┬──────┬──────┬──────┬──────┐
│ Metric │ p50 │ p90 │ p99 │ avg │
├─────────────────────────┼──────┼──────┼──────┼──────┤
│ TTFT (ms) │ 25.3 │ 31.2 │ 45.1 │ 27.8 │
│ ITL (ms) │ 8.1 │ 10.3 │ 15.7 │ 8.9 │
│ Request Latency (ms) │ 1050 │ 1180 │ 1350 │ 1090 │
│ Output Throughput (t/s) │ — │ — │ — │ 450 │
│ Request Throughput (r/s)│ — │ — │ — │ 3.5 │
└─────────────────────────┴──────┴──────┴──────┴──────┘
Key Parameters
| Parameter | Description | Example Value |
|---|---|---|
| -m | Model name | ensemble |
| --service-kind | Service backend type | triton |
| --backend | Model backend | tensorrtllm |
| --random-seed | Reproducibility seed | 123 |
| --synthetic-input-tokens-mean | Mean input token count | 128 |
| --streaming | Enable streaming metrics | (flag) |
| --output-tokens-mean | Mean output token count | 128 |
| --concurrency | Concurrent request count | 1, 4, 16 |
| --tokenizer | Tokenizer for token counting | microsoft/Phi-3-mini-4k-instruct |
| --measurement-interval | Measurement window (ms) | 4000 |
| --url | Triton gRPC endpoint | localhost:8001 |
Related Pages
- Principle:Triton_inference_server_Server_LLM_Benchmarking
- Implementation:Triton_inference_server_Server_Launch_Triton_Server_Script — Server must be running
- Implementation:Triton_inference_server_Server_HTTP_Generate_Endpoint — API endpoint being benchmarked
- Environment:Triton_inference_server_Server_TRT_LLM_Deployment