Implementation:Triton inference server Server GenAI Perf

Metadata

Field	Value
Type	Implementation
Workflow	LLM_Deployment_With_TRT_LLM
Repo	Triton_inference_server_Server
Source	docs/getting_started/llm.md:L305-348
Domains	Performance, NLP, Benchmarking
Knowledge_Sources	TRT-LLM Docs (https://nvidia.github.io/TensorRT-LLM/), Triton Server repo (https://github.com/triton-inference-server/server)
External_dep	perf_analyzer package, nvcr.io/nvidia/tritonserver:24.07-py3-sdk container
implements	Principle:Triton_inference_server_Server_LLM_Benchmarking
2026-02-13 17:00 GMT

Overview

Concrete benchmarking CLI for measuring LLM serving performance on Triton with GenAI-specific metrics. The genai-perf tool extends the traditional perf_analyzer with LLM-aware workload generation and metric collection.

Description

genai-perf is a purpose-built benchmarking tool for generative AI workloads served by Triton Inference Server. It generates synthetic prompts with configurable token lengths, sends them to the server at specified concurrency levels, and collects LLM-specific metrics, including time to first token (TTFT), inter-token latency (ITL), and token throughput.

The tool is typically run from the Triton SDK container (nvcr.io/nvidia/tritonserver:24.07-py3-sdk), which includes the perf_analyzer binary and the GenAI-Perf Python package.

Key capabilities:

Synthetic workload generation: creates prompts with a configurable mean input token length, using a specified tokenizer
Streaming measurement: measures streamed responses with per-token timing (over gRPC streaming when targeting Triton on port 8001)
Concurrency control: tests at specified concurrency levels to characterize scaling
Artifact generation: produces JSON and CSV result files for further analysis

Usage

Run genai-perf from a separate host or container while the Triton server is running. The tool needs gRPC connectivity to the server (default port 8001).
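Because the benchmark only makes sense against a server that is already serving, a script can poll Triton's readiness endpoint before starting. The following is a minimal sketch, assuming curl is installed and that the server also exposes its HTTP endpoint on the default port 8000; the helper names `triton_ready` and `wait_for_triton` are hypothetical.

```shell
# Assumption: Triton's HTTP endpoint listens on its default port 8000
# alongside gRPC on 8001; pass a different host:port if needed.
triton_ready() {
    # /v2/health/ready answers HTTP 200 once the server can serve inference.
    curl -sf --max-time 2 "http://${1:-localhost:8000}/v2/health/ready" >/dev/null 2>&1
}

# Poll until the server is ready, up to a number of attempts (1 s apart).
wait_for_triton() {
    addr=${1:-localhost:8000}
    attempts=${2:-30}
    i=0
    while [ "$i" -lt "$attempts" ]; do
        triton_ready "$addr" && return 0
        i=$((i + 1))
        sleep 1
    done
    echo "error: Triton at $addr not ready after $attempts attempts" >&2
    return 1
}
```

A benchmark script could then start with `wait_for_triton localhost:8000 60 || exit 1` before invoking genai-perf.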

Code Reference

Source Location

Item	Value
File	docs/getting_started/llm.md
Lines	L305-348
Repo	https://github.com/triton-inference-server/server
Tool	genai-perf (from perf_analyzer / Triton SDK)

Signature

genai-perf \
    -m ensemble \
    --service-kind triton \
    --backend tensorrtllm \
    --random-seed 123 \
    --synthetic-input-tokens-mean $INPUT_LEN \
    --streaming \
    --output-tokens-mean $OUTPUT_LEN \
    --concurrency $CONC \
    --tokenizer microsoft/Phi-3-mini-4k-instruct \
    --measurement-interval 4000 \
    --url localhost:8001

Import / Verification

# Verify genai-perf is available
genai-perf --help

# Typically available in the Triton SDK container
docker run --rm nvcr.io/nvidia/tritonserver:24.07-py3-sdk genai-perf --help
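In a benchmark script it can help to fail early with a clear message when the CLI is missing. This is a minimal sketch for a POSIX shell; `require_cmd` is a hypothetical helper name, not part of genai-perf.

```shell
# Hypothetical helper: succeed only if the named command is on PATH.
require_cmd() {
    if ! command -v "$1" >/dev/null 2>&1; then
        echo "error: '$1' not found on PATH; try the Triton SDK container" >&2
        return 1
    fi
}

# Example guard for a real benchmark script:
# require_cmd genai-perf || exit 1
```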

I/O Contract

Inputs

Name	Type	Description
`-m`	String	Model name to benchmark (e.g., `ensemble`)
`--service-kind`	String	Service type: `triton` or `openai`
`--backend`	String	Backend type: `tensorrtllm`
`--random-seed`	Integer	Random seed for reproducible synthetic input generation
`--synthetic-input-tokens-mean`	Integer	Mean input prompt length in tokens
`--streaming`	Flag	Enable the streaming API so per-token timing (TTFT, ITL) can be measured
`--output-tokens-mean`	Integer	Mean output length in tokens
`--concurrency`	Integer	Number of concurrent requests
`--tokenizer`	String	HuggingFace tokenizer name or path for token counting
`--measurement-interval`	Integer	Measurement window in milliseconds
`--url`	String	Triton gRPC endpoint URL

Outputs

Name	Type	Description
TTFT (ms)	Metric	Time To First Token — latency to first generated token (p50, p90, p99, avg)
ITL (ms)	Metric	Inter-Token Latency — average time between consecutive tokens (p50, p90, p99, avg)
Request Latency (ms)	Metric	End-to-end request latency (p50, p90, p99, avg)
Output Token Throughput (tokens/sec)	Metric	Total output tokens generated per second across all requests
Request Throughput (req/sec)	Metric	Completed requests per second
JSON results	File	Detailed results in `artifacts/` directory as JSON
CSV results	File	Summary results in `artifacts/` directory as CSV
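To make the latency metrics above concrete, the following sketch derives TTFT, a mean ITL, and request latency for a single request from timestamps. The input format (request send time followed by one arrival time per output token, all in milliseconds) and the function name are hypothetical; genai-perf's internal computation may differ in detail.

```shell
# Hypothetical helper: arguments are the request send time followed by
# the arrival time of each output token, all in milliseconds.
token_latency_metrics() {
    echo "$@" | awk '{
        n = NF - 1                       # number of output tokens
        if (n < 1) { exit 1 }            # need at least one token
        ttft = $2 - $1                   # time to first token
        latency = $NF - $1               # end-to-end request latency
        # Mean gap between consecutive tokens; reported as 0 for one token.
        itl = (n > 1) ? ($NF - $2) / (n - 1) : 0
        printf "ttft_ms=%.1f itl_ms=%.1f latency_ms=%.1f\n", ttft, itl, latency
    }'
}

token_latency_metrics 0 25 33 41 49
# prints: ttft_ms=25.0 itl_ms=8.0 latency_ms=49.0
```

The percentile columns reported by the tool summarize such per-request values across all requests in the measurement window.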

Usage Examples

Basic benchmarking run

export INPUT_LEN=128
export OUTPUT_LEN=128
export CONC=1

genai-perf \
    -m ensemble \
    --service-kind triton \
    --backend tensorrtllm \
    --random-seed 123 \
    --synthetic-input-tokens-mean $INPUT_LEN \
    --streaming \
    --output-tokens-mean $OUTPUT_LEN \
    --concurrency $CONC \
    --tokenizer microsoft/Phi-3-mini-4k-instruct \
    --measurement-interval 4000 \
    --url localhost:8001

Concurrency sweep

export INPUT_LEN=128
export OUTPUT_LEN=128

for CONC in 1 2 4 8 16 32; do
    echo "=== Concurrency: $CONC ==="
    genai-perf \
        -m ensemble \
        --service-kind triton \
        --backend tensorrtllm \
        --random-seed 123 \
        --synthetic-input-tokens-mean $INPUT_LEN \
        --streaming \
        --output-tokens-mean $OUTPUT_LEN \
        --concurrency $CONC \
        --tokenizer microsoft/Phi-3-mini-4k-instruct \
        --measurement-interval 4000 \
        --url localhost:8001
    echo ""
done
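A sweep like this can take a while, so it may be worth previewing the exact commands first. The sketch below wraps the same invocation in a function with a dry-run switch. The names `run_genai_perf`, `DRY_RUN`, and `TRITON_URL` are hypothetical conveniences, not genai-perf features; the flags are the ones used above.

```shell
# Hypothetical wrapper around the command shown above. With DRY_RUN=1 it
# prints the command instead of running it.
run_genai_perf() {
    conc=$1
    set -- genai-perf \
        -m ensemble \
        --service-kind triton \
        --backend tensorrtllm \
        --random-seed 123 \
        --synthetic-input-tokens-mean "${INPUT_LEN:-128}" \
        --streaming \
        --output-tokens-mean "${OUTPUT_LEN:-128}" \
        --concurrency "$conc" \
        --tokenizer microsoft/Phi-3-mini-4k-instruct \
        --measurement-interval 4000 \
        --url "${TRITON_URL:-localhost:8001}"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Preview the sweep without contacting the server.
DRY_RUN=1
for conc in 1 2 4 8 16 32; do
    run_genai_perf "$conc"
done
```

Setting `DRY_RUN=0` and rerunning the loop executes the real benchmark at each concurrency level.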

Run from SDK container

docker run -it --rm \
    --network host \
    nvcr.io/nvidia/tritonserver:24.07-py3-sdk \
    bash -c "genai-perf \
        -m ensemble \
        --service-kind triton \
        --backend tensorrtllm \
        --random-seed 123 \
        --synthetic-input-tokens-mean 128 \
        --streaming \
        --output-tokens-mean 128 \
        --concurrency 4 \
        --tokenizer microsoft/Phi-3-mini-4k-instruct \
        --measurement-interval 4000 \
        --url localhost:8001"

Example output

                          LLM Metrics
┌─────────────────────────┬──────┬──────┬──────┬──────┐
│ Metric                  │ p50  │ p90  │ p99  │ avg  │
├─────────────────────────┼──────┼──────┼──────┼──────┤
│ TTFT (ms)               │ 25.3 │ 31.2 │ 45.1 │ 27.8 │
│ ITL (ms)                │  8.1 │ 10.3 │ 15.7 │  8.9 │
│ Request Latency (ms)    │ 1050 │ 1180 │ 1350 │ 1090 │
│ Output Throughput (t/s) │  —   │  —   │  —   │ 450  │
│ Request Throughput (r/s)│  —   │  —   │  —   │  3.5 │
└─────────────────────────┴──────┴──────┴──────┴──────┘
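As a rough consistency check on a results table like the one above, the throughput rows can be related to the latency rows. Output throughput divided by request throughput approximates the tokens generated per request, and request throughput multiplied by average request latency approximates the number of in-flight requests (Little's law). The sketch below applies this to the illustrative numbers above; the function name is hypothetical.

```shell
# Hypothetical helper. Arguments: output throughput (tokens/s),
# request throughput (req/s), average request latency (ms).
sanity_check() {
    awk -v tps="$1" -v rps="$2" -v lat_ms="$3" 'BEGIN {
        if (rps <= 0) { exit 1 }         # avoid division by zero
        printf "tokens_per_request=%.1f in_flight=%.1f\n",
            tps / rps, rps * lat_ms / 1000
    }'
}

sanity_check 450 3.5 1090
# prints: tokens_per_request=128.6 in_flight=3.8
```

Values near 128 and 4 would be consistent with a run using `--output-tokens-mean 128` and `--concurrency 4`. A large mismatch suggests the server was not saturated or the measurement window was too short.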

Key Parameters

Parameter	Description	Example Value
`-m`	Model name	`ensemble`
`--service-kind`	Service backend type	`triton`
`--backend`	Model backend	`tensorrtllm`
`--random-seed`	Reproducibility seed	`123`
`--synthetic-input-tokens-mean`	Mean input token count	`128`
`--streaming`	Enable streaming metrics	(flag)
`--output-tokens-mean`	Mean output token count	`128`
`--concurrency`	Concurrent request count	`1`, `4`, `16`
`--tokenizer`	Tokenizer for token counting	`microsoft/Phi-3-mini-4k-instruct`
`--measurement-interval`	Measurement window (ms)	`4000`
`--url`	Triton gRPC endpoint	`localhost:8001`

Related Pages

Principle:Triton_inference_server_Server_LLM_Benchmarking
Implementation:Triton_inference_server_Server_Launch_Triton_Server_Script — Server must be running
Implementation:Triton_inference_server_Server_HTTP_Generate_Endpoint — API endpoint being benchmarked
Environment:Triton_inference_server_Server_TRT_LLM_Deployment
