Implementation:Triton inference server Server GenAI Perf

From Leeroopedia

Metadata

Field | Value
Type | Implementation
Workflow | LLM_Deployment_With_TRT_LLM
Repo | Triton_inference_server_Server
Source | docs/getting_started/llm.md:L305-348
Domains | Performance, NLP, Benchmarking
Knowledge_Sources | TRT-LLM Docs (https://nvidia.github.io/TensorRT-LLM/); Triton Server repo (https://github.com/triton-inference-server/server)
External_dep | perf_analyzer package; nvcr.io/nvidia/tritonserver:24.07-py3-sdk container
implements | Principle:Triton_inference_server_Server_LLM_Benchmarking
2026-02-13 17:00 GMT

Overview

Concrete benchmarking CLI for measuring LLM serving performance on Triton with GenAI-specific metrics. The genai-perf tool extends the traditional perf_analyzer with LLM-aware workload generation and metric collection.

Description

genai-perf is a purpose-built benchmarking tool for generative AI workloads served by Triton Inference Server. It generates synthetic prompts with configurable token lengths, sends them to the server at specified concurrency levels, and collects LLM-specific metrics, including time to first token (TTFT), inter-token latency (ITL), and token throughput.

The tool is typically run from the Triton SDK container (nvcr.io/nvidia/tritonserver:24.07-py3-sdk), which includes the perf_analyzer binary and the GenAI-Perf Python package.

Key capabilities:

  • Synthetic workload generation: Creates prompts with a configurable mean input token length, counted with a specified tokenizer
  • Streaming measurement: Measures streamed responses with per-token timing (gRPC streaming for the triton service kind used on this page; SSE applies to OpenAI-compatible HTTP endpoints)
  • Concurrency control: Runs at specified concurrency levels to characterize how the server scales
  • Artifact generation: Produces JSON and CSV result files for further analysis

Usage

Run genai-perf from a separate host or container while the Triton server is running. The commands on this page require gRPC connectivity to the server (default port 8001).
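Because the benchmark depends on reaching the gRPC port, a quick pre-flight check can save a confusing failure later. The following is a minimal sketch, assuming bash and the coreutils `timeout` command are available; `split_endpoint` and `port_is_open` are hypothetical helper names, not part of genai-perf.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of genai-perf) for a pre-flight check.

# Print "host port" for an endpoint written as host:port.
split_endpoint() {
    local endpoint="$1"
    printf '%s %s\n' "${endpoint%:*}" "${endpoint##*:}"
}

# Return 0 if a TCP connection to host:port opens within 3 seconds.
# Relies on bash's /dev/tcp pseudo-device and on coreutils' timeout.
port_is_open() {
    local host="$1" port="$2"
    timeout 3 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null
}

# Check the default gRPC endpoint used on this page.
read -r host port <<EOF
$(split_endpoint "localhost:8001")
EOF

if port_is_open "$host" "$port"; then
    echo "reachable: ${host}:${port}"
else
    echo "not reachable: ${host}:${port} (is the server up and the port published?)"
fi
```

An open TCP port only shows that something is listening; it does not confirm that the model is loaded and ready to serve requests.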

Code Reference

Source Location

Item | Value
File | docs/getting_started/llm.md
Lines | L305-348
Repo | https://github.com/triton-inference-server/server
Tool | genai-perf (from perf_analyzer / Triton SDK)

Signature

genai-perf \
    -m ensemble \
    --service-kind triton \
    --backend tensorrtllm \
    --random-seed 123 \
    --synthetic-input-tokens-mean $INPUT_LEN \
    --streaming \
    --output-tokens-mean $OUTPUT_LEN \
    --concurrency $CONC \
    --tokenizer microsoft/Phi-3-mini-4k-instruct \
    --measurement-interval 4000 \
    --url localhost:8001
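The command above expects INPUT_LEN, OUTPUT_LEN, and CONC to be set in the environment; if one is empty, its flag is left without a value. A minimal sketch of a guard that could run first is shown below. `require_positive_int` is a hypothetical helper, and the defaults of 128, 128, and 1 are taken from the basic example on this page.

```shell
#!/usr/bin/env bash
# Hypothetical guard (not part of genai-perf): check benchmark parameters
# before they are substituted into the command line.

# Return 0 if VALUE is a positive integer; otherwise print an error and return 1.
require_positive_int() {
    local name="$1" value="$2"
    case "$value" in
        ''|*[!0-9]*|0)
            echo "error: $name must be a positive integer, got '$value'" >&2
            return 1
            ;;
    esac
}

# Fall back to the values used in the basic example when a variable is unset.
INPUT_LEN="${INPUT_LEN:-128}"    # mean input prompt length in tokens
OUTPUT_LEN="${OUTPUT_LEN:-128}"  # mean output length in tokens
CONC="${CONC:-1}"                # number of concurrent requests

if require_positive_int INPUT_LEN "$INPUT_LEN" &&
   require_positive_int OUTPUT_LEN "$OUTPUT_LEN" &&
   require_positive_int CONC "$CONC"; then
    echo "parameters ok: INPUT_LEN=$INPUT_LEN OUTPUT_LEN=$OUTPUT_LEN CONC=$CONC"
fi
```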

Import / Verification

# Verify genai-perf is available
genai-perf --help

# Typically available in the Triton SDK container
docker run --rm nvcr.io/nvidia/tritonserver:24.07-py3-sdk genai-perf --help

I/O Contract

Inputs

Name | Type | Description
-m | String | Model name to benchmark (e.g., ensemble)
--service-kind | String | Service type: triton or openai
--backend | String | Backend type when the service kind is triton (tensorrtllm here)
--random-seed | Integer | Random seed for reproducible synthetic input generation
--synthetic-input-tokens-mean | Integer | Mean input prompt length in tokens
--streaming | Flag | Enable the streaming API so that per-token timing can be measured
--output-tokens-mean | Integer | Mean output length in tokens
--concurrency | Integer | Number of concurrent requests
--tokenizer | String | HuggingFace tokenizer name or path for token counting
--measurement-interval | Integer | Measurement window in milliseconds
--url | String | Triton gRPC endpoint as host:port (e.g., localhost:8001)

Outputs

Name | Type | Description
TTFT (ms) | Metric | Time To First Token: latency to the first generated token (p50, p90, p99, avg)
ITL (ms) | Metric | Inter-Token Latency: average time between consecutive tokens (p50, p90, p99, avg)
Request Latency (ms) | Metric | End-to-end request latency (p50, p90, p99, avg)
Output Token Throughput (tokens/sec) | Metric | Total output tokens generated per second across all requests
Request Throughput (req/sec) | Metric | Completed requests per second
JSON results | File | Detailed results in the artifacts/ directory as JSON
CSV results | File | Summary results in the artifacts/ directory as CSV
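To see which result files a run produced, the artifacts/ directory can be searched by extension. This is a minimal sketch using only standard tools; `list_result_files` is a hypothetical helper, and it makes no assumption about how genai-perf names files or subdirectories.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list CSV and JSON files under a results directory.
# Files are matched by extension only; no genai-perf naming scheme is assumed.
list_result_files() {
    local dir="${1:-artifacts}"
    if [ ! -d "$dir" ]; then
        echo "no such directory: $dir" >&2
        return 1
    fi
    find "$dir" -type f \( -name '*.csv' -o -name '*.json' \) | sort
}

# Look in the default location relative to the current directory.
list_result_files artifacts || echo "run a benchmark first to create artifacts/"
```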

Usage Examples

Basic benchmarking run

export INPUT_LEN=128
export OUTPUT_LEN=128
export CONC=1

genai-perf \
    -m ensemble \
    --service-kind triton \
    --backend tensorrtllm \
    --random-seed 123 \
    --synthetic-input-tokens-mean $INPUT_LEN \
    --streaming \
    --output-tokens-mean $OUTPUT_LEN \
    --concurrency $CONC \
    --tokenizer microsoft/Phi-3-mini-4k-instruct \
    --measurement-interval 4000 \
    --url localhost:8001

Concurrency sweep

export INPUT_LEN=128
export OUTPUT_LEN=128

for CONC in 1 2 4 8 16 32; do
    echo "=== Concurrency: $CONC ==="
    genai-perf \
        -m ensemble \
        --service-kind triton \
        --backend tensorrtllm \
        --random-seed 123 \
        --synthetic-input-tokens-mean $INPUT_LEN \
        --streaming \
        --output-tokens-mean $OUTPUT_LEN \
        --concurrency $CONC \
        --tokenizer microsoft/Phi-3-mini-4k-instruct \
        --measurement-interval 4000 \
        --url localhost:8001
    echo ""
done
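When a sweep covers many levels, it can help to keep each run's console output in its own log file and to stop at the first failure. The sketch below shows one way to do that with plain shell; `run_sweep` and the `conc_<N>.log` naming are hypothetical, and `echo` stands in for genai-perf so the loop can be tried without a server.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: run a command once per concurrency level and
# save each run's output to <log_dir>/conc_<N>.log.
#
# usage: run_sweep LOG_DIR "LEVELS" COMMAND [ARGS...]
run_sweep() {
    local log_dir="$1" levels="$2" conc
    shift 2
    mkdir -p "$log_dir"
    for conc in $levels; do
        echo "=== Concurrency: $conc ==="
        # "$@" is the benchmark command; --concurrency is appended per level.
        if ! "$@" --concurrency "$conc" > "$log_dir/conc_${conc}.log" 2>&1; then
            echo "run failed at concurrency $conc, see $log_dir/conc_${conc}.log" >&2
            return 1
        fi
    done
}

# Dry run: echo prints the command line that would be executed.
log_dir="$(mktemp -d)"
run_sweep "$log_dir" "1 2 4" echo genai-perf -m ensemble
cat "$log_dir/conc_4.log"
```

For a real sweep, replace `echo genai-perf -m ensemble` with genai-perf and the flags from the loop above, leaving out `--concurrency`, which the wrapper supplies.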

Run from SDK container

docker run -it --rm \
    --network host \
    nvcr.io/nvidia/tritonserver:24.07-py3-sdk \
    bash -c "genai-perf \
        -m ensemble \
        --service-kind triton \
        --backend tensorrtllm \
        --random-seed 123 \
        --synthetic-input-tokens-mean 128 \
        --streaming \
        --output-tokens-mean 128 \
        --concurrency 4 \
        --tokenizer microsoft/Phi-3-mini-4k-instruct \
        --measurement-interval 4000 \
        --url localhost:8001"

Example output

                          LLM Metrics
┌─────────────────────────┬──────┬──────┬──────┬──────┐
│ Metric                  │ p50  │ p90  │ p99  │ avg  │
├─────────────────────────┼──────┼──────┼──────┼──────┤
│ TTFT (ms)               │ 25.3 │ 31.2 │ 45.1 │ 27.8 │
│ ITL (ms)                │  8.1 │ 10.3 │ 15.7 │  8.9 │
│ Request Latency (ms)    │ 1050 │ 1180 │ 1350 │ 1090 │
│ Output Throughput (t/s) │  —   │  —   │  —   │ 450  │
│ Request Throughput (r/s)│  —   │  —   │  —   │  3.5 │
└─────────────────────────┴──────┴──────┴──────┴──────┘
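The metrics in such a table can be cross-checked with two rough relations, both stated here as assumptions: a streamed request with N output tokens takes about TTFT + (N - 1) × ITL, and output token throughput is about request throughput × mean output tokens. The sketch below applies them to the p50 row and the throughput rows, assuming the run used 128 output tokens per request; the function names are hypothetical.

```shell
#!/usr/bin/env bash
# Rough consistency checks for a results table (approximations, not exact identities).

# estimate_latency_ms TTFT_MS ITL_MS OUTPUT_TOKENS
# Approximates request latency as TTFT + (N - 1) * ITL.
estimate_latency_ms() {
    awk -v ttft="$1" -v itl="$2" -v n="$3" \
        'BEGIN { printf "%.1f\n", ttft + (n - 1) * itl }'
}

# estimate_token_throughput REQUESTS_PER_SEC OUTPUT_TOKENS
# Approximates output tokens/sec as requests/sec * tokens per request.
estimate_token_throughput() {
    awk -v rps="$1" -v n="$2" 'BEGIN { printf "%.1f\n", rps * n }'
}

estimate_latency_ms 25.3 8.1 128     # prints 1054.0 (table p50: 1050 ms)
estimate_token_throughput 3.5 128    # prints 448.0  (table: 450 tokens/sec)
```

Both estimates land within rounding of the table values. A large gap in a real run would suggest that the actual output lengths differ from the requested mean.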

Key Parameters

Parameter | Description | Example Value
-m | Model name | ensemble
--service-kind | Service type | triton
--backend | Model backend | tensorrtllm
--random-seed | Reproducibility seed | 123
--synthetic-input-tokens-mean | Mean input token count | 128
--streaming | Enable the streaming API for per-token metrics | (flag, takes no value)
--output-tokens-mean | Mean output token count | 128
--concurrency | Concurrent request count | 1, 4, 16
--tokenizer | Tokenizer for token counting | microsoft/Phi-3-mini-4k-instruct
--measurement-interval | Measurement window (ms) | 4000
--url | Triton gRPC endpoint | localhost:8001
