Implementation:Triton inference server Server Perf Analyzer CLI

Field	Value
Page Type	Implementation
Title	Perf_Analyzer_CLI
Namespace	Triton_inference_server_Server
Domains	Performance, Model_Serving, Benchmarking
External Dependencies	perf_analyzer from nvcr.io/nvidia/tritonserver:<version>-py3-sdk or triton-model-analyzer pip package
Last Updated	2026-02-13 17:00 GMT

Overview

Perf Analyzer is a command-line tool for measuring inference performance on Triton Inference Server, with configurable concurrency sweeps. It is the standard benchmarking tool in the Triton ecosystem: it generates inference requests from synthetic or user-supplied data at controlled concurrency levels and reports throughput and latency statistics.

Description

Perf Analyzer connects to a running Triton Inference Server instance and sends inference requests at specified concurrency levels. At each level it measures throughput (inferences per second) and latency, reported as an average by default or as a selected percentile (for example p50, p90, p95, or p99), and prints a tabular summary that shows how throughput and latency change as concurrency increases.

The tool supports both HTTP and gRPC protocols, configurable input shapes and data, batch size specification, and measurement window control. It can operate in concurrency sweep mode (testing a range of concurrency values) or at a fixed concurrency level.

Key capabilities:

Sweep concurrency from a start to an end value with configurable step size
Report throughput and latency at each concurrency level
Support custom input data via JSON files for realistic workloads
Override input tensor shapes for models with variable-dimension inputs
Configure the measurement interval and the stability threshold used to decide when a measurement has settled
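
The sweep specification can be read as a simple arithmetic progression. The sketch below expands a `start:end[:step]` string into the concurrency levels it would cover. It assumes the end value is inclusive (consistent with a 1:8 sweep reporting levels 1 through 8) and a default step of 1. The function name `expand_concurrency_range` is hypothetical and is not part of Perf Analyzer.

```python
from typing import List


def expand_concurrency_range(spec: str) -> List[int]:
    """Expand 'start[:end[:step]]' into a list of concurrency levels.

    Assumptions (not taken from the Perf Analyzer source): the end value
    is inclusive, the step defaults to 1, and a bare 'start' means a
    single fixed concurrency level.
    """
    parts = [int(p) for p in spec.split(":")]
    if not 1 <= len(parts) <= 3:
        raise ValueError(f"expected start[:end[:step]], got {spec!r}")
    start = parts[0]
    end = parts[1] if len(parts) > 1 else start
    step = parts[2] if len(parts) > 2 else 1
    if start < 1 or step < 1 or end < start:
        raise ValueError(f"invalid concurrency range: {spec!r}")
    # range() excludes its stop value, so add 1 to make the end inclusive.
    return list(range(start, end + 1, step))
```

Under these assumptions, a spec such as `1:16:2` never reaches 16 exactly, because the progression from 1 in steps of 2 only visits odd values.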

Usage

CLI Signature

perf_analyzer -m <model_name> \
  [--concurrency-range <start:end[:step]>] \
  [--percentile=<N>] \
  [-u <host:port>] \
  [-i <http|grpc>] \
  [-b <batch_size>] \
  [--shape <input_name>:<d1>,<d2>,...] \
  [--input-data <file>] \
  [--measurement-interval <ms>] \
  [--stability-percentage <pct>]

Key Parameters

Parameter	Description	Default
`-m`	Model name to benchmark	(required)
`--concurrency-range`	Concurrency sweep range in format start:end[:step]	1
`--percentile`	Latency percentile to report (e.g., 95 for p95)	None (average)
`-u`	Server URL (host:port)	localhost:8000 (HTTP) or localhost:8001 (gRPC)
`-i`	Protocol (http or grpc)	http
`-b`	Batch size per request	1
`--shape`	Override input shape (format: input_name:d1,d2,...)	Model default
`--input-data`	Path to JSON file with input data	Random/zero data
`--measurement-interval`	Measurement window in milliseconds	5000
`--stability-percentage`	Maximum variation, in percent, allowed in throughput and latency across recent measurement windows for a result to count as stable	10
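
When sweeps are scripted, it can help to assemble the argument list programmatically instead of concatenating strings. The following is a minimal sketch that maps the parameters in the table above onto an argv list. The function name `build_perf_analyzer_cmd` and its keyword names are hypothetical; only the flags themselves come from this page.

```python
from typing import Dict, List, Optional, Sequence


def build_perf_analyzer_cmd(
    model: str,
    concurrency_range: Optional[str] = None,
    percentile: Optional[int] = None,
    url: Optional[str] = None,
    protocol: Optional[str] = None,
    batch_size: Optional[int] = None,
    shapes: Optional[Dict[str, Sequence[int]]] = None,
    input_data: Optional[str] = None,
    measurement_interval_ms: Optional[int] = None,
    stability_percentage: Optional[float] = None,
) -> List[str]:
    """Build an argv list for perf_analyzer (hypothetical helper).

    Any option left as None is omitted, so the tool's own default applies.
    """
    cmd = ["perf_analyzer", "-m", model]
    if concurrency_range is not None:
        cmd += ["--concurrency-range", concurrency_range]
    if percentile is not None:
        cmd.append(f"--percentile={percentile}")
    if url is not None:
        cmd += ["-u", url]
    if protocol is not None:
        if protocol not in ("http", "grpc"):
            raise ValueError("protocol must be 'http' or 'grpc'")
        cmd += ["-i", protocol]
    if batch_size is not None:
        cmd += ["-b", str(batch_size)]
    # One --shape flag per input, formatted as name:d1,d2,...
    for name, dims in (shapes or {}).items():
        cmd += ["--shape", f"{name}:{','.join(str(d) for d in dims)}"]
    if input_data is not None:
        cmd += ["--input-data", input_data]
    if measurement_interval_ms is not None:
        cmd += ["--measurement-interval", str(measurement_interval_ms)]
    if stability_percentage is not None:
        cmd += ["--stability-percentage", str(stability_percentage)]
    return cmd
```

The returned list can be passed directly to `subprocess.run(cmd, check=True)`, which avoids shell quoting issues with input names or file paths.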

Code Reference

Source Location

docs/user_guide/performance_tuning.md:L71-107 -- Primary usage instructions for Perf Analyzer in the performance tuning workflow
docs/user_guide/performance_tuning.md:L285-300 -- Advanced Perf Analyzer options
docs/user_guide/optimization.md:L54-66 -- Perf Analyzer usage in optimization context

Import / Installation

# Option 1: Use the Triton SDK container (recommended)
docker run --rm --net=host nvcr.io/nvidia/tritonserver:<version>-py3-sdk \
  perf_analyzer -m <model_name> --concurrency-range 1:8

# Option 2: Install via pip (expected to provide perf_analyzer; confirm it is on PATH afterwards)
pip install triton-model-analyzer

I/O Contract

Inputs

Input	Type	Required	Description
Model name	String	Yes	Name of the model deployed on Triton (must match model repository directory name)
Running Triton server	Service	Yes	A Triton Inference Server instance serving the target model
Concurrency range	String	No	Start:end[:step] concurrency sweep specification
Input data file	File (JSON)	No	JSON file containing representative input data for realistic benchmarking
Input shape overrides	String	No	Tensor shape specifications for variable-dimension inputs

Outputs

Output	Type	Description
Throughput	Float (inferences/sec)	Number of inferences completed per second at each concurrency level
Latency	Integer (microseconds)	Average latency by default, or the percentile selected with --percentile (e.g., p95 or p99), at each concurrency level
Concurrency-performance table	Text (stdout)	Tabular summary mapping concurrency to throughput and latency
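
To post-process the stdout table, the summary lines can be parsed into structured rows. The sketch below assumes each line follows the pattern `Concurrency: N, throughput: X infer/sec, latency Y usec` shown in the sample output on this page. The exact wording may differ between Perf Analyzer versions, so the regular expression should be treated as an assumption. `SweepRow` and `parse_summary` are hypothetical names.

```python
import re
from typing import List, NamedTuple


class SweepRow(NamedTuple):
    concurrency: int
    throughput: float   # inferences per second
    latency_usec: int   # average or selected percentile, in microseconds


# Assumed line format, based on the sample output in this page.
_ROW = re.compile(
    r"Concurrency:\s*(\d+),\s*throughput:\s*([\d.]+)\s*infer/sec,"
    r"\s*latency\s*(\d+)\s*usec"
)


def parse_summary(stdout: str) -> List[SweepRow]:
    """Extract one SweepRow per matching summary line; skip other lines."""
    rows = []
    for line in stdout.splitlines():
        match = _ROW.search(line)
        if match:
            rows.append(
                SweepRow(
                    concurrency=int(match.group(1)),
                    throughput=float(match.group(2)),
                    latency_usec=int(match.group(3)),
                )
            )
    return rows
```

Lines that do not match, such as the header line, are ignored rather than treated as errors, which keeps the parser tolerant of extra log output.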

Usage Examples

Example 1: Basic concurrency sweep

Run a baseline measurement for a model named densenet_onnx sweeping concurrency from 1 to 8:

perf_analyzer -m densenet_onnx --concurrency-range 1:8 --percentile=95

Example output (values are illustrative and vary with hardware, model, and server configuration):

Inferences/Second vs. Client p95 Batch Latency
Concurrency: 1, throughput: 265.147 infer/sec, latency 3820 usec
Concurrency: 2, throughput: 448.073 infer/sec, latency 4523 usec
Concurrency: 3, throughput: 532.890 infer/sec, latency 5735 usec
Concurrency: 4, throughput: 577.803 infer/sec, latency 7053 usec
Concurrency: 5, throughput: 585.205 infer/sec, latency 8698 usec
Concurrency: 6, throughput: 590.122 infer/sec, latency 10339 usec
Concurrency: 7, throughput: 591.050 infer/sec, latency 12035 usec
Concurrency: 8, throughput: 591.783 infer/sec, latency 13742 usec
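
In the sample above, throughput climbs steeply up to a concurrency of about 4 and then flattens while latency keeps growing. One way to locate that point is to look at the relative throughput gain between consecutive levels. The sketch below hard-codes the (concurrency, throughput) pairs from the sample output. The 5% threshold and the name `saturation_point` are illustrative assumptions, not something Perf Analyzer itself computes.

```python
from typing import List, Tuple

# (concurrency, throughput in infer/sec) copied from the sample output above.
SAMPLE: List[Tuple[int, float]] = [
    (1, 265.147), (2, 448.073), (3, 532.890), (4, 577.803),
    (5, 585.205), (6, 590.122), (7, 591.050), (8, 591.783),
]


def saturation_point(results: List[Tuple[int, float]], min_gain: float = 0.05) -> int:
    """Return the last concurrency level that still improved throughput
    by at least `min_gain` (a fraction) over the previous level.

    The threshold is an arbitrary illustrative choice.
    """
    if not results:
        raise ValueError("results must not be empty")
    best = results[0][0]
    for (_, prev_tput), (conc, tput) in zip(results, results[1:]):
        gain = (tput - prev_tput) / prev_tput
        if gain < min_gain:
            break  # further concurrency adds latency with little throughput
        best = conc
    return best
```

With the default threshold this returns 4 for the sample data: the step from 3 to 4 still adds about 8% throughput, while the step from 4 to 5 adds about 1%.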

Example 2: Custom input data and batch size

Benchmark with batch size 4 and custom input data over gRPC, sweeping concurrency from 1 to 16 in steps of 2:

perf_analyzer -m my_model \
  --concurrency-range 1:16:2 \
  --percentile=95 \
  -b 4 \
  --input-data input_data.json \
  -u localhost:8001 \
  -i grpc
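
The file passed to `--input-data` has to be prepared beforehand. The sketch below generates one with random values. It assumes a JSON layout with a top-level `data` list whose entries map each input name to an object with `content` (flattened values) and `shape`. That layout should be checked against the Perf Analyzer documentation for the version in use. The input name `INPUT0`, the shape, and both function names are hypothetical.

```python
import json
import random
from typing import Any, Dict, Sequence


def make_input_data(
    input_name: str, shape: Sequence[int], num_samples: int, seed: int = 0
) -> Dict[str, Any]:
    """Build an input-data document with random float content.

    Assumed layout: {"data": [{<input_name>: {"content": [...], "shape": [...]}}]}
    """
    rng = random.Random(seed)  # seeded so repeated runs produce the same file
    count = 1
    for dim in shape:
        count *= dim
    samples = []
    for _ in range(num_samples):
        content = [rng.random() for _ in range(count)]
        samples.append({input_name: {"content": content, "shape": list(shape)}})
    return {"data": samples}


def write_input_data(path: str, doc: Dict[str, Any]) -> None:
    """Serialize the document to a JSON file."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(doc, handle)
```

For example, `write_input_data("input_data.json", make_input_data("INPUT0", [3, 224, 224], 8))` would produce a file with eight samples. For realistic benchmarking, the random values should be replaced with representative data.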

Example 3: Variable-shape input model

Specify input shapes for a model with dynamic dimensions, repeating --shape once per input:

perf_analyzer -m bert_base \
  --concurrency-range 1:4 \
  --percentile=95 \
  --shape input_ids:1,128 \
  --shape attention_mask:1,128

Related Pages

Implements: Principle: Performance_Baseline -- implements::Principle:Triton_inference_server_Server_Performance_Baseline
Heuristic:Triton_inference_server_Server_Concurrency_Throughput_Rule
