Implementation:Triton inference server Server Perf Analyzer CLI
| Field | Value |
|---|---|
| Page Type | Implementation |
| Title | Perf_Analyzer_CLI |
| Namespace | Triton_inference_server_Server |
| Domains | Performance, Model_Serving, Benchmarking |
| External Dependencies | perf_analyzer from nvcr.io/nvidia/tritonserver:<version>-py3-sdk or triton-model-analyzer pip package |
| Last Updated | 2026-02-13 17:00 GMT |
Overview
Perf Analyzer is a command-line tool for measuring inference performance on Triton Inference Server across configurable concurrency sweeps. It is the standard benchmarking tool in the Triton ecosystem: it generates inference requests from synthetic or user-supplied data at controlled concurrency levels and reports throughput and latency statistics.
Description
Perf Analyzer connects to a running Triton Inference Server instance and sends inference requests at specified concurrency levels. It measures throughput (inferences per second) and latency percentiles (p50, p90, p95, p99) at each concurrency level, producing a tabular summary that characterizes the model's performance envelope.
The tool supports both HTTP and gRPC protocols, configurable input shapes and data, batch size specification, and measurement window control. It can operate in concurrency sweep mode (testing a range of concurrency values) or at a fixed concurrency level.
Key capabilities:
- Sweep concurrency from a start to an end value with configurable step size
- Report throughput and latency at each concurrency level
- Support custom input data via JSON files for realistic workloads
- Override input tensor shapes for models with variable-dimension inputs
- Configure the measurement interval and the stability threshold
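To make the sweep semantics concrete, the sketch below expands a start:end[:step] specification into the concurrency levels it would cover. The helper name expand_concurrency_range is hypothetical and not part of perf_analyzer. It assumes the range includes the end value, the step defaults to 1, and a bare value means a single fixed level; it does not model any other special cases the tool may handle.

```shell
# Print the concurrency levels covered by a start:end[:step] range.
# Assumptions: <end> is inclusive, <step> defaults to 1, and a bare
# value such as "4" means a single fixed concurrency level.
expand_concurrency_range() {
  # Split the argument on ':' and restore IFS afterwards.
  old_ifs=$IFS
  IFS=:
  set -- $1
  IFS=$old_ifs

  start=$1
  end=${2:-$1}
  step=${3:-1}

  level=$start
  out=""
  while [ "$level" -le "$end" ]; do
    out="${out:+$out }$level"
    level=$((level + step))
  done
  printf '%s\n' "$out"
}

expand_concurrency_range 1:16:2
```

Under these assumptions, a range of 1:16:2 covers the odd levels from 1 to 15, so the end value itself is reached only when it falls on a step boundary.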
Usage
CLI Signature
perf_analyzer -m <model_name> \
[--concurrency-range <start:end[:step]>] \
[--percentile=<N>] \
[-u <host:port>] \
[-i <http|grpc>] \
[-b <batch_size>] \
[--shape <input_name>:<d1>,<d2>,...] \
[--input-data <file>] \
[--measurement-interval <ms>] \
[--stability-percentage <pct>]
Key Parameters
| Parameter | Description | Default |
|---|---|---|
| -m | Model name to benchmark | (required) |
| --concurrency-range | Concurrency sweep range in format start:end[:step] | 1 |
| --percentile | Latency percentile to report (e.g., 95 for p95) | None (average) |
| -u | Server URL (host:port) | localhost:8000 |
| -i | Protocol (http or grpc) | http |
| -b | Batch size per request | 1 |
| --shape | Override input shape (format: input_name:d1,d2,...) | Model default |
| --input-data | Path to JSON file with input data | Random/zero data |
| --measurement-interval | Measurement window in milliseconds | 5000 |
| --stability-percentage | Throughput variation threshold for stability | 10 |
Code Reference
Source Location
- docs/user_guide/performance_tuning.md:L71-107 -- Primary usage instructions for Perf Analyzer in the performance tuning workflow
- docs/user_guide/performance_tuning.md:L285-300 -- Advanced Perf Analyzer options
- docs/user_guide/optimization.md:L54-66 -- Perf Analyzer usage in optimization context
Import / Installation
# Option 1: Use the Triton SDK container (recommended)
docker run --rm --net=host nvcr.io/nvidia/tritonserver:<version>-py3-sdk \
perf_analyzer -m <model_name> --concurrency-range 1:8
# Option 2: Install via pip (includes perf_analyzer)
pip install triton-model-analyzer
I/O Contract
Inputs
| Input | Type | Required | Description |
|---|---|---|---|
| Model name | String | Yes | Name of the model deployed on Triton (must match model repository directory name) |
| Running Triton server | Service | Yes | A Triton Inference Server instance serving the target model |
| Concurrency range | String | No | Start:end[:step] concurrency sweep specification |
| Input data file | File (JSON) | No | JSON file containing representative input data for realistic benchmarking |
| Input shape overrides | String | No | Tensor shape specifications for variable-dimension inputs |
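Because the benchmark needs a running server, it can help to confirm readiness before starting a sweep. This is a minimal sketch, assuming the server exposes its HTTP endpoint at the default localhost:8000 and serves the /v2/health/ready route. The function name triton_ready is hypothetical, and curl is assumed to be installed.

```shell
# Return 0 if the Triton HTTP endpoint reports ready, non-zero otherwise.
# Assumptions: HTTP endpoint at host:port (default localhost:8000) and
# the /v2/health/ready route; curl must be available on PATH.
triton_ready() {
  url="${1:-localhost:8000}"
  curl -sf -o /dev/null --max-time 2 "http://${url}/v2/health/ready"
}

if triton_ready localhost:8000; then
  echo "server ready"
else
  echo "server not ready"
fi
```

If the server is reached over gRPC only (for example on port 8001), this HTTP probe does not apply and a different readiness check is needed.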
Outputs
| Output | Type | Description |
|---|---|---|
| Throughput | Float (inferences/sec) | Number of inferences completed per second at each concurrency level |
| Latency (p95/p99) | Integer (microseconds) | Latency percentile values at each concurrency level |
| Concurrency-performance table | Text (stdout) | Tabular summary mapping concurrency to throughput and latency |
Usage Examples
Example 1: Basic concurrency sweep
Run a baseline measurement for a model named densenet_onnx sweeping concurrency from 1 to 8:
perf_analyzer -m densenet_onnx --concurrency-range 1:8 --percentile=95
Example output (actual numbers depend on the model and hardware):
Inferences/Second vs. Client p95 Batch Latency
Concurrency: 1, throughput: 265.147 infer/sec, latency 3820 usec
Concurrency: 2, throughput: 448.073 infer/sec, latency 4523 usec
Concurrency: 3, throughput: 532.890 infer/sec, latency 5735 usec
Concurrency: 4, throughput: 577.803 infer/sec, latency 7053 usec
Concurrency: 5, throughput: 585.205 infer/sec, latency 8698 usec
Concurrency: 6, throughput: 590.122 infer/sec, latency 10339 usec
Concurrency: 7, throughput: 591.050 infer/sec, latency 12035 usec
Concurrency: 8, throughput: 591.783 infer/sec, latency 13742 usec
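A summary like this can be post-processed to find where throughput stops improving. The sketch below reads the summary lines on stdin and prints the lowest concurrency whose throughput is within 5% of the sweep's peak. The helper name find_knee and the 5% threshold are illustrative assumptions, and the parser assumes the line format shown in the example, with no leading whitespace.

```shell
# Print the lowest concurrency whose throughput is within 5% of the peak.
# Input (stdin): lines of the form
#   Concurrency: N, throughput: X infer/sec, latency Y usec
# The 5% threshold is an arbitrary choice for illustration.
find_knee() {
  awk -F'[:, ]+' '
    /^Concurrency:/ {
      n++
      conc[n] = $2 + 0      # concurrency level
      tput[n] = $4 + 0      # throughput in infer/sec
      if (tput[n] > peak) peak = tput[n]
    }
    END {
      for (i = 1; i <= n; i++)
        if (tput[i] >= 0.95 * peak) { print conc[i]; exit }
    }'
}

# Usage: perf_analyzer -m densenet_onnx --concurrency-range 1:8 --percentile=95 | find_knee
```

For the sample output above, the peak is 591.783 infer/sec, and concurrency 4 (577.803 infer/sec) is the first level within 5% of it. Beyond that point latency keeps rising while throughput barely changes.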
Example 2: Custom input data and batch size
Run a baseline with batch size 4 and custom input data over gRPC (port 8001), sweeping concurrency from 1 to 16 in steps of 2:
perf_analyzer -m my_model \
--concurrency-range 1:16:2 \
--percentile=95 \
-b 4 \
--input-data input_data.json \
-u localhost:8001 \
-i grpc
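The input_data.json file referenced above could be produced as follows. This is a sketch of a commonly used layout: a top-level "data" array with one object per request, keyed by input tensor name. The tensor name INPUT0 and the values are placeholders, and the helper name write_sample_input_data is hypothetical. The names, shapes, and datatypes must match the target model's configuration.

```shell
# Write a small placeholder input-data file for perf_analyzer.
# Assumptions: the model has a single input named INPUT0 that accepts
# four floating-point values per request; adjust to the real model.
write_sample_input_data() {
  cat > "$1" <<'EOF'
{
  "data": [
    { "INPUT0": [0.1, 0.2, 0.3, 0.4] },
    { "INPUT0": [0.5, 0.6, 0.7, 0.8] }
  ]
}
EOF
}

write_sample_input_data input_data.json
```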
Example 3: Variable-shape input model
Specify input shape for a model with dynamic dimensions:
perf_analyzer -m bert_base \
--concurrency-range 1:4 \
--percentile=95 \
--shape input_ids:1,128 \
--shape attention_mask:1,128
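To compare several sequence lengths, the same invocation can be generated once per length. The sketch below only prints the commands as a dry run and executes nothing. The helper name print_shape_sweep is hypothetical, and it assumes both inputs share the same trailing dimension, as in the example above.

```shell
# Print one perf_analyzer command per sequence length (dry run only).
# Assumption: input_ids and attention_mask use the same shape 1,<seq_len>.
print_shape_sweep() {
  model="$1"
  shift
  for seq_len in "$@"; do
    printf 'perf_analyzer -m %s --concurrency-range 1:4 --percentile=95 --shape input_ids:1,%s --shape attention_mask:1,%s\n' \
      "$model" "$seq_len" "$seq_len"
  done
}

print_shape_sweep bert_base 64 128 256
```

Printing the commands first makes it easy to review the shapes before running them against a live server.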
Related Pages
- Implements: Principle: Performance_Baseline -- implements::Principle:Triton_inference_server_Server_Performance_Baseline
- Heuristic:Triton_inference_server_Server_Concurrency_Throughput_Rule