Implementation:Triton inference server Server Perf Analyzer CLI

From Leeroopedia
Revision as of 13:59, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/Triton_inference_server_Server_Perf_Analyzer_CLI.md)
Field | Value
Page Type | Implementation
Title | Perf_Analyzer_CLI
Namespace | Triton_inference_server_Server
Domains | Performance, Model_Serving, Benchmarking
External Dependencies | perf_analyzer from nvcr.io/nvidia/tritonserver:<version>-py3-sdk or triton-model-analyzer pip package
Last Updated | 2026-02-13 17:00 GMT

Overview

Perf Analyzer is a command-line tool for measuring inference performance on Triton Inference Server, with configurable concurrency sweeps. It is the standard benchmarking tool in the Triton ecosystem: it generates inference requests from synthetic or user-supplied data at controlled concurrency levels and reports throughput and latency statistics.

Description

Perf Analyzer connects to a running Triton Inference Server instance and sends inference requests at specified concurrency levels. It measures throughput (inferences per second) and latency percentiles (p50, p90, p95, p99) at each concurrency level, producing a tabular summary that characterizes the model's performance envelope.

The tool supports both HTTP and gRPC protocols, configurable input shapes and data, batch size specification, and measurement window control. It can operate in concurrency sweep mode (testing a range of concurrency values) or at a fixed concurrency level.

Key capabilities:

  • Sweep concurrency from a start to an end value with configurable step size
  • Report throughput and latency at each concurrency level
  • Support custom input data via JSON files for realistic workloads
  • Override input tensor shapes for models with variable-dimension inputs
  • Configure the measurement interval and the stability threshold

Usage

CLI Signature

perf_analyzer -m <model_name> \
  [--concurrency-range <start:end[:step]>] \
  [--percentile=<N>] \
  [-u <host:port>] \
  [-i <http|grpc>] \
  [-b <batch_size>] \
  [--shape <input_name>:<d1>,<d2>,...] \
  [--input-data <file>] \
  [--measurement-interval <ms>] \
  [--stability-percentage <pct>]

Key Parameters

Parameter | Description | Default
-m | Model name to benchmark | (required)
--concurrency-range | Concurrency sweep range in the format start:end[:step] | 1
--percentile | Latency percentile to report (e.g., 95 for p95) | None (average latency is reported)
-u | Server URL (host:port) | localhost:8000 for HTTP, localhost:8001 for gRPC
-i | Protocol (http or grpc) | http
-b | Batch size per request | 1
--shape | Override input shape (format: input_name:d1,d2,...) | Model default
--input-data | Path to JSON file with input data | Random/zero data
--measurement-interval | Measurement window in milliseconds | 5000
--stability-percentage | Maximum variation (in percent) allowed between measurements for results to count as stable | 10
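
To show how the parameters in the table combine, here is a minimal sketch of a wrapper that assembles a command line and prints it without running anything. The function name build_perf_cmd and its argument order are hypothetical; only the flags come from the table above.

```shell
# Hypothetical helper: prints a perf_analyzer command line, does not execute it.
# Arguments: $1 = model name, $2 = concurrency range (default 1),
#            $3 = latency percentile (optional).
build_perf_cmd() {
  local model="$1" range="${2:-1}" percentile="${3:-}"
  local cmd="perf_analyzer -m ${model} --concurrency-range ${range}"
  # Without --percentile, average latency is reported (see the table above).
  if [ -n "${percentile}" ]; then
    cmd="${cmd} --percentile=${percentile}"
  fi
  printf '%s\n' "${cmd}"
}

build_perf_cmd densenet_onnx 1:8 95
```

Printing the command first makes it easy to review a sweep before pointing it at a live server.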

Code Reference

Source Location

  • docs/user_guide/performance_tuning.md:L71-107 -- Primary usage instructions for Perf Analyzer in the performance tuning workflow
  • docs/user_guide/performance_tuning.md:L285-300 -- Advanced Perf Analyzer options
  • docs/user_guide/optimization.md:L54-66 -- Perf Analyzer usage in optimization context

Import / Installation

# Option 1: Use the Triton SDK container (recommended)
docker run --rm --net=host nvcr.io/nvidia/tritonserver:<version>-py3-sdk \
  perf_analyzer -m <model_name> --concurrency-range 1:8

# Option 2: Install via pip (includes perf_analyzer)
pip install triton-model-analyzer

I/O Contract

Inputs

Input | Type | Required | Description
Model name | String | Yes | Name of the model deployed on Triton (must match model repository directory name)
Running Triton server | Service | Yes | A Triton Inference Server instance serving the target model
Concurrency range | String | No | Start:end[:step] concurrency sweep specification
Input data file | File (JSON) | No | JSON file containing representative input data for realistic benchmarking
Input shape overrides | String | No | Tensor shape specifications for variable-dimension inputs
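
As an illustration of the input data file, the sketch below writes a small JSON file in the "data" list layout that --input-data accepts, with one entry holding "content" and "shape" per input tensor. The tensor names, token values, and shapes are made up for illustration and must be replaced with those of the actual model; the INPUT_FILE variable is likewise an assumption.

```shell
# Path is illustrative; override INPUT_FILE to write somewhere else.
INPUT_FILE="${INPUT_FILE:-input_data.json}"

# One request's worth of data. Tensor names and values are hypothetical
# and must match the inputs declared by the deployed model.
cat > "${INPUT_FILE}" <<'EOF'
{
  "data": [
    {
      "input_ids": { "content": [101, 2023, 2003, 102], "shape": [4] },
      "attention_mask": { "content": [1, 1, 1, 1], "shape": [4] }
    }
  ]
}
EOF

echo "wrote ${INPUT_FILE}"
```

The file can then be passed as --input-data input_data.json, as in Example 2 below.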

Outputs

Output | Type | Description
Throughput | Float (inferences/sec) | Number of inferences completed per second at each concurrency level
Latency (p95/p99) | Integer (microseconds) | Latency percentile values at each concurrency level
Concurrency-performance table | Text (stdout) | Tabular summary mapping concurrency to throughput and latency
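
The throughput and latency columns are linked: with a fixed number of requests in flight, Little's law gives throughput ≈ concurrency / mean latency. The sketch below applies that relation as a rough sanity check, assuming batch size 1. The helper name estimate_throughput is hypothetical, and because the reported figure is a percentile rather than the mean, the estimate is only approximate.

```shell
# Hypothetical helper: estimate throughput (infer/sec) from a concurrency
# level and a latency in microseconds, via throughput = concurrency / latency.
estimate_throughput() {
  awk -v c="$1" -v lat_us="$2" 'BEGIN { printf "%.1f\n", c * 1000000 / lat_us }'
}

# Concurrency 4 with a p95 latency of 7053 usec (figures from Example 1).
estimate_throughput 4 7053   # prints 567.1
```

Example 1 reports 577.803 infer/sec at that level. The estimate comes out slightly lower because the p95 latency is higher than the mean latency.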

Usage Examples

Example 1: Basic concurrency sweep

Run a baseline measurement for a model named densenet_onnx sweeping concurrency from 1 to 8:

perf_analyzer -m densenet_onnx --concurrency-range 1:8 --percentile=95

Example output (exact figures depend on the hardware and model):

Inferences/Second vs. Client p95 Batch Latency
Concurrency: 1, throughput: 265.147 infer/sec, latency 3820 usec
Concurrency: 2, throughput: 448.073 infer/sec, latency 4523 usec
Concurrency: 3, throughput: 532.890 infer/sec, latency 5735 usec
Concurrency: 4, throughput: 577.803 infer/sec, latency 7053 usec
Concurrency: 5, throughput: 585.205 infer/sec, latency 8698 usec
Concurrency: 6, throughput: 590.122 infer/sec, latency 10339 usec
Concurrency: 7, throughput: 591.050 infer/sec, latency 12035 usec
Concurrency: 8, throughput: 591.783 infer/sec, latency 13742 usec
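
A summary like the one above can be post-processed to find where throughput stops scaling. The following is a minimal sketch, assuming the "Concurrency: N, throughput: X infer/sec, latency Y usec" line format shown here; the helper names perf_to_csv and first_plateau and the 5% threshold are illustrative choices, not part of Perf Analyzer.

```shell
# Convert summary lines to "concurrency,throughput,latency_usec" rows.
perf_to_csv() {
  awk '/^Concurrency:/ {
    gsub(",", "", $2)            # strip the trailing comma: "1," -> "1"
    print $2 "," $4 "," $7
  }'
}

# Print the first concurrency whose throughput gain over the previous
# level is below $1 percent.
first_plateau() {
  awk -F, -v min="$1" '
    NR > 1 && ($2 - prev) / prev * 100 < min { print $1; exit }
    { prev = $2 }'
}

# First five lines of the example output above.
sample='Concurrency: 1, throughput: 265.147 infer/sec, latency 3820 usec
Concurrency: 2, throughput: 448.073 infer/sec, latency 4523 usec
Concurrency: 3, throughput: 532.890 infer/sec, latency 5735 usec
Concurrency: 4, throughput: 577.803 infer/sec, latency 7053 usec
Concurrency: 5, throughput: 585.205 infer/sec, latency 8698 usec'

printf '%s\n' "$sample" | perf_to_csv | first_plateau 5   # prints 5
```

Here the step from concurrency 4 to 5 adds about 1.3% throughput while latency keeps rising, which suggests the model saturates at around 4 concurrent requests on this setup.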

Example 2: Custom input data and batch size

Benchmark with batch size 4 and custom input data over gRPC, sweeping concurrency from 1 to 16 in steps of 2:

perf_analyzer -m my_model \
  --concurrency-range 1:16:2 \
  --percentile=95 \
  -b 4 \
  --input-data input_data.json \
  -u localhost:8001 \
  -i grpc
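
To see which concurrency levels a start:end[:step] range covers, the sketch below expands a range with seq. It assumes the sweep visits start, start+step, and so on up to end inclusive, and that a missing end means a single level. The helper name expand_range is hypothetical.

```shell
# Hypothetical helper: list the concurrency levels implied by "start:end[:step]".
expand_range() {
  # Split the argument on colons; end defaults to start, step defaults to 1.
  local IFS=:
  set -- $1
  seq "$1" "${3:-1}" "${2:-$1}" | paste -sd' ' -
}

expand_range 1:16:2   # prints 1 3 5 7 9 11 13 15
expand_range 1:8      # prints 1 2 3 4 5 6 7 8
```

Under that assumption, the range 1:16:2 used above never tests concurrency 16 itself, because the last level at or below the end value is 15.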

Example 3: Variable-shape input model

Specify input shapes for a model with dynamic dimensions, using one --shape flag per input:

perf_analyzer -m bert_base \
  --concurrency-range 1:4 \
  --percentile=95 \
  --shape input_ids:1,128 \
  --shape attention_mask:1,128
