Implementation:Triton inference server Server Perf Analyzer Verification

Field	Value
Page Type	Implementation
Title	Perf_Analyzer_Verification
Namespace	Triton_inference_server_Server
Domains	Performance, Quality_Assurance
External Dependencies	perf_analyzer from nvcr.io/nvidia/tritonserver:<version>-py3-sdk or triton-model-analyzer pip package; curl for Prometheus metrics endpoint
Last Updated	2026-02-13 17:00 GMT

Overview

A concrete re-benchmarking procedure that uses Perf Analyzer and Prometheus metrics to verify optimization gains. It covers both the benchmark comparison (re-running Perf Analyzer with parameters identical to the baseline) and runtime metrics collection (querying the Prometheus endpoint), and together these confirm whether the applied optimizations yield a measurable improvement.

Description

Performance verification consists of two complementary procedures:

1. Benchmark Re-run: Execute the same Perf Analyzer command used during baselining against the server running with the optimized configuration. Keeping the parameters identical means any difference in the results can be attributed to the configuration change rather than to the benchmark setup.

2. Prometheus Metrics Collection: Query the Triton metrics endpoint to collect runtime performance counters, GPU utilization, and memory usage data that complement the point-in-time benchmark.

The combination of these two data sources provides a comprehensive view of optimization impact: Perf Analyzer gives controlled, reproducible throughput and latency measurements, while Prometheus metrics show resource utilization and operational health.

Usage

CLI Signature (Perf Analyzer Re-run)

# Run identical benchmark as baseline (same parameters)
perf_analyzer -m <model_name> \
  --concurrency-range <start:end> \
  --percentile=95 \
  [-u <host:port>] \
  [-b <batch_size>] \
  [--shape <input_name>:<d1>,<d2>,...] \
  [--input-data <file>]

CLI Signature (Prometheus Metrics)

# Query Triton Prometheus metrics endpoint
curl localhost:8002/metrics

Key Parameters

Perf Analyzer (same as baseline)

Parameter	Description	Default
`-m`	Model name (must match baseline)	(required)
`--concurrency-range`	Concurrency sweep range (must match baseline)	1
`--percentile`	Latency percentile (must match baseline)	None
`-u`	Server URL	localhost:8000
`-b`	Batch size (must match baseline)	1

Triton Metrics Endpoint

Parameter	Description	Default
`--allow-metrics`	Enable/disable metrics endpoint (Triton server flag)	true
`--metrics-port`	Port for metrics HTTP endpoint (Triton server flag)	8002
`--metrics-interval-ms`	Metrics collection interval in milliseconds (Triton server flag)	2000

Key Prometheus Metrics

Metric Name	Type	Description
`nv_inference_request_success`	Counter	Total number of successful inference requests
`nv_inference_count`	Counter	Total number of inferences performed (accounts for batch size)
`nv_inference_request_duration_us`	Counter	Cumulative end-to-end request duration in microseconds
`nv_gpu_utilization`	Gauge	GPU compute utilization (0.0 to 1.0)
`nv_gpu_memory_used_bytes`	Gauge	GPU memory currently in use in bytes
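
Because the inference metrics above are cumulative counters, a single scrape only shows totals since server start. A minimal sketch of turning two scrapes into an average request latency follows. The function name `avg_request_latency_us` is hypothetical, and the calculation assumes that failed requests are negligible, so it should be read as an approximation rather than an official Triton formula.

```shell
# Average end-to-end request latency (usec) between two scrapes.
# Hypothetical helper; arguments are raw counter values read from /metrics.
avg_request_latency_us() {
  # $1 = nv_inference_request_duration_us at first scrape
  # $2 = nv_inference_request_duration_us at second scrape
  # $3 = nv_inference_request_success at first scrape
  # $4 = nv_inference_request_success at second scrape
  awk -v d0="$1" -v d1="$2" -v n0="$3" -v n1="$4" 'BEGIN {
    if (n1 - n0 <= 0) {
      # No requests completed between the scrapes, so no average exists.
      print "no new requests between scrapes" > "/dev/stderr"
      exit 1
    }
    # Change in cumulative duration divided by change in request count.
    printf "%.1f\n", (d1 - d0) / (n1 - n0)
  }'
}
```

Taking one scrape immediately before the Perf Analyzer run and one immediately after keeps unrelated traffic out of the average.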

Code Reference

Source Location

docs/user_guide/performance_tuning.md:L123-125 -- Verification step in the performance tuning workflow
docs/user_guide/performance_tuning.md:L383-393 -- Post-optimization verification instructions
docs/user_guide/metrics.md:L28-351 -- Triton Prometheus metrics reference

Import / Installation

# Perf Analyzer (same installation as baseline)
# Option 1: Triton SDK container
docker run --rm --net=host nvcr.io/nvidia/tritonserver:<version>-py3-sdk \
  perf_analyzer -m <model_name> --concurrency-range 1:8 --percentile=95

# Option 2: pip install
pip install triton-model-analyzer

# Prometheus metrics: no additional installation needed (built into Triton server)
# curl is used to query the metrics endpoint

I/O Contract

Inputs

Input	Type	Required	Description
Running server with optimized config	Service	Yes	Triton Inference Server running with the optimized model configuration applied
Baseline results	Data	Yes	Throughput and latency measurements from the baseline step for comparison
Model name	String	Yes	Name of the model being verified (must match baseline)
Concurrency range	String	Yes	Same concurrency range used in baseline measurement
Input data	File (JSON)	No	Same input data used in baseline measurement (if applicable)

Outputs

Output	Type	Description
Optimized throughput	Float (inferences/sec)	Throughput at each concurrency level with the optimized configuration
Optimized latency (p95/p99)	Integer (microseconds)	Latency percentiles at each concurrency level with the optimized configuration
Throughput improvement	Percentage	Relative throughput gain vs. baseline: (optimized - baseline) / baseline × 100
Latency reduction	Percentage	Relative latency reduction vs. baseline: (baseline - optimized) / baseline × 100
Prometheus metrics	Text (Prometheus format)	Runtime metrics including GPU utilization and memory usage
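
The two percentage outputs can be computed with a pair of small helpers. This is a sketch only: the function names are hypothetical, and the sample call uses the concurrency-8 throughput figures quoted in the examples on this page.

```shell
# Percentage change in throughput relative to baseline (positive = faster).
throughput_improvement_pct() {
  # $1 = baseline infer/sec, $2 = optimized infer/sec
  awk -v base="$1" -v opt="$2" \
    'BEGIN { printf "%+.1f%%\n", (opt - base) / base * 100 }'
}

# Percentage reduction in latency relative to baseline (positive = lower latency).
latency_reduction_pct() {
  # $1 = baseline latency in usec, $2 = optimized latency in usec
  awk -v base="$1" -v opt="$2" \
    'BEGIN { printf "%+.1f%%\n", (base - opt) / base * 100 }'
}

# Concurrency 8: baseline 591.783 infer/sec, optimized 892.783 infer/sec
throughput_improvement_pct 591.783 892.783   # prints +50.9%
```

The sign convention differs between the two helpers on purpose, so that a positive number always means the optimized configuration is better.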

Usage Examples

Example 1: Re-run baseline benchmark with optimized config

Execute the same Perf Analyzer command used for baselining:

# Baseline command (previously run, results saved)
# perf_analyzer -m densenet_onnx --concurrency-range 1:8 --percentile=95

# Verification command (identical parameters, server now has optimized config)
perf_analyzer -m densenet_onnx --concurrency-range 1:8 --percentile=95

Example output (optimized; exact figures vary with hardware and model):

Inferences/Second vs. Client p95 Batch Latency
Concurrency: 1, throughput: 312.458 infer/sec, latency 3245 usec
Concurrency: 2, throughput: 589.721 infer/sec, latency 3438 usec
Concurrency: 3, throughput: 783.502 infer/sec, latency 3891 usec
Concurrency: 4, throughput: 872.104 infer/sec, latency 4652 usec
Concurrency: 5, throughput: 889.337 infer/sec, latency 5710 usec
Concurrency: 6, throughput: 891.205 infer/sec, latency 6834 usec
Concurrency: 7, throughput: 892.410 infer/sec, latency 7964 usec
Concurrency: 8, throughput: 892.783 infer/sec, latency 9105 usec

Comparison with baseline:

Concurrency | Baseline (infer/sec) | Optimized (infer/sec) | Improvement
------------|----------------------|-----------------------|------------
1           | 265.147              | 312.458               | +17.8%
4           | 577.803              | 872.104               | +50.9%
8           | 591.783              | 892.783               | +50.9%
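
A comparison like the one above can be generated from saved Perf Analyzer output instead of being assembled by hand. The sketch below assumes the summary lines have the `Concurrency: N, throughput: X infer/sec, latency Y usec` shape shown in the example output, and that each run's console output was saved to a file. The function names and file names are hypothetical.

```shell
# Convert Perf Analyzer summary lines on stdin to CSV rows:
# concurrency,throughput,latency_usec. Non-matching lines are dropped.
pa_summary_to_csv() {
  sed -n 's/^Concurrency: \([0-9][0-9]*\), throughput: \([0-9.][0-9.]*\) infer\/sec, latency \([0-9][0-9]*\) usec.*$/\1,\2,\3/p'
}

# Join two CSV files on concurrency and print the throughput change.
# Output rows: concurrency,baseline,optimized,improvement
compare_throughput() {
  # $1 = baseline CSV, $2 = optimized CSV (both from pa_summary_to_csv)
  awk -F, '
    NR == FNR { base[$1] = $2; next }   # first file: remember baseline throughput
    ($1 in base) {                      # second file: compare matching levels only
      printf "%s,%s,%s,%+.1f%%\n", $1, base[$1], $2, ($2 - base[$1]) / base[$1] * 100
    }
  ' "$1" "$2"
}

# Usage (file names are placeholders):
#   pa_summary_to_csv < baseline.log  > baseline.csv
#   pa_summary_to_csv < optimized.log > optimized.csv
#   compare_throughput baseline.csv optimized.csv
```

Concurrency levels present in only one of the two files are skipped, which is a reminder that both runs should use the same `--concurrency-range`.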

Example 2: Collect Prometheus metrics

Query the Triton metrics endpoint for runtime performance data:

# Fetch all metrics
curl -s localhost:8002/metrics

# Filter for inference-specific metrics
curl -s localhost:8002/metrics | grep "nv_inference"

# Filter for GPU metrics
curl -s localhost:8002/metrics | grep "nv_gpu"

Example output (excerpt combining both filters; blank lines added for readability):

# HELP nv_inference_request_success Number of successful inference requests
# TYPE nv_inference_request_success counter
nv_inference_request_success{model="densenet_onnx",version="1"} 48523

# HELP nv_inference_count Number of inferences performed
# TYPE nv_inference_count counter
nv_inference_count{model="densenet_onnx",version="1"} 194092

# HELP nv_gpu_utilization GPU utilization rate (0.0 - 1.0)
# TYPE nv_gpu_utilization gauge
nv_gpu_utilization{gpu_uuid="GPU-abc123"} 0.847

# HELP nv_gpu_memory_used_bytes GPU memory used in bytes
# TYPE nv_gpu_memory_used_bytes gauge
nv_gpu_memory_used_bytes{gpu_uuid="GPU-abc123"} 2147483648
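
Individual values can be pulled out of a saved scrape for further arithmetic. The sketch below assumes the sample lines look like the excerpt above (metric name, label set in braces, then the value as the last field). The helper names `metric_value` and `inferences_per_request` are hypothetical.

```shell
# Print the value of one per-model metric from Prometheus text on stdin.
metric_value() {
  # $1 = metric name, $2 = model name
  awk -v name="$1" -v model="$2" '
    BEGIN { label = "model=\"" model "\"" }
    /^#/ { next }                        # skip HELP and TYPE comment lines
    index($0, name "{") == 1 && index($0, label) > 0 {
      print $NF                          # value is the last field on the line
      exit
    }
  '
}

# Average number of inferences carried by each successful request.
inferences_per_request() {
  # $1 = file holding a metrics scrape, $2 = model name
  count=$(metric_value nv_inference_count "$2" < "$1")
  success=$(metric_value nv_inference_request_success "$2" < "$1")
  awk -v c="$count" -v s="$success" 'BEGIN {
    if (s + 0 == 0) exit 1               # metric missing or no requests yet
    printf "%.2f\n", c / s
  }'
}

# Usage (file name is a placeholder):
#   curl -s localhost:8002/metrics > scrape.txt
#   inferences_per_request scrape.txt densenet_onnx
```

With the counter values in the excerpt (194092 inferences over 48523 requests) this ratio works out to 4.00, meaning each request carried four inferences on average.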

Example 3: Complete verification workflow

Full verification workflow combining Perf Analyzer re-run and Prometheus metrics:

# Step 1: Run verification benchmark
echo "=== Running verification benchmark ==="
perf_analyzer -m densenet_onnx --concurrency-range 1:8 --percentile=95

# Step 2: Collect Prometheus metrics
echo "=== Collecting Prometheus metrics ==="
curl -s localhost:8002/metrics | grep -E "nv_inference|nv_gpu"

# Step 3: Compare with baseline (manual comparison or scripted)
echo "=== Compare results with baseline ==="
echo "Baseline throughput at concurrency 8: 591.783 infer/sec"
echo "Optimized throughput at concurrency 8: 892.783 infer/sec"
echo "Improvement: +50.9%"
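
Step 3 can also be scripted as a pass/fail check, for instance in a CI job. This is a sketch under assumptions: `verify_improvement` is a hypothetical helper, and the 10 percent threshold in the sample call is an arbitrary illustration rather than a recommended value.

```shell
# Succeed only if optimized throughput beats baseline by at least min percent.
verify_improvement() {
  # $1 = baseline infer/sec, $2 = optimized infer/sec
  # $3 = required improvement in percent (hypothetical threshold)
  awk -v base="$1" -v opt="$2" -v min="$3" 'BEGIN {
    pct = (opt - base) / base * 100
    verdict = "PASS"
    if (pct < min) verdict = "FAIL"
    printf "%s: %+.1f%% (required: %+.1f%%)\n", verdict, pct, min
    if (pct < min) exit 1              # non-zero status lets callers gate on it
  }'
}

# Concurrency-8 figures from the workflow above, with a 10% threshold.
verify_improvement 591.783 892.783 10
```

The exit status makes the check usable in a pipeline, for example `verify_improvement "$base" "$opt" 10 || exit 1`.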

Related Pages

Implements: Principle: Performance_Verification -- implements::Principle:Triton_inference_server_Server_Performance_Verification
