Implementation:Triton inference server Server Perf Analyzer Verification

From Leeroopedia
Field | Value
------|------
Page Type | Implementation
Title | Perf_Analyzer_Verification
Namespace | Triton_inference_server_Server
Domains | Performance, Quality_Assurance
External Dependencies | perf_analyzer from nvcr.io/nvidia/tritonserver:<version>-py3-sdk or triton-model-analyzer pip package; curl for Prometheus metrics endpoint
Last Updated | 2026-02-13 17:00 GMT

Overview

A concrete re-benchmarking procedure that uses Perf Analyzer and Triton's Prometheus metrics to verify optimization gains. It covers two steps: the benchmark comparison (re-running Perf Analyzer with the same parameters as the baseline) and runtime metrics collection (querying the Prometheus endpoint). Together they confirm whether the applied optimizations yield a measurable improvement.

Description

Performance verification consists of two complementary procedures:

1. Benchmark Re-run: Execute the same Perf Analyzer command used during baselining against the server running with the optimized configuration. Keeping every parameter identical ensures that any difference in the results comes from the configuration change and not from the benchmark itself.

2. Prometheus Metrics Collection: Query the Triton metrics endpoint to collect runtime performance counters, GPU utilization, and memory usage data that complement the point-in-time benchmark.

The combination of these two data sources provides a comprehensive view of optimization impact: Perf Analyzer gives controlled, reproducible throughput and latency measurements, while Prometheus metrics show resource utilization and operational health.

Usage

CLI Signature (Perf Analyzer Re-run)

# Run identical benchmark as baseline (same parameters)
perf_analyzer -m <model_name> \
  --concurrency-range <start:end> \
  --percentile=95 \
  [-u <host:port>] \
  [-b <batch_size>] \
  [--shape <input_name>:<d1>,<d2>,...] \
  [--input-data <file>]
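
One way to keep the baseline and verification runs from drifting apart is to pin the parameters in a single wrapper. The sketch below is illustrative only: `run_benchmark` and the `PERF_ANALYZER_BIN` variable are hypothetical names introduced here, and the concurrency range and percentile are the values used in the examples on this page. The `-f` flag asks Perf Analyzer to write its results to a CSV file.

```shell
# Hypothetical wrapper (not part of Triton): one place for all benchmark flags.
# PERF_ANALYZER_BIN is an assumption of this sketch; it defaults to perf_analyzer.
PERF_ANALYZER_BIN="${PERF_ANALYZER_BIN:-perf_analyzer}"

run_benchmark() {
  local model="$1"   # model name, e.g. densenet_onnx
  local label="$2"   # "baseline" or "optimized"; only used in the CSV file name
  # Same flags for every run; -f exports the results as CSV.
  "$PERF_ANALYZER_BIN" -m "$model" \
    --concurrency-range 1:8 \
    --percentile=95 \
    -f "${label}_${model}.csv"
}
```

Calling `run_benchmark densenet_onnx baseline` before the change and `run_benchmark densenet_onnx optimized` after it would then produce two CSV files generated with identical flags.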

CLI Signature (Prometheus Metrics)

# Query Triton Prometheus metrics endpoint
curl localhost:8002/metrics

Key Parameters

Perf Analyzer (same as baseline)

Parameter | Description | Default
----------|-------------|--------
-m | Model name (must match baseline) | (required)
--concurrency-range | Concurrency sweep range (must match baseline) | 1
--percentile | Latency percentile (must match baseline) | None (average latency is used)
-u | Server URL | localhost:8000 (HTTP)
-b | Batch size (must match baseline) | 1

Triton Metrics Endpoint

Parameter | Description | Default
----------|-------------|--------
--allow-metrics | Enable/disable metrics endpoint (Triton server flag) | true
--metrics-port | Port for metrics HTTP endpoint (Triton server flag) | 8002
--metrics-interval-ms | Metrics collection interval in milliseconds (Triton server flag) | 2000

Key Prometheus Metrics

Metric Name | Type | Description
------------|------|------------
nv_inference_request_success | Counter | Total number of successful inference requests
nv_inference_count | Counter | Total number of inferences performed (accounts for batch size)
nv_inference_request_duration_us | Counter | Cumulative end-to-end request duration in microseconds
nv_gpu_utilization | Gauge | GPU compute utilization (0.0 to 1.0)
nv_gpu_memory_used_bytes | Gauge | GPU memory currently in use in bytes
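
Because the inference metrics are cumulative counters, a single scrape reflects everything since server start. A rough average request latency for one benchmark window can be estimated by scraping before and after the run and dividing the change in `nv_inference_request_duration_us` by the change in `nv_inference_request_success`. The sketch below assumes the two scrapes were saved to files; `metric_value` and `avg_request_latency_us` are hypothetical helper names, and values are summed across all versions of the model.

```shell
# metric_value FILE METRIC MODEL: sum a counter over all versions of one model.
metric_value() {
  awk -v name="$2" -v model="$3" '
    index($0, name "{") == 1 && index($0, "model=\"" model "\"") { sum += $NF }
    END { printf "%.0f\n", sum }
  ' "$1"
}

# avg_request_latency_us BEFORE AFTER MODEL: delta(duration) / delta(success).
avg_request_latency_us() {
  local before="$1" after="$2" model="$3"
  local d0 d1 s0 s1
  d0="$(metric_value "$before" nv_inference_request_duration_us "$model")"
  d1="$(metric_value "$after"  nv_inference_request_duration_us "$model")"
  s0="$(metric_value "$before" nv_inference_request_success "$model")"
  s1="$(metric_value "$after"  nv_inference_request_success "$model")"
  awk -v d0="$d0" -v d1="$d1" -v s0="$s0" -v s1="$s1" 'BEGIN {
    requests = s1 - s0
    if (requests <= 0) { print "n/a"; exit }   # no new requests in the window
    printf "%.1f\n", (d1 - d0) / requests
  }'
}
```

In practice the two files would come from `curl -s localhost:8002/metrics > before.prom` run before the benchmark and the same command writing `after.prom` once it finishes.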

Code Reference

Source Location

  • docs/user_guide/performance_tuning.md:L123-125 -- Verification step in the performance tuning workflow
  • docs/user_guide/performance_tuning.md:L383-393 -- Post-optimization verification instructions
  • docs/user_guide/metrics.md:L28-351 -- Triton Prometheus metrics reference

Import / Installation

# Perf Analyzer (same installation as baseline)
# Option 1: Triton SDK container
docker run --rm --net=host nvcr.io/nvidia/tritonserver:<version>-py3-sdk \
  perf_analyzer -m <model_name> --concurrency-range 1:8 --percentile=95

# Option 2: pip install
pip install triton-model-analyzer

# Prometheus metrics: no additional installation needed (built into Triton server)
# curl is used to query the metrics endpoint

I/O Contract

Inputs

Input | Type | Required | Description
------|------|----------|------------
Running server with optimized config | Service | Yes | Triton Inference Server running with the optimized model configuration applied
Baseline results | Data | Yes | Throughput and latency measurements from the baseline step for comparison
Model name | String | Yes | Name of the model being verified (must match baseline)
Concurrency range | String | Yes | Same concurrency range used in baseline measurement
Input data | File (JSON) | No | Same input data used in baseline measurement (if applicable)

Outputs

Output | Type | Description
-------|------|------------
Optimized throughput | Float (inferences/sec) | Throughput at each concurrency level with the optimized configuration
Optimized latency (p95/p99) | Integer (microseconds) | Latency percentiles at each concurrency level with the optimized configuration
Throughput improvement | Percentage | Computed improvement ratio vs baseline
Latency reduction | Percentage | Computed latency reduction vs baseline
Prometheus metrics | Text (Prometheus format) | Runtime metrics including GPU utilization and memory usage
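
The two computed outputs are plain relative changes against the baseline: throughput improvement is (optimized - baseline) / baseline, and latency reduction is (baseline - optimized) / baseline. A minimal sketch of both, using hypothetical function names and `awk` for the floating-point arithmetic:

```shell
# throughput_improvement_pct BASELINE OPTIMIZED: signed gain in percent.
throughput_improvement_pct() {
  awk -v b="$1" -v o="$2" 'BEGIN {
    if (b <= 0) { print "n/a"; exit }          # avoid dividing by zero
    printf "%+.1f%%\n", (o - b) / b * 100
  }'
}

# latency_reduction_pct BASELINE OPTIMIZED: positive when latency went down.
latency_reduction_pct() {
  awk -v b="$1" -v o="$2" 'BEGIN {
    if (b <= 0) { print "n/a"; exit }
    printf "%.1f%%\n", (b - o) / b * 100
  }'
}
```

For example, `throughput_improvement_pct 591.783 892.783` corresponds to the concurrency-8 comparison used in the examples on this page.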

Usage Examples

Example 1: Re-run baseline benchmark with optimized config

Execute the same Perf Analyzer command used for baselining:

# Baseline command (previously run, results saved)
# perf_analyzer -m densenet_onnx --concurrency-range 1:8 --percentile=95

# Verification command (identical parameters, server now has optimized config)
perf_analyzer -m densenet_onnx --concurrency-range 1:8 --percentile=95

Example output (optimized; values are illustrative and depend on the model and hardware):

Inferences/Second vs. Client p95 Batch Latency
Concurrency: 1, throughput: 312.458 infer/sec, latency 3245 usec
Concurrency: 2, throughput: 589.721 infer/sec, latency 3438 usec
Concurrency: 3, throughput: 783.502 infer/sec, latency 3891 usec
Concurrency: 4, throughput: 872.104 infer/sec, latency 4652 usec
Concurrency: 5, throughput: 889.337 infer/sec, latency 5710 usec
Concurrency: 6, throughput: 891.205 infer/sec, latency 6834 usec
Concurrency: 7, throughput: 892.410 infer/sec, latency 7964 usec
Concurrency: 8, throughput: 892.783 infer/sec, latency 9105 usec

Comparison with baseline:

Concurrency | Baseline (inf/s) | Optimized (inf/s) | Improvement
------------|-------------------|--------------------|-----------
1           | 265.147           | 312.458            | +17.8%
4           | 577.803           | 872.104            | +50.9%
8           | 591.783           | 892.783            | +50.9%
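
A comparison like the one above can be generated from two saved Perf Analyzer outputs instead of being assembled by hand. The sketch below assumes the summary lines have exactly the `Concurrency: N, throughput: X infer/sec, latency Y usec` format shown in the example output and that each run was redirected to its own file; `compare_runs` is a hypothetical name.

```shell
# compare_runs BASELINE_FILE OPTIMIZED_FILE
# Prints "concurrency | baseline | optimized | improvement" for every
# concurrency level that appears in both files.
compare_runs() {
  awk '
    /^Concurrency: [0-9]+, throughput: / {
      c = $2; sub(/,$/, "", c)     # "1," -> "1"
      t = $4                       # throughput in infer/sec
      if (FILENAME == ARGV[1]) { base[c] = t; next }   # first file: baseline
      if ((c in base) && base[c] > 0)
        printf "%s | %s | %s | %+.1f%%\n", c, base[c], t, (t - base[c]) / base[c] * 100
    }
  ' "$1" "$2"
}
```

Usage would look like `compare_runs baseline.txt optimized.txt`, where both files hold the console output of the same Perf Analyzer command.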

Example 2: Collect Prometheus metrics

Query the Triton metrics endpoint for runtime performance data:

# Fetch all metrics
curl -s localhost:8002/metrics

# Filter for inference-specific metrics
curl -s localhost:8002/metrics | grep "nv_inference"

# Filter for GPU metrics
curl -s localhost:8002/metrics | grep "nv_gpu"

Example output (filtered; values are illustrative):

# HELP nv_inference_request_success Number of successful inference requests
# TYPE nv_inference_request_success counter
nv_inference_request_success{model="densenet_onnx",version="1"} 48523

# HELP nv_inference_count Number of inferences performed
# TYPE nv_inference_count counter
nv_inference_count{model="densenet_onnx",version="1"} 194092

# HELP nv_gpu_utilization GPU utilization rate (0.0 - 1.0)
# TYPE nv_gpu_utilization gauge
nv_gpu_utilization{gpu_uuid="GPU-abc123"} 0.847

# HELP nv_gpu_memory_used_bytes GPU memory used in bytes
# TYPE nv_gpu_memory_used_bytes gauge
nv_gpu_memory_used_bytes{gpu_uuid="GPU-abc123"} 2147483648
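
The raw metric lines are easier to compare between runs once they are reduced to a few derived figures. The sketch below reads metrics text on standard input and prints GPU utilization as a percentage, GPU memory in GiB, and the average number of inferences per request (`nv_inference_count` divided by `nv_inference_request_success`). `summarize_metrics` is a hypothetical name, and the request and inference counters are summed over every model in the input, so it assumes a server that serves a single model.

```shell
# summarize_metrics: read Prometheus text from stdin, print derived figures.
summarize_metrics() {
  awk '
    /^nv_inference_request_success[{]/ { requests += $NF }
    /^nv_inference_count[{]/           { inferences += $NF }
    /^nv_gpu_utilization[{]/           { printf "gpu_utilization_pct %.1f\n", $NF * 100 }
    /^nv_gpu_memory_used_bytes[{]/     { printf "gpu_memory_used_gib %.2f\n", $NF / (1024 * 1024 * 1024) }
    END {
      # Inferences per request reflects the effective batch size.
      if (requests > 0)
        printf "avg_inferences_per_request %.2f\n", inferences / requests
    }
  '
}
```

It can be fed directly from the endpoint, for example `curl -s localhost:8002/metrics | summarize_metrics`. With more than one GPU, one utilization line and one memory line are printed per device.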

Example 3: Complete verification workflow

Full verification workflow combining Perf Analyzer re-run and Prometheus metrics:

# Step 1: Run verification benchmark
echo "=== Running verification benchmark ==="
perf_analyzer -m densenet_onnx --concurrency-range 1:8 --percentile=95

# Step 2: Collect Prometheus metrics
echo "=== Collecting Prometheus metrics ==="
curl -s localhost:8002/metrics | grep -E "nv_inference|nv_gpu"

# Step 3: Compare with baseline (manual comparison or scripted)
echo "=== Compare results with baseline ==="
echo "Baseline throughput at concurrency 8: 591.783 infer/sec"
echo "Optimized throughput at concurrency 8: 892.783 infer/sec"
echo "Improvement: +50.9%"
