Implementation:Triton inference server Server Perf Analyzer Verification
| Field | Value |
|---|---|
| Page Type | Implementation |
| Title | Perf_Analyzer_Verification |
| Namespace | Triton_inference_server_Server |
| Domains | Performance, Quality_Assurance |
| External Dependencies | perf_analyzer from nvcr.io/nvidia/tritonserver:<version>-py3-sdk or triton-model-analyzer pip package; curl for Prometheus metrics endpoint |
| Last Updated | 2026-02-13 17:00 GMT |
Overview
Concrete re-benchmarking procedure using Perf Analyzer and Prometheus metrics to verify optimization gains. This implementation covers both the benchmark comparison (re-running Perf Analyzer with identical parameters) and runtime metrics collection (querying the Prometheus endpoint) to confirm that applied optimizations yield measurable improvement.
Description
Performance verification consists of two complementary procedures:
1. Benchmark Re-run: Execute the same Perf Analyzer command used during baselining against the server running with the optimized configuration. The identical parameters ensure a valid comparison.
2. Prometheus Metrics Collection: Query the Triton metrics endpoint to collect runtime performance counters, GPU utilization, and memory usage data that complement the point-in-time benchmark.
The combination of these two data sources provides a comprehensive view of optimization impact: Perf Analyzer gives controlled, reproducible throughput and latency measurements, while Prometheus metrics show resource utilization and operational health.
Usage
CLI Signature (Perf Analyzer Re-run)
# Run identical benchmark as baseline (same parameters)
perf_analyzer -m <model_name> \
--concurrency-range <start:end> \
--percentile=95 \
[-u <host:port>] \
[-b <batch_size>] \
[--shape <input_name>:<d1>,<d2>,...] \
[--input-data <file>]
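Because the comparison is only valid when both runs use identical parameters, one way to keep them from drifting is to define the command once and reuse it for both runs. The following is a minimal sketch, not part of the Triton tooling: the function name `run_benchmark`, the variable names, and the `perf_<label>.txt` file naming are hypothetical, and the defaults mirror the example values used on this page.

```shell
# Hypothetical wrapper: the benchmark parameters are defined once and
# reused for both the baseline run and the verification run.
MODEL="${MODEL:-densenet_onnx}"
CONCURRENCY_RANGE="${CONCURRENCY_RANGE:-1:8}"
PERCENTILE="${PERCENTILE:-95}"

# Usage: run_benchmark <label>   (label is e.g. "baseline" or "optimized")
run_benchmark() {
    label=$1
    # tee keeps the console output and also saves it for later comparison
    perf_analyzer -m "$MODEL" \
        --concurrency-range "$CONCURRENCY_RANGE" \
        --percentile="$PERCENTILE" | tee "perf_${label}.txt"
}
```

Calling `run_benchmark baseline` before the change and `run_benchmark optimized` after it would leave two text files produced by the same command line.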
CLI Signature (Prometheus Metrics)
# Query Triton Prometheus metrics endpoint
curl localhost:8002/metrics
Key Parameters
Perf Analyzer (same as baseline)
| Parameter | Description | Default |
|---|---|---|
| -m | Model name (must match baseline) | (required) |
| --concurrency-range | Concurrency sweep range (must match baseline) | 1 |
| --percentile | Latency percentile (must match baseline) | None |
| -u | Server URL | localhost:8000 |
| -b | Batch size (must match baseline) | 1 |
Triton Metrics Endpoint
| Parameter | Description | Default |
|---|---|---|
| --allow-metrics | Enable/disable metrics endpoint (Triton server flag) | true |
| --metrics-port | Port for metrics HTTP endpoint (Triton server flag) | 8002 |
| --metrics-interval-ms | Metrics collection interval in milliseconds (Triton server flag) | 2000 |
Key Prometheus Metrics
| Metric Name | Type | Description |
|---|---|---|
| nv_inference_request_success | Counter | Total number of successful inference requests |
| nv_inference_count | Counter | Total number of inferences performed (accounts for batch size) |
| nv_inference_request_duration_us | Counter | Cumulative end-to-end request duration in microseconds |
| nv_gpu_utilization | Gauge | GPU compute utilization (0.0 to 1.0) |
| nv_gpu_memory_used_bytes | Gauge | GPU memory currently in use in bytes |
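The counters in this table are cumulative, so a single scrape mixes the verification run with any earlier traffic. A common way to isolate the run is to scrape once before and once after it and work with the differences. The sketch below assumes two such scrapes saved to files, and it assumes failed requests are negligible so that dividing the duration delta by the success delta approximates the average request latency. The function name `avg_request_latency_us` is hypothetical.

```shell
# Average end-to-end request latency (microseconds) for one model over the
# window between two saved scrapes of the metrics endpoint.
# Values for multiple versions of the same model are summed.
# Usage: avg_request_latency_us <before_file> <after_file> <model_name>
avg_request_latency_us() {
    awk -v model="$3" '
        # Keep only sample lines for the requested model (skips HELP/TYPE lines)
        index($0, "model=\"" model "\"") == 0 { next }
        {
            split($1, parts, "{")                  # parts[1] = metric name without labels
            sign = (FILENAME == ARGV[1]) ? -1 : 1  # subtract "before", add "after"
            delta[parts[1]] += sign * $NF
        }
        END {
            reqs = delta["nv_inference_request_success"]
            if (reqs <= 0) { print "no new requests between scrapes" > "/dev/stderr"; exit 1 }
            printf "%.1f\n", delta["nv_inference_request_duration_us"] / reqs
        }
    ' "$1" "$2"
}
```

For example, save `curl -s localhost:8002/metrics` to `before.prom`, run the benchmark, save a second scrape to `after.prom`, then call `avg_request_latency_us before.prom after.prom densenet_onnx`.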
Code Reference
Source Location
- docs/user_guide/performance_tuning.md:L123-125 -- Verification step in the performance tuning workflow
- docs/user_guide/performance_tuning.md:L383-393 -- Post-optimization verification instructions
- docs/user_guide/metrics.md:L28-351 -- Triton Prometheus metrics reference
Import / Installation
# Perf Analyzer (same installation as baseline)
# Option 1: Triton SDK container
docker run --rm --net=host nvcr.io/nvidia/tritonserver:<version>-py3-sdk \
perf_analyzer -m <model_name> --concurrency-range 1:8 --percentile=95
# Option 2: pip install
pip install triton-model-analyzer
# Prometheus metrics: no additional installation needed (built into Triton server)
# curl is used to query the metrics endpoint
I/O Contract
Inputs
| Input | Type | Required | Description |
|---|---|---|---|
| Running server with optimized config | Service | Yes | Triton Inference Server running with the optimized model configuration applied |
| Baseline results | Data | Yes | Throughput and latency measurements from the baseline step for comparison |
| Model name | String | Yes | Name of the model being verified (must match baseline) |
| Concurrency range | String | Yes | Same concurrency range used in baseline measurement |
| Input data | File (JSON) | No | Same input data used in baseline measurement (if applicable) |
Outputs
| Output | Type | Description |
|---|---|---|
| Optimized throughput | Float (inferences/sec) | Throughput at each concurrency level with the optimized configuration |
| Optimized latency (p95/p99) | Integer (microseconds) | Latency percentiles at each concurrency level with the optimized configuration |
| Throughput improvement | Percentage | (optimized - baseline) / baseline × 100, computed per concurrency level |
| Latency reduction | Percentage | (baseline - optimized) / baseline × 100, computed per concurrency level |
| Prometheus metrics | Text (Prometheus format) | Runtime metrics including GPU utilization and memory usage |
Usage Examples
Example 1: Re-run baseline benchmark with optimized config
Execute the same Perf Analyzer command used for baselining:
# Baseline command (previously run, results saved)
# perf_analyzer -m densenet_onnx --concurrency-range 1:8 --percentile=95
# Verification command (identical parameters, server now has optimized config)
perf_analyzer -m densenet_onnx --concurrency-range 1:8 --percentile=95
Expected output (optimized):
Inferences/Second vs. Client p95 Batch Latency
Concurrency: 1, throughput: 312.458 infer/sec, latency 3245 usec
Concurrency: 2, throughput: 589.721 infer/sec, latency 3438 usec
Concurrency: 3, throughput: 783.502 infer/sec, latency 3891 usec
Concurrency: 4, throughput: 872.104 infer/sec, latency 4652 usec
Concurrency: 5, throughput: 889.337 infer/sec, latency 5710 usec
Concurrency: 6, throughput: 891.205 infer/sec, latency 6834 usec
Concurrency: 7, throughput: 892.410 infer/sec, latency 7964 usec
Concurrency: 8, throughput: 892.783 infer/sec, latency 9105 usec
Comparison with baseline:
| Concurrency | Baseline (inf/s) | Optimized (inf/s) | Improvement |
|---|---|---|---|
| 1 | 265.147 | 312.458 | +17.8% |
| 4 | 577.803 | 872.104 | +50.9% |
| 8 | 591.783 | 892.783 | +50.9% |
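A comparison like this can be produced from saved output rather than by hand. The sketch below assumes the standard output of each run was saved to a text file (for example with `tee`) and that the summary lines have the `Concurrency: N, throughput: X infer/sec, latency Y usec` form shown in the expected output. The function name `compare_runs` is hypothetical.

```shell
# Print baseline and optimized throughput per concurrency level, with the
# relative change, from two saved Perf Analyzer outputs.
# Only levels present in both files are printed.
# Usage: compare_runs <baseline_output_file> <optimized_output_file>
compare_runs() {
    awk '
        /^[ \t]*Concurrency: [0-9]+, throughput:/ {
            level = $2 + 0        # "1," converts to the number 1
            tput  = $4 + 0        # throughput in infer/sec
            if (FILENAME == ARGV[1]) {
                base[level] = tput
            } else if ((level in base) && base[level] > 0) {
                printf "%d | %.3f | %.3f | %+.1f%%\n", level, base[level], tput, (tput - base[level]) / base[level] * 100
            }
        }
    ' "$1" "$2"
}
```

Each output row has the columns concurrency, baseline throughput, optimized throughput, and improvement.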
Example 2: Collect Prometheus metrics
Query the Triton metrics endpoint for runtime performance data:
# Fetch all metrics
curl -s localhost:8002/metrics
# Filter for inference-specific metrics
curl -s localhost:8002/metrics | grep "nv_inference"
# Filter for GPU metrics
curl -s localhost:8002/metrics | grep "nv_gpu"
Expected output (filtered):
# HELP nv_inference_request_success Number of successful inference requests
# TYPE nv_inference_request_success counter
nv_inference_request_success{model="densenet_onnx",version="1"} 48523
# HELP nv_inference_count Number of inferences performed
# TYPE nv_inference_count counter
nv_inference_count{model="densenet_onnx",version="1"} 194092
# HELP nv_gpu_utilization GPU utilization rate (0.0 - 1.0)
# TYPE nv_gpu_utilization gauge
nv_gpu_utilization{gpu_uuid="GPU-abc123"} 0.847
# HELP nv_gpu_memory_used_bytes GPU memory used in bytes
# TYPE nv_gpu_memory_used_bytes gauge
nv_gpu_memory_used_bytes{gpu_uuid="GPU-abc123"} 2147483648
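Since nv_inference_count accounts for batch size while nv_inference_request_success counts requests, their ratio gives the average number of inferences per request. In the sample output this is 194092 / 48523 = 4. A ratio that differs from the batch size passed with -b may indicate that other traffic also contributed to the counters. The sketch below computes this ratio from a scrape on standard input; the function name `inferences_per_request` is hypothetical.

```shell
# Average number of inferences per successful request for one model,
# read from a metrics scrape on stdin.
# Usage: curl -s localhost:8002/metrics | inferences_per_request <model_name>
inferences_per_request() {
    awk -v model="$1" '
        # Skip HELP/TYPE comment lines and samples for other models
        index($0, "model=\"" model "\"") == 0 { next }
        { split($1, parts, "{"); total[parts[1]] += $NF }
        END {
            reqs = total["nv_inference_request_success"]
            if (reqs <= 0) { exit 1 }   # no successful requests recorded
            printf "%.2f\n", total["nv_inference_count"] / reqs
        }
    '
}
```

Applied to the sample output above with the model name densenet_onnx, this prints 4.00.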
Example 3: Complete verification workflow
Full verification workflow combining Perf Analyzer re-run and Prometheus metrics:
# Step 1: Run verification benchmark
echo "=== Running verification benchmark ==="
perf_analyzer -m densenet_onnx --concurrency-range 1:8 --percentile=95
# Step 2: Collect Prometheus metrics
echo "=== Collecting Prometheus metrics ==="
curl -s localhost:8002/metrics | grep -E "nv_inference|nv_gpu"
# Step 3: Compare with baseline (manual comparison or scripted)
echo "=== Compare results with baseline ==="
echo "Baseline throughput at concurrency 8: 591.783 infer/sec"
echo "Optimized throughput at concurrency 8: 892.783 infer/sec"
echo "Improvement: +50.9%"
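Step 3 can also be scripted as a pass/fail check, which is convenient when verification runs in an automated pipeline. The following is a minimal sketch under the assumption that the two throughput values have already been extracted from the benchmark output. The function name `check_improvement` and the idea of a minimum required percentage are illustrative and not part of the Triton tooling.

```shell
# Exit status 0 if optimized throughput exceeds baseline by at least
# <min_pct> percent, 1 if it does not, 2 on invalid input.
# Usage: check_improvement <baseline> <optimized> <min_pct>
check_improvement() {
    awk -v b="$1" -v o="$2" -v min="$3" 'BEGIN {
        if (b <= 0) { print "baseline must be positive" > "/dev/stderr"; exit 2 }
        pct = (o - b) / b * 100
        printf "Improvement: %+.1f%% (required: %+.1f%%)\n", pct, min
        exit ((pct >= min) ? 0 : 1)
    }'
}
```

With the values from this example, `check_improvement 591.783 892.783 10` prints `Improvement: +50.9% (required: +10.0%)` and returns success.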
Related Pages
- Implements: Principle: Performance_Verification -- implements::Principle:Triton_inference_server_Server_Performance_Verification