Principle:Triton inference server Server Performance Baseline
| Field | Value |
|---|---|
| Page Type | Principle |
| Title | Performance_Baseline |
| Namespace | Triton_inference_server_Server |
| Knowledge Sources | Triton Server (https://github.com/triton-inference-server/server); Perf Analyzer documentation (https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/client/perf_analyzer.html) |
| Domains | Performance, Model_Serving, Benchmarking |
| Last Updated | 2026-02-13 17:00 GMT |
Overview
Performance baselining is the process of establishing reference throughput and latency measurements for a deployed model before optimization. It is the first step in any model performance tuning workflow because it provides the quantitative reference against which all subsequent optimizations are measured.
Description
Performance baselining measures a model's inference throughput and latency at various concurrency levels before any optimization. This provides a reference point for quantifying improvement. Key metrics include inferences per second (throughput) and p95/p99 latency at each concurrency level. The baseline must be collected under controlled conditions with consistent hardware, input data, and measurement intervals.
A well-constructed baseline captures the relationship between offered load (concurrency) and observed performance (throughput, latency). As concurrency increases, throughput typically rises until a saturation point, after which additional concurrency yields diminishing returns and increased latency due to queuing. Identifying this saturation point is essential for understanding the model's default capacity.
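One way to locate the saturation point from a concurrency sweep is to look for the first level where the next step adds little throughput. The sketch below assumes the sweep is available as `(concurrency, throughput)` pairs; the function name and the 5% `min_gain` threshold are illustrative choices, not part of Triton or Perf Analyzer.

```python
from typing import Sequence, Tuple


def find_saturation_point(
    sweep: Sequence[Tuple[int, float]],
    min_gain: float = 0.05,
) -> int:
    """Return the first concurrency level after which throughput stops scaling.

    `sweep` holds (concurrency, throughput) pairs sorted by concurrency.
    `min_gain` is an assumed threshold: a step counts as "still scaling"
    only if throughput rises by more than this fraction.
    """
    if not sweep:
        raise ValueError("sweep must contain at least one measurement")
    for (conc, tput), (_, next_tput) in zip(sweep, sweep[1:]):
        if next_tput <= tput * (1.0 + min_gain):
            return conc
    # Throughput was still rising at the last measured level.
    return sweep[-1][0]


# Illustrative numbers only, not measurements from a real model.
example_sweep = [(1, 100.0), (2, 195.0), (4, 370.0), (8, 385.0), (16, 388.0)]
saturation = find_saturation_point(example_sweep)
```

If the function returns the last measured level, the sweep probably did not reach saturation and should be extended to higher concurrency.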
Key metrics to capture during baselining:
- Throughput (inferences/sec) at each concurrency level
- p95 latency (microseconds) representing the 95th percentile response time
- p99 latency (microseconds) representing the 99th percentile response time
- GPU utilization during sustained load
- GPU memory consumption under peak load
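To make the percentile metrics concrete, the following sketch computes p95 and p99 from a list of per-request latencies using the nearest-rank method. This is an assumption for illustration: measurement tools may use a different percentile definition, and the sample data here is synthetic.

```python
import math
from typing import Sequence


def percentile_nearest_rank(samples_us: Sequence[float], pct: float) -> float:
    """Return the `pct`-th percentile of latency samples (nearest-rank method)."""
    if not samples_us:
        raise ValueError("at least one latency sample is required")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in the range (0, 100]")
    ordered = sorted(samples_us)
    # Nearest rank: the smallest value with at least pct% of samples at or below it.
    rank = math.ceil(pct * len(ordered) / 100.0)
    return ordered[rank - 1]


# Synthetic latencies of 1..100 microseconds, for illustration only.
latencies_us = [float(v) for v in range(1, 101)]
p95 = percentile_nearest_rank(latencies_us, 95)
p99 = percentile_nearest_rank(latencies_us, 99)
```

Tail percentiles are reported alongside throughput because an average can hide the slow requests that queuing produces near saturation.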
Usage
Performance baselining is used in the following scenarios:
- Before optimization -- Establish a reference measurement before changing any model configuration parameters such as instance count, batch size, or dynamic batching settings.
- After deployment -- Validate that a newly deployed model meets minimum performance requirements on the target hardware.
- Hardware comparison -- Compare the same model across different GPU types (e.g., T4 vs A100) to inform infrastructure decisions.
- Regression detection -- Re-run baselines after framework or driver updates to detect performance regressions.
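For the regression-detection scenario, a comparison between a stored baseline and a new run can be sketched as follows. The dictionary layout, the field names `throughput` and `p99_us`, and the 5% tolerance are hypothetical choices made for this example.

```python
from typing import Dict, List

Metrics = Dict[str, float]


def find_regressions(
    baseline: Dict[int, Metrics],
    candidate: Dict[int, Metrics],
    tolerance: float = 0.05,
) -> List[str]:
    """Compare two sweeps keyed by concurrency and list metrics that got worse.

    A metric is flagged only if it moves in the bad direction by more than
    `tolerance` (a fraction), which absorbs normal run-to-run noise.
    """
    findings: List[str] = []
    for conc in sorted(baseline):
        if conc not in candidate:
            continue  # Level was not measured in the new run.
        base, cand = baseline[conc], candidate[conc]
        if cand["throughput"] < base["throughput"] * (1.0 - tolerance):
            findings.append(f"concurrency {conc}: throughput regressed")
        if cand["p99_us"] > base["p99_us"] * (1.0 + tolerance):
            findings.append(f"concurrency {conc}: p99 latency regressed")
    return findings


# Illustrative numbers only.
baseline_run = {
    1: {"throughput": 100.0, "p99_us": 12000.0},
    4: {"throughput": 370.0, "p99_us": 15000.0},
}
after_update = {
    1: {"throughput": 99.0, "p99_us": 12100.0},
    4: {"throughput": 330.0, "p99_us": 17000.0},
}
regressions = find_regressions(baseline_run, after_update)
```

The tolerance should be chosen from the observed variation between repeated baseline runs on the same hardware, rather than taken from this example.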
Baseline collection requirements:
- Use consistent input data (synthetic or representative real data)
- Warm up the model with several inference requests before measurement
- Use sufficiently long measurement intervals (at least 10 seconds per concurrency level)
- Sweep concurrency from 1 to the expected maximum concurrent request count
- Record all metrics at each concurrency level for later comparison
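The requirements above map onto a Perf Analyzer invocation roughly as sketched below. The helper only builds the command line and does not run it. The model name `my_model` and the output file name are placeholders, and the flags used (`-m`, `--concurrency-range`, `--measurement-interval`, `--percentile`, `-f`) should be checked against the Perf Analyzer version in use. Warm-up is not handled here.

```python
import shlex
from typing import List


def build_perf_analyzer_command(
    model_name: str,
    max_concurrency: int,
    interval_ms: int = 10000,
    percentile: int = 95,
    csv_path: str = "baseline.csv",
) -> List[str]:
    """Build a perf_analyzer command for a concurrency sweep from 1 to max_concurrency."""
    if max_concurrency < 1:
        raise ValueError("max_concurrency must be at least 1")
    return [
        "perf_analyzer",
        "-m", model_name,
        # Sweep concurrency from 1 up to the expected maximum.
        "--concurrency-range", f"1:{max_concurrency}",
        # Measurement window per concurrency level, in milliseconds.
        "--measurement-interval", str(interval_ms),
        # Report the chosen percentile latency.
        f"--percentile={percentile}",
        # Write per-level results to a CSV file for later comparison.
        "-f", csv_path,
    ]


# "my_model" is a placeholder model name.
command = build_perf_analyzer_command("my_model", max_concurrency=16)
command_line = shlex.join(command)
```

Keeping the command as a list makes it straightforward to pass to `subprocess.run` later, and storing `command_line` with the results records exactly how the baseline was collected.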
Theoretical Basis
Baselining is a systematic measurement: fix the input parameters, sweep the concurrency levels, and record a throughput/latency pair at each level. The resulting concurrency-vs-throughput curve reveals saturation points and queuing effects.
The theoretical foundation rests on queuing theory. Each Triton model instance acts as a server in a queuing system, and the concurrency level sets how many requests are outstanding at once. When more requests are outstanding than the instances can execute simultaneously, the excess requests wait in the queue, so latency rises while throughput stays flat. The baseline captures this behavior empirically:
- Linear region -- Throughput scales roughly linearly with concurrency. Each additional concurrent request finds an available execution slot.
- Saturation region -- Throughput plateaus as all execution resources (GPU cores, model instances) are fully utilized.
- Overload region -- Additional concurrency causes queue buildup, sharply increasing latency without meaningful throughput gain.
The concurrency level at the transition from linear to saturation defines the model's natural parallelism under its default configuration. Optimization techniques (instance groups, dynamic batching) aim to shift this saturation point to higher throughput values.
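The regions described above can be illustrated with an idealized closed-loop model based on Little's law (outstanding requests = throughput × latency). The sketch assumes a fixed number of execution slots and a constant service time per request. Both are simplifications, since real models show batching effects, variable service times, and client overhead.

```python
import math
from typing import Tuple


def ideal_closed_loop(
    concurrency: int,
    slots: int,
    service_time_s: float,
) -> Tuple[float, float]:
    """Return (throughput in inferences/sec, latency in seconds) for an ideal system.

    `slots` is the number of requests that can execute at the same time
    (for example, the model instance count); `service_time_s` is the assumed
    constant time to serve one request.
    """
    if concurrency < 1 or slots < 1 or service_time_s <= 0:
        raise ValueError("concurrency and slots must be >= 1, service time > 0")
    busy_slots = min(concurrency, slots)
    throughput = busy_slots / service_time_s
    # Little's law: concurrency = throughput * latency.
    latency_s = concurrency / throughput
    return throughput, latency_s


# Two slots and a 10 ms service time: illustrative values only.
curve = {n: ideal_closed_loop(n, slots=2, service_time_s=0.01) for n in (1, 2, 4, 8)}
```

In this model throughput rises with concurrency until every slot is busy, then stays flat, while latency stays at the service time up to the slot count and grows once requests have to wait for a free slot. Adding slots moves the plateau to a higher throughput.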