Heuristic:Triton inference server Server Concurrency Throughput Rule
| Knowledge Sources | |
|---|---|
| Domains | Optimization, Inference_Serving |
| Last Updated | 2026-02-13 17:00 GMT |
Overview
The 2x rule for setting request concurrency in perf_analyzer: throughput generally peaks at concurrency = 2 * max_batch_size * model_instance_count. A companion rule covers the minimum-latency configuration.
Description
When benchmarking Triton with Performance Analyzer (perf_analyzer), two simple rules for setting request concurrency predict the minimum-latency and maximum-throughput configurations. They apply when perf_analyzer runs on the same system as Triton, which removes network latency as a variable.
The first rule establishes the minimum latency configuration. The second rule provides the concurrency setting that saturates the GPU pipeline for maximum throughput.
Usage
Use this heuristic when running perf_analyzer to benchmark model serving performance. Apply the minimum latency rule to establish a baseline, then the maximum throughput rule to find the optimal operating point.
The Insight (Rule of Thumb)
- Rule 1 (Minimum Latency): Set concurrency to 1, disable the dynamic batcher, and use 1 model instance.
- Rule 2 (Maximum Throughput): Set concurrency to 2 * <max_batch_size> * <model_instance_count>.
  - Example: For max_batch_size=4 with 1 instance: concurrency = 2 * 4 * 1 = 8
  - Example: For max_batch_size=4 with 2 instances: concurrency = 2 * 4 * 2 = 16
- Trade-off: The minimum-latency configuration leaves the GPU underutilized. The maximum-throughput configuration increases per-request latency but processes far more requests per second.
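The arithmetic in Rule 2 can be sketched as a small shell helper. The function name `throughput_concurrency` and its argument order are illustrative and not part of Triton or perf_analyzer; the two inputs are assumed to be the `max_batch_size` from the model configuration and the number of model instances.

```shell
# Hypothetical helper: concurrency suggested by the 2x rule.
# $1 = max_batch_size, $2 = model instance count.
throughput_concurrency() (
  max_batch_size=$1
  instance_count=$2
  echo $(( 2 * max_batch_size * instance_count ))
)

throughput_concurrency 4 1   # prints 8
throughput_concurrency 4 2   # prints 16
```

The printed value is what would be passed to perf_analyzer's `--concurrency-range` option when measuring the maximum-throughput configuration.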
Reasoning
The factor of 2 in the throughput rule ensures that while one batch is being computed on the GPU, the next batch is being assembled and its data is being transferred. This keeps the GPU pipeline full with no idle gaps between batches.
With concurrency = 1, Triton is idle during the round-trip time between sending a response and receiving the next request. Concurrency = 2 hides this communication latency. Beyond that, additional concurrency fills batches more completely.
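To make the idle-time argument concrete, here is a rough sketch based on the standard closed-loop bound: with a given number of requests in flight, a per-request service time, and a round-trip gap, server utilization cannot exceed in_flight * service / (service + gap), capped at 100%. The helper name and the 3000 usec and 1000 usec figures are hypothetical, chosen only for illustration; they are not Triton measurements, and this toy model ignores batching entirely.

```shell
# Hypothetical helper: upper bound on server utilization, in whole percent.
# $1 = requests in flight, $2 = service time (usec), $3 = round-trip gap (usec).
utilization_bound_pct() (
  in_flight=$1
  service_us=$2
  gap_us=$3
  # Integer arithmetic, so the result is truncated to a whole percent.
  pct=$(( 100 * in_flight * service_us / (service_us + gap_us) ))
  # Utilization cannot exceed 100%.
  if [ "$pct" -gt 100 ]; then pct=100; fi
  echo "$pct"
)

utilization_bound_pct 1 3000 1000   # prints 75: idle a quarter of the time
utilization_bound_pct 2 3000 1000   # prints 100: the gap is hidden
```

In this toy model a second in-flight request hides the gap whenever the gap is no longer than the service time, which is consistent with the reasoning above for a client on the same machine as the server.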
Empirical evidence from docs/user_guide/optimization.md:128-145:
```
# Using the 2x rule: max_batch_size=4, instances=1 -> concurrency=8
$ perf_analyzer -m inception_onnx --percentile=95 --concurrency-range 8
...
Concurrency: 8, throughput: 267.8 infer/sec, latency 35590 usec
```
This matches the throughput found by sweeping concurrency from 1 to 8, so in this case the rule works as a shortcut for the full sweep.
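The comparison between the full sweep and the shortcut can be sketched as a dry run that prints the two invocations rather than executing them. The helper name is hypothetical, and the sketch assumes the `start:end` form of `--concurrency-range` for the sweep; check the perf_analyzer help output for the version in use before relying on it.

```shell
# Hypothetical helper: print (do not run) the two invocations being compared.
# $1 = model name, $2 = concurrency suggested by the 2x rule.
print_comparison_commands() (
  model=$1
  target=$2
  # Full sweep from concurrency 1 up to the target value.
  echo "perf_analyzer -m $model --percentile=95 --concurrency-range 1:$target"
  # Shortcut: measure only at the value suggested by the rule.
  echo "perf_analyzer -m $model --percentile=95 --concurrency-range $target"
)

print_comparison_commands inception_onnx 8
```

Running both printed commands on the same machine as Triton and comparing the reported throughput at the highest concurrency is one way to check whether the rule holds for a given model.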