Heuristic:Triton inference server Server Concurrency Throughput Rule

From Leeroopedia
Knowledge Sources
Domains Optimization, Inference_Serving
Last Updated 2026-02-13 17:00 GMT

Overview

The 2x rule for setting request concurrency in perf_analyzer: as a rule of thumb, throughput peaks at about concurrency = 2 * max_batch_size * model_instance_count.

Description

When benchmarking Triton with Performance Analyzer (perf_analyzer), two simple rules for setting request concurrency give good starting points for the minimum-latency and maximum-throughput configurations. These rules apply when perf_analyzer runs on the same system as Triton, which removes network latency as a variable.

The first rule establishes the minimum latency configuration. The second rule provides the concurrency setting that saturates the GPU pipeline for maximum throughput.

Usage

Use this heuristic when running perf_analyzer to benchmark model serving performance. Apply the minimum latency rule to establish a baseline, then the maximum throughput rule to find the optimal operating point.

The Insight (Rule of Thumb)

  • Rule 1 (Minimum Latency): Set concurrency to 1, disable the dynamic batcher, and use 1 model instance.
  • Rule 2 (Maximum Throughput): Set concurrency to 2 * <max_batch_size> * <model_instance_count>.
    • Example: For max_batch_size=4 with 1 instance: concurrency = 2 * 4 * 1 = 8
    • Example: For max_batch_size=4 with 2 instances: concurrency = 2 * 4 * 2 = 16
  • Trade-off: The minimum-latency configuration leaves the GPU underutilized. The maximum-throughput configuration increases per-request latency but processes far more requests per second.
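
The two rules above reduce to simple arithmetic. The following is a minimal sketch, not part of Triton or perf_analyzer: the function names are hypothetical helpers introduced here for illustration, and the command string only reuses the perf_analyzer flags that appear in the example later on this page.

```python
# Hypothetical helpers illustrating the two rules; not a Triton API.

def min_latency_concurrency() -> int:
    """Rule 1: a single outstanding request (with the dynamic batcher
    disabled and one model instance, which are set in the model config)."""
    return 1


def max_throughput_concurrency(max_batch_size: int, instance_count: int) -> int:
    """Rule 2: 2 * max_batch_size * model_instance_count."""
    # Assumption of this sketch: both values are positive integers.
    # Models with batching disabled are not covered here.
    if max_batch_size < 1 or instance_count < 1:
        raise ValueError("max_batch_size and instance_count must be >= 1")
    return 2 * max_batch_size * instance_count


def perf_analyzer_command(model_name: str, concurrency: int) -> str:
    """Build a command line using the flags shown in the example on this page."""
    return (
        f"perf_analyzer -m {model_name} --percentile=95 "
        f"--concurrency-range {concurrency}"
    )


print(max_throughput_concurrency(4, 1))  # 8
print(max_throughput_concurrency(4, 2))  # 16
print(perf_analyzer_command("inception_onnx", max_throughput_concurrency(4, 1)))
```

The result is a starting point for a benchmark run, not a guarantee; a short sweep around the computed value is still the way to confirm it for a given model.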

Reasoning

The factor of 2 in the throughput rule is meant to keep a second full batch in flight: while one batch is being computed on the GPU, the next batch is being assembled and its data is being transferred. This keeps the GPU pipeline fed and avoids idle gaps between batches.

With concurrency = 1, Triton is idle during the round-trip time between sending a response and receiving the next request. Concurrency = 2 hides this communication latency. Beyond that, additional concurrency fills batches more completely.
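
The idle-time argument can be illustrated with a toy closed-loop model. This is a sketch under stated assumptions, not a description of Triton internals: the server handles one request at a time (no batching), compute time and round-trip time are constant, and the timing values below are hypothetical numbers chosen only for illustration.

```python
# Toy model: each of `concurrency` client slots sends a request, waits for the
# response, then takes `round_trip_s` before its next request reaches the server.

def closed_loop_throughput(concurrency: int, compute_s: float, round_trip_s: float) -> float:
    """Upper bound on requests/sec for a one-request-at-a-time server."""
    # Each slot can complete at most one request per (compute + round trip) cycle.
    offered = concurrency / (compute_s + round_trip_s)
    # The server cannot finish more than one request per compute interval.
    capacity = 1.0 / compute_s
    return min(offered, capacity)


def server_utilization(concurrency: int, compute_s: float, round_trip_s: float) -> float:
    """Fraction of time the server spends computing in this model."""
    return closed_loop_throughput(concurrency, compute_s, round_trip_s) * compute_s


# Hypothetical timings: 10 ms of compute, 2 ms of round trip.
for c in (1, 2):
    print(c, f"{server_utilization(c, 0.010, 0.002):.2f}")
# prints:
# 1 0.83
# 2 1.00
```

In this model a second outstanding request is enough to cover the gap only because the round trip is shorter than the compute time, which is the situation the same-system assumption is meant to ensure.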

Empirical evidence from docs/user_guide/optimization.md:128-145:

# Using the 2x rule: max_batch_size=4, instances=1 -> concurrency=8
$ perf_analyzer -m inception_onnx --percentile=95 --concurrency-range 8
...
Concurrency: 8, throughput: 267.8 infer/sec, latency 35590 usec

This matches the throughput found by sweeping concurrency from 1 to 8, which supports using the rule as a shortcut instead of running the full sweep.
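
To compare a sweep against the rule, the summary lines can be parsed and the best throughput picked out. This is a minimal sketch that assumes the summary line format is exactly the one shown above; `parse_summary` and `best_by_throughput` are hypothetical helper names, and the format may differ across perf_analyzer versions.

```python
import re

# Matches lines of the form shown above, e.g.
# "Concurrency: 8, throughput: 267.8 infer/sec, latency 35590 usec"
SUMMARY_RE = re.compile(
    r"Concurrency:\s*(\d+),\s*throughput:\s*([\d.]+)\s*infer/sec,"
    r"\s*latency\s*(\d+)\s*usec"
)


def parse_summary(text: str) -> list[tuple[int, float, int]]:
    """Return (concurrency, throughput in infer/sec, latency in usec) tuples."""
    return [(int(c), float(t), int(lat)) for c, t, lat in SUMMARY_RE.findall(text)]


def best_by_throughput(rows: list[tuple[int, float, int]]) -> tuple[int, float, int]:
    """Pick the row with the highest throughput."""
    return max(rows, key=lambda row: row[1])


line = "Concurrency: 8, throughput: 267.8 infer/sec, latency 35590 usec"
print(parse_summary(line))  # [(8, 267.8, 35590)]
```

Feeding the full output of a sweep into `parse_summary` and then `best_by_throughput` gives the concurrency to compare with the value predicted by the 2x rule.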
