Heuristic:Triton inference server Server Model Instance Scaling

From Leeroopedia
Knowledge Sources
Domains Optimization, Inference_Serving
Last Updated 2026-02-13 17:00 GMT

Overview

Two model instances typically provide the best throughput by overlapping memory transfer with GPU compute; more instances help only for small-compute models.

Description

Triton allows configuring multiple copies (instances) of each model via the instance_group setting in config.pbtxt. Each instance can process inference requests independently, allowing parallelism on the GPU. The optimal instance count depends on the model's computational footprint and memory transfer patterns.

For most models, two instances hit the sweet spot: while one instance computes on the GPU, the other can transfer data to/from GPU memory. Additional instances beyond two rarely help for large models but can benefit small models that underutilize the GPU with just two instances.

Usage

Use this heuristic when tuning instance_group count in config.pbtxt. Start with 2 instances, benchmark with perf_analyzer, and experiment with higher counts only for models with small computational footprints.
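
The tuning loop above can be sketched as a small helper that renders the instance_group line for each candidate count and builds a matching perf_analyzer command. This is a minimal sketch, not an official workflow: the helper names and the model name "my_model" are hypothetical. It only uses the perf_analyzer flags -m and --concurrency-range, and it assumes the server is reloaded with the new config.pbtxt before each run.

```python
from typing import List


def instance_group_stanza(count: int) -> str:
    """Render the instance_group setting for config.pbtxt (hypothetical helper)."""
    if count < 1:
        raise ValueError("count must be at least 1")
    return f"instance_group [ {{ count: {count} }} ]"


def perf_analyzer_command(model_name: str, max_concurrency: int = 4) -> List[str]:
    """Build a perf_analyzer invocation that sweeps request concurrency 1..max."""
    return [
        "perf_analyzer",
        "-m", model_name,
        "--concurrency-range", f"1:{max_concurrency}",
    ]


if __name__ == "__main__":
    # "my_model" is a placeholder; start at 2 instances and go higher
    # only for models with a small computational footprint.
    for count in (2, 3, 4):
        print(instance_group_stanza(count))
        print(" ".join(perf_analyzer_command("my_model")))
```

Comparing the throughput reported for each count on the same concurrency sweep shows whether going beyond two instances pays off for a given model.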

The Insight (Rule of Thumb)

  • Action: Set instance_group [ { count: 2 } ] in config.pbtxt as the starting point.
  • Value: 2 instances is the recommended starting value for most models.
  • For small models: Try 3-4 instances and benchmark. Small models that complete quickly leave the GPU idle between batches, so more instances can help fill the gaps.
  • For large models: 2 instances is typically optimal and can even be too many when GPU memory is constrained. Use 1 instance if GPU memory is tight.
  • Trade-off: Each additional instance consumes GPU memory proportional to the model size. More instances can improve throughput, but they reduce the memory available for batching.
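
The memory trade-off in the list above can be checked with simple arithmetic before editing config.pbtxt. The sketch below assumes each instance costs roughly one model-sized chunk of GPU memory, which is a simplification: real usage depends on the backend, and the function names and all sizes are hypothetical.

```python
def max_instances_that_fit(model_mem_mib: int, gpu_mem_mib: int,
                           reserved_mib: int = 0) -> int:
    """Largest instance count whose combined footprint fits on the GPU.

    Assumes (simplification) that every instance needs about model_mem_mib.
    reserved_mib is memory held back for batching buffers and other models.
    """
    if model_mem_mib <= 0:
        raise ValueError("model_mem_mib must be positive")
    usable = gpu_mem_mib - reserved_mib
    return max(usable // model_mem_mib, 0)


def suggested_start(model_mem_mib: int, gpu_mem_mib: int,
                    reserved_mib: int = 0) -> int:
    """Start at 2 instances, or fewer when memory does not allow it."""
    return min(2, max_instances_that_fit(model_mem_mib, gpu_mem_mib, reserved_mib))


if __name__ == "__main__":
    # Hypothetical sizes: a 16000 MiB GPU with 2000 MiB held back for batching.
    for model_mib in (500, 6000, 10000):
        print(model_mib,
              max_instances_that_fit(model_mib, 16000, 2000),
              suggested_start(model_mib, 16000, 2000))
```

A result of 0 from suggested_start means that, under these assumptions, not even one instance fits in the remaining memory. A large upper bound for a small model only says that memory is not the limit; whether 3-4 instances actually help still has to be measured.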

Reasoning

GPU inference involves three phases: (1) copy input data to the GPU, (2) compute, (3) copy output data from the GPU. With a single instance, the GPU's compute units sit idle during phases 1 and 3. With two instances, one instance computes while the other transfers data, which keeps the GPU busy.
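
The overlap argument can be made concrete with a toy throughput model. This is a rough sketch under stated assumptions, not a description of Triton's scheduler: compute is treated as a single serialized resource, transfers from one instance overlap freely with compute from another, and contention between concurrent copies is ignored. All timings are hypothetical.

```python
def throughput(n_instances: int, t_h2d: float, t_compute: float,
               t_d2h: float) -> float:
    """Requests per second under the simplified overlap model.

    Each instance handles one request at a time (copy in, compute, copy out),
    so n instances sustain at most n / per_request. The GPU can run only one
    compute phase at a time, so throughput is also capped at 1 / t_compute.
    """
    per_request = t_h2d + t_compute + t_d2h
    instance_bound = n_instances / per_request
    compute_bound = 1.0 / t_compute
    return min(instance_bound, compute_bound)


def compute_utilization(n_instances: int, t_h2d: float, t_compute: float,
                        t_d2h: float) -> float:
    """Fraction of time the GPU spends in the compute phase."""
    return throughput(n_instances, t_h2d, t_compute, t_d2h) * t_compute


if __name__ == "__main__":
    # Hypothetical timings in seconds: 1 ms copy in, 4 ms compute, 1 ms copy out.
    for n in (1, 2, 3):
        rate = throughput(n, 0.001, 0.004, 0.001)
        util = compute_utilization(n, 0.001, 0.004, 0.001)
        print(f"{n} instance(s): {rate:.1f} req/s, compute busy {util:.0%}")
```

With these timings the model saturates at two instances, and a third adds nothing. When the compute phase is short relative to the copies, the per-instance bound dominates for longer, which matches the observation that small-compute models can benefit from more than two instances.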

Empirical evidence from docs/user_guide/optimization.md:147-159:

Two instances improve performance because they allow overlap of memory transfer operations (CPU to/from GPU) with inference compute. Multiple instances also improve GPU utilization by allowing more inference work to be executed simultaneously.

From FAQ (docs/user_guide/faq.md:107-135): Batching is the most beneficial optimization for GPU utilization. Two instances are useful for most models; additional instances rarely help except for small-compute models.

Combined optimization from docs/user_guide/optimization.md: Dynamic batching plus 2 instances yields ~289.6 infer/sec, compared to ~272 infer/sec with dynamic batching alone (roughly a 6.5% improvement).
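
The percentage follows directly from the two throughput figures quoted above. A short check (the function name is illustrative):

```python
def relative_improvement(baseline: float, improved: float) -> float:
    """Percentage gain of `improved` over `baseline`."""
    return (improved - baseline) / baseline * 100.0


if __name__ == "__main__":
    # Figures quoted from the optimization guide: batching alone vs. batching + 2 instances.
    gain = relative_improvement(272.0, 289.6)
    print(f"{gain:.1f}%")  # prints "6.5%"
```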
