Implementation:Triton inference server Server Optimal Config Application

Field	Value
Page Type	Implementation
Title	Optimal_Config_Application
Namespace	Triton_inference_server_Server
Domains	Performance, Model_Serving, Configuration
External Dependencies	None (uses standard filesystem operations and Triton model repository conventions)
Last Updated	2026-02-13 17:00 GMT

Overview

This page gives the concrete procedure for updating a model's config.pbtxt with the optimal serving parameters found by Model Analyzer. It covers both the automated approach (copying the best configuration from Model Analyzer output) and the manual approach (editing config.pbtxt directly with tuned parameters).

Description

After Model Analyzer's analyze step identifies the top-ranked configuration, the optimal config.pbtxt must be applied to the production model repository. The Model Analyzer stores each profiled configuration variant in the output model repository, making deployment a simple file copy operation.
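A plain copy has one catch. Triton requires any `name` set in config.pbtxt to match the name of the model directory that contains it. Variant configs in the Model Analyzer output repository may carry the variant's own name (for example `densenet_onnx_config_3`), so a copied file can fail to load under `densenet_onnx/`. The sketch below copies the file and rewrites the top-level name. It assumes the top-level `name:` line starts at column 0, as in the template on this page. `apply_config` is a hypothetical helper, not part of Model Analyzer or Triton.

```bash
# Hypothetical helper: copy a Model Analyzer config variant into the production
# repository and rewrite its top-level name to match the target model directory.
# Assumes the top-level name line starts at column 0 (input/output names are
# indented, so they are left alone) and that the model name contains no '/' or '&'.
apply_config() {
  local src="$1"     # e.g. ./results/densenet_onnx_config_3/config.pbtxt
  local repo="$2"    # production model repository, e.g. /models
  local model="$3"   # target model directory name, e.g. densenet_onnx
  local dst="$repo/$model/config.pbtxt"

  if [ ! -f "$src" ]; then
    echo "error: $src not found" >&2
    return 1
  fi
  if [ ! -d "$repo/$model" ]; then
    echo "error: model directory $repo/$model not found" >&2
    return 1
  fi

  # Write through a temp file so the same command works with GNU and BSD sed.
  sed "s/^name[[:space:]]*:.*/name: \"$model\"/" "$src" > "$dst.tmp" &&
    mv "$dst.tmp" "$dst"
}
```

Usage: `apply_config ./results/densenet_onnx_config_3/config.pbtxt /models densenet_onnx`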

For manual optimization, the relevant configuration blocks (instance_group, dynamic_batching, max_batch_size, optimization) are edited directly in the model's config.pbtxt.

Key parameters to tune:

max_batch_size (int) -- Maximum batch size the server will form for this model
dynamic_batching.preferred_batch_size (list[int]) -- Preferred batch sizes the dynamic batcher will try to form
dynamic_batching.max_queue_delay_microseconds (int) -- Maximum time in microseconds to delay a request while waiting for a preferred batch size
instance_group[].count (int) -- Number of model instances per device; with KIND_GPU, this many instances are placed on each GPU listed in gpus
instance_group[].kind (KIND_GPU or KIND_CPU) -- Device type for model instances
optimization.execution_accelerators (gpu_execution_accelerator / cpu_execution_accelerator) -- Backend-specific inference acceleration, e.g. TensorRT on GPU or OpenVINO on CPU
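You can check which of these parameters a given config.pbtxt currently sets (for example, before and after copying a Model Analyzer variant) by grepping for the field names. This is a sketch that assumes fields are written one per line, as in the template on this page. `show_tuning_params` is a hypothetical helper.

```bash
# Hypothetical helper: print, with line numbers, the lines of a config.pbtxt
# that set the tuning parameters listed above.
show_tuning_params() {
  local cfg="$1"
  grep -nE '^[[:space:]]*(max_batch_size|preferred_batch_size|max_queue_delay_microseconds|count|kind|gpus|gpu_execution_accelerator|cpu_execution_accelerator)[[:space:]]*:' "$cfg"
}
```

Usage: `show_tuning_params /models/densenet_onnx/config.pbtxt`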

Usage

CLI Signature (Automated)

# Copy the optimal configuration from Model Analyzer results
cp ./results/<optimal_config>/config.pbtxt <model-repository>/<model-name>/config.pbtxt

# Reload the model on a running Triton server (requires --model-control-mode=explicit)
curl -X POST "http://localhost:8000/v2/repository/models/<model-name>/load"
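Without `-f`, curl exits with status 0 even when Triton rejects the load (for example, because of a config parse error), so the command above can fail silently. The sketch below checks the HTTP status and prints Triton's error body on failure. `reload_model` is a hypothetical helper, and the default URL `http://localhost:8000` is an assumption.

```bash
# Hypothetical helper: ask a running Triton server (explicit model control)
# to load a model and fail loudly unless it answers HTTP 200.
reload_model() {
  local model="$1"
  local url="${2:-http://localhost:8000}"   # assumed HTTP endpoint
  local body code
  body=$(mktemp) || return 1
  # Save the response body to a temp file and capture only the status code.
  code=$(curl -s -o "$body" -w '%{http_code}' -X POST "$url/v2/repository/models/$model/load")
  if [ "$code" != "200" ]; then
    # On failure Triton returns an error message in the body.
    echo "load of $model failed (HTTP ${code:-none}): $(cat "$body")" >&2
    rm -f "$body"
    return 1
  fi
  rm -f "$body"
}
```

Usage: `reload_model densenet_onnx || exit 1`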

Key Parameters

Parameter	Location in config.pbtxt	Type	Description
`max_batch_size`	Top-level	int	Maximum batch size for the model (0 disables batching)
`preferred_batch_size`	`dynamic_batching`	list[int]	Preferred batch sizes for dynamic batcher to form
`max_queue_delay_microseconds`	`dynamic_batching`	int	Maximum queue delay in microseconds
`count`	`instance_group[]`	int	Number of model instances per device
`kind`	`instance_group[]`	enum	Device type: KIND_GPU or KIND_CPU
`execution_accelerators`	`optimization`	block	TensorRT, OpenVINO, or other accelerator config

Code Reference

Source Location

docs/user_guide/performance_tuning.md:L370-386 -- Applying optimal configuration from Model Analyzer
docs/user_guide/model_configuration.md:L545-681 -- instance_group configuration reference
docs/user_guide/batcher.md:L32-151 -- dynamic_batching configuration reference
docs/user_guide/optimization.md:L91-295 -- Optimization and execution accelerators reference

Configuration Template

name: "model_name"
platform: "onnxruntime_onnx"
max_batch_size: 8

input [
  {
    name: "input"
    data_type: TYPE_FP32
    dims: [ 3, 224, 224 ]
  }
]

output [
  {
    name: "output"
    data_type: TYPE_FP32
    dims: [ 1000 ]
  }
]

instance_group [
  {
    count: 2
    kind: KIND_GPU
    gpus: [ 0 ]
  }
]

dynamic_batching {
  preferred_batch_size: [ 4, 8 ]
  max_queue_delay_microseconds: 100
}

optimization {
  execution_accelerators {
    gpu_execution_accelerator : [ {
      name : "tensorrt"
      parameters { key: "precision_mode" value: "FP16" }
      parameters { key: "max_workspace_size_bytes" value: "1073741824" }
    }]
  }
}
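Triton validates dynamic batching settings when it loads the model: every preferred_batch_size must be positive and no larger than max_batch_size. Checking this before a reload saves a round trip. The sketch below only understands the single-line forms used in this template. `check_preferred_sizes` is a hypothetical helper and does not replace Triton's own validation.

```bash
# Hypothetical pre-check: every preferred_batch_size must be > 0 and
# <= max_batch_size. Handles only the forms used on this page:
#   max_batch_size: N                    (top level, column 0)
#   preferred_batch_size: [ a, b, ... ]  (on one line)
check_preferred_sizes() {
  local cfg="$1" mbs sizes s
  mbs=$(sed -n 's/^max_batch_size[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p' "$cfg")
  mbs=${mbs:-0}   # treat a missing max_batch_size as 0 (batching disabled)
  # Extract the bracketed list and turn commas into spaces for word splitting.
  sizes=$(sed -n 's/.*preferred_batch_size[[:space:]]*:[[:space:]]*\[\(.*\)\].*/\1/p' "$cfg" | tr ',' ' ')
  for s in $sizes; do
    if [ "$s" -le 0 ] || [ "$s" -gt "$mbs" ]; then
      echo "preferred_batch_size $s is invalid for max_batch_size $mbs in $cfg" >&2
      return 1
    fi
  done
  return 0
}
```

Usage: `check_preferred_sizes ./results/densenet_onnx_config_3/config.pbtxt && echo ok`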

I/O Contract

Inputs

Input	Type	Required	Description
Optimal config from Model Analyzer	File (config.pbtxt)	Yes (automated)	The top-ranked configuration file from the Model Analyzer output repository
Model repository path	Directory path	Yes	Path to the production Triton model repository
Model name	String	Yes	Name of the model whose configuration is being updated
Tuning parameters	Various	Yes (manual)	Specific values for instance_group, dynamic_batching, max_batch_size, optimization blocks

Outputs

Output	Type	Description
Updated config.pbtxt	File	The model's configuration file with optimized serving parameters applied
Model reload confirmation	HTTP response	Confirmation that the model was successfully reloaded with the new configuration (when using explicit model control)
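You can also confirm the reload by polling model readiness, which helps when the server picks up repository changes itself rather than through an explicit load call. This is a sketch that assumes the HTTP endpoint on port 8000 and the KServe v2 readiness route `GET /v2/models/<model>/ready`. `wait_until_ready` is a hypothetical helper. In poll mode, the previous version may keep reporting ready until the reload completes, so this check confirms availability rather than which configuration is live.

```bash
# Hypothetical helper: poll the readiness endpoint once per second until it
# returns HTTP 200 or the timeout (in seconds) expires.
wait_until_ready() {
  local model="$1"
  local url="${2:-http://localhost:8000}"   # assumed HTTP endpoint
  local timeout_s="${3:-60}"
  local waited=0 code=""
  while [ "$waited" -lt "$timeout_s" ]; do
    code=$(curl -s -o /dev/null -w '%{http_code}' "$url/v2/models/$model/ready")
    if [ "$code" = "200" ]; then
      return 0
    fi
    sleep 1
    waited=$((waited + 1))
  done
  echo "$model not ready after ${timeout_s}s (last HTTP status: ${code:-none})" >&2
  return 1
}
```

Usage: `wait_until_ready densenet_onnx http://localhost:8000 120`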

Usage Examples

Example 1: Apply optimal config from Model Analyzer

Copy the top-ranked configuration from Model Analyzer results:

# List available configurations in the output repository
ls ./results/

# Copy the best configuration (identified from analyze output)
cp ./results/densenet_onnx_config_3/config.pbtxt \
   /models/densenet_onnx/config.pbtxt

# Reload the model on a running server
curl -X POST "http://localhost:8000/v2/repository/models/densenet_onnx/load"
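The copy above overwrites the live config.pbtxt without keeping a copy. A slightly safer variant backs up the current file and restores it if Triton rejects the load request. The sketch below keeps the same explicit-model-control assumption. `deploy_with_rollback` is a hypothetical helper that relies on `curl -f` treating HTTP 4xx/5xx responses as failures.

```bash
# Hypothetical helper: back up the live config, copy in the candidate,
# request a load, and restore the backup if the load request fails.
deploy_with_rollback() {
  local src="$1" dst="$2" model="$3"
  local url="${4:-http://localhost:8000}"   # assumed HTTP endpoint
  cp "$dst" "$dst.bak" || return 1
  cp "$src" "$dst" || return 1
  # -f: non-zero exit on HTTP errors (e.g. Triton rejecting the new config).
  if curl -sf -o /dev/null -X POST "$url/v2/repository/models/$model/load"; then
    return 0   # keep $dst.bak as a manual rollback point
  fi
  echo "load of $model failed; restoring previous config.pbtxt" >&2
  # Put the previous file back so the repository matches the last good config.
  mv "$dst.bak" "$dst"
  return 1
}
```

Usage: `deploy_with_rollback ./results/densenet_onnx_config_3/config.pbtxt /models/densenet_onnx/config.pbtxt densenet_onnx`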

Example 2: Manual config optimization -- enable dynamic batching

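Edit config.pbtxt to enable dynamic batching. Each preferred batch size must be no larger than max_batch_size. If the file already sets max_batch_size, change that line instead of adding a second one: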

# Add to config.pbtxt
max_batch_size: 8

dynamic_batching {
  preferred_batch_size: [ 4, 8 ]
  max_queue_delay_microseconds: 100
}
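If you script this edit, make it idempotent, because the protobuf text format rejects a second dynamic_batching block as a duplicate field. The sketch below appends the block from this example only if none exists. `add_dynamic_batching` is a hypothetical helper that assumes an existing block starts at column 0. It does not touch max_batch_size.

```bash
# Hypothetical helper: append the Example 2 dynamic_batching block unless the
# file already has one (a second block would fail to parse).
add_dynamic_batching() {
  local cfg="$1"
  if grep -q '^dynamic_batching' "$cfg"; then
    echo "dynamic_batching already present in $cfg; edit it in place" >&2
    return 0
  fi
  cat >> "$cfg" <<'EOF'

dynamic_batching {
  preferred_batch_size: [ 4, 8 ]
  max_queue_delay_microseconds: 100
}
EOF
}
```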

Example 3: Manual config optimization -- increase instance count

Edit config.pbtxt to increase GPU instances:

# Update instance_group in config.pbtxt
instance_group [
  {
    count: 3
    kind: KIND_GPU
    gpus: [ 0 ]
  }
]
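To change the instance count from a script, an awk pass that touches only the count lines inside the instance_group block avoids rewriting unrelated fields. This is a sketch that assumes the layout shown above, with `instance_group [` and its closing `]` at column 0. `set_instance_count` is a hypothetical helper.

```bash
# Hypothetical helper: set every count inside the top-level instance_group [...]
# block to a new value and leave all other lines unchanged.
set_instance_count() {
  local cfg="$1" n="$2"
  awk -v n="$n" '
    /^instance_group[ \t]*\[/ { in_group = 1 }
    in_group && /^[ \t]*count[ \t]*:/ { sub(/count[ \t]*:[ \t]*[0-9]+/, "count: " n) }
    in_group && /^\]/ { in_group = 0 }
    { print }
  ' "$cfg" > "$cfg.tmp" && mv "$cfg.tmp" "$cfg"
}
```

Usage: `set_instance_count /models/densenet_onnx/config.pbtxt 3`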

Example 4: Manual config optimization -- enable TensorRT acceleration

Add the TensorRT execution accelerator (FP16 precision, 1 GiB workspace) for an ONNX Runtime model:

# Add optimization block to config.pbtxt
optimization {
  execution_accelerators {
    gpu_execution_accelerator : [ {
      name : "tensorrt"
      parameters { key: "precision_mode" value: "FP16" }
      parameters { key: "max_workspace_size_bytes" value: "1073741824" }
    }]
  }
}
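After reloading, you can read back the configuration Triton actually serves through the model configuration endpoint `GET /v2/models/<model>/config`, which returns it as JSON. The sketch below checks that response for a pattern, for example to confirm that the tensorrt accelerator was picked up. `served_config_has` is a hypothetical helper, and the HTTP endpoint on port 8000 is an assumption.

```bash
# Hypothetical helper: fetch the served model configuration (JSON) and test
# whether it contains a pattern.
served_config_has() {
  local model="$1" pattern="$2"
  local url="${3:-http://localhost:8000}"   # assumed HTTP endpoint
  # -f: an HTTP error produces no output, so grep fails as well.
  curl -sf "$url/v2/models/$model/config" | grep -q "$pattern"
}
```

Usage: `served_config_has densenet_onnx tensorrt && echo "TensorRT accelerator configured"`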

Related Pages

Implements: Principle:Triton_inference_server_Server_Config_Optimization
Heuristic:Triton_inference_server_Server_Dynamic_Batching_Tuning
Heuristic:Triton_inference_server_Server_Model_Instance_Scaling
