Implementation:Triton inference server Server Optimal Config Application
| Field | Value |
|---|---|
| Page Type | Implementation |
| Title | Optimal_Config_Application |
| Namespace | Triton_inference_server_Server |
| Domains | Performance, Model_Serving, Configuration |
| External Dependencies | None (uses standard filesystem operations and Triton model repository conventions) |
| Last Updated | 2026-02-13 17:00 GMT |
Overview
This page describes the concrete procedure for updating config.pbtxt with the optimal serving parameters found by Model Analyzer. It covers both the automated approach (copying the best configuration from the Model Analyzer output) and the manual approach (editing config.pbtxt directly with tuned parameters).
Description
After Model Analyzer's analyze step identifies the top-ranked configuration, the optimal config.pbtxt must be applied to the production model repository. The Model Analyzer stores each profiled configuration variant in the output model repository, making deployment a simple file copy operation.
For manual optimization, the relevant configuration blocks (instance_group, dynamic_batching, max_batch_size, optimization) are edited directly in the model's config.pbtxt.
Key parameters to tune:
- max_batch_size (int) -- Maximum batch size the server will form for this model
- dynamic_batching.preferred_batch_size (list[int]) -- Preferred batch sizes the dynamic batcher will try to form
- dynamic_batching.max_queue_delay_microseconds (int) -- Maximum time in microseconds to delay a request while waiting for a preferred batch size
- instance_group[].count (int) -- Number of model instances to create
- instance_group[].kind (KIND_GPU or KIND_CPU) -- Device type for model instances
- optimization.execution_accelerators (tensorrt, openvino) -- Framework-specific inference acceleration
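Before writing tuned values into config.pbtxt, it can help to check that they are mutually consistent. The following is a minimal sketch, not part of Triton or Model Analyzer. The function name `check_batching_params` is hypothetical, and the rules it applies (preferred batch sizes must be positive and no larger than max_batch_size) are stated here as assumptions about what the server accepts.

```python
def check_batching_params(max_batch_size, preferred_batch_sizes, max_queue_delay_us):
    """Return a list of problems; an empty list means the values look consistent.

    Hypothetical helper: the rules below are assumptions about valid
    dynamic batching settings, not an official validator.
    """
    problems = []
    if max_batch_size < 0:
        problems.append("max_batch_size must be >= 0")
    for size in preferred_batch_sizes:
        if size <= 0:
            problems.append(f"preferred_batch_size {size} must be positive")
        elif size > max_batch_size:
            problems.append(
                f"preferred_batch_size {size} exceeds max_batch_size {max_batch_size}"
            )
    if max_queue_delay_us < 0:
        problems.append("max_queue_delay_microseconds must be >= 0")
    return problems


if __name__ == "__main__":
    # Values taken from the configuration template on this page.
    print(check_batching_params(8, [4, 8], 100))
```

An empty list means no problem was found by these checks; the server remains the final authority when it loads the model.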
Usage
CLI Signature (Automated)
# Copy the optimal configuration from Model Analyzer results
cp ./results/<optimal_config>/config.pbtxt <model-repository>/<model-name>/config.pbtxt
# Reload the model on a running Triton server (requires the server to run with --model-control-mode=explicit)
curl -X POST "http://localhost:8000/v2/repository/models/<model-name>/load"
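The copy step above overwrites the production config.pbtxt in place. A minimal sketch of the same step with a backup is shown below, using only the Python standard library. The function name `apply_config` and the `config.pbtxt.bak` backup filename are illustrative assumptions, not Model Analyzer conventions.

```python
import shutil
from pathlib import Path


def apply_config(optimal_config, model_repository, model_name):
    """Copy optimal_config over the model's config.pbtxt, keeping a backup.

    Returns the backup path, or None if there was no previous config.
    Hypothetical helper; the .bak naming is an assumption.
    """
    model_dir = Path(model_repository) / model_name
    if not model_dir.is_dir():
        raise FileNotFoundError(f"model directory not found: {model_dir}")

    target = model_dir / "config.pbtxt"
    backup = None
    if target.exists():
        # Preserve the previous configuration so the change can be rolled back.
        backup = model_dir / "config.pbtxt.bak"
        shutil.copy2(target, backup)

    shutil.copy2(optimal_config, target)
    return backup
```

Rolling back is then a matter of copying the backup file over config.pbtxt and reloading the model.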
Key Parameters
| Parameter | Location in config.pbtxt | Type | Description |
|---|---|---|---|
| max_batch_size | Top-level | int | Maximum batch size for the model (0 disables batching) |
| preferred_batch_size | dynamic_batching | list[int] | Preferred batch sizes for the dynamic batcher to form |
| max_queue_delay_microseconds | dynamic_batching | int | Maximum queue delay in microseconds |
| count | instance_group[] | int | Number of model instances per device |
| kind | instance_group[] | enum | Device type: KIND_GPU or KIND_CPU |
| execution_accelerators | optimization | block | TensorRT, OpenVINO, or other accelerator config |
Code Reference
Source Location
- docs/user_guide/performance_tuning.md:L370-386 -- Applying optimal configuration from Model Analyzer
- docs/user_guide/model_configuration.md:L545-681 -- instance_group configuration reference
- docs/user_guide/batcher.md:L32-151 -- dynamic_batching configuration reference
- docs/user_guide/optimization.md:L91-295 -- Optimization and execution accelerators reference
Configuration Template
name: "model_name"
platform: "onnxruntime_onnx"
max_batch_size: 8
input [
{
name: "input"
data_type: TYPE_FP32
dims: [ 3, 224, 224 ]
}
]
output [
{
name: "output"
data_type: TYPE_FP32
dims: [ 1000 ]
}
]
instance_group [
{
count: 2
kind: KIND_GPU
gpus: [ 0 ]
}
]
dynamic_batching {
preferred_batch_size: [ 4, 8 ]
max_queue_delay_microseconds: 100
}
optimization {
execution_accelerators {
gpu_execution_accelerator : [ {
name : "tensorrt"
parameters { key: "precision_mode" value: "FP16" }
parameters { key: "max_workspace_size_bytes" value: "1073741824" }
}]
}
}
I/O Contract
Inputs
| Input | Type | Required | Description |
|---|---|---|---|
| Optimal config from Model Analyzer | File (config.pbtxt) | Yes (automated) | The top-ranked configuration file from the Model Analyzer output repository |
| Model repository path | Directory path | Yes | Path to the production Triton model repository |
| Model name | String | Yes | Name of the model whose configuration is being updated |
| Tuning parameters | Various | Yes (manual) | Specific values for instance_group, dynamic_batching, max_batch_size, optimization blocks |
Outputs
| Output | Type | Description |
|---|---|---|
| Updated config.pbtxt | File | The model's configuration file with optimized serving parameters applied |
| Model reload confirmation | HTTP response | Confirmation that the model was successfully reloaded with the new configuration (when using explicit model control) |
Usage Examples
Example 1: Apply optimal config from Model Analyzer
Copy the top-ranked configuration from Model Analyzer results:
# List available configurations in the output repository
ls ./results/
# Copy the best configuration (identified from analyze output)
cp ./results/densenet_onnx_config_3/config.pbtxt \
/models/densenet_onnx/config.pbtxt
# Reload the model on a running server
curl -X POST "http://localhost:8000/v2/repository/models/densenet_onnx/load"
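The reload request can also be issued from a script and followed by a readiness check. The sketch below uses only the Python standard library. It assumes the server runs with explicit model control and listens on the HTTP port used in the example above. The helper names `load_url`, `ready_url`, and `reload_and_check` are hypothetical.

```python
import urllib.request


def load_url(base_url, model_name):
    """URL of the repository load endpoint used by the curl command above."""
    return f"{base_url.rstrip('/')}/v2/repository/models/{model_name}/load"


def ready_url(base_url, model_name):
    """URL of the per-model readiness endpoint."""
    return f"{base_url.rstrip('/')}/v2/models/{model_name}/ready"


def reload_and_check(base_url, model_name, timeout=30):
    """POST the load request, then query readiness.

    urlopen raises urllib.error.HTTPError on 4xx/5xx responses, so a
    failed load surfaces as an exception rather than a False result.
    """
    request = urllib.request.Request(
        load_url(base_url, model_name), data=b"", method="POST"
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        loaded = response.status == 200
    with urllib.request.urlopen(ready_url(base_url, model_name), timeout=timeout) as response:
        ready = response.status == 200
    return loaded and ready
```

For the example above, the call would be `reload_and_check("http://localhost:8000", "densenet_onnx")`.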
Example 2: Manual config optimization -- enable dynamic batching
Edit config.pbtxt to add dynamic batching:
# Add to config.pbtxt
max_batch_size: 8
dynamic_batching {
preferred_batch_size: [ 4, 8 ]
max_queue_delay_microseconds: 100
}
Example 3: Manual config optimization -- increase instance count
Edit config.pbtxt to increase GPU instances:
# Update instance_group in config.pbtxt
instance_group [
{
count: 3
kind: KIND_GPU
gpus: [ 0 ]
}
]
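Because count applies per device, the total number of instances depends on how many GPUs each group targets. The sketch below estimates that total. It assumes instance groups are represented as plain dictionaries (an illustrative representation, not a Triton API), that a KIND_GPU group without a gpus list is placed on every available GPU, and that any other kind contributes count instances.

```python
def total_instances(instance_groups, available_gpus):
    """Estimate how many model instances a set of instance groups creates.

    instance_groups: list of dicts with "count", "kind", and optional "gpus".
    available_gpus: number of GPUs visible to the server (assumed known).
    """
    total = 0
    for group in instance_groups:
        if group["kind"] == "KIND_GPU":
            # With no explicit gpus list, assume placement on every available GPU.
            gpus = group.get("gpus") or list(range(available_gpus))
            total += group["count"] * len(gpus)
        else:
            # Assumption: non-GPU kinds contribute exactly count instances.
            total += group["count"]
    return total


if __name__ == "__main__":
    # The instance_group from Example 3: three instances pinned to GPU 0.
    print(total_instances([{"count": 3, "kind": "KIND_GPU", "gpus": [0]}], 1))
```

Each additional instance holds its own copy of the model's runtime state, so the total is worth checking against available device memory before raising count.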
Example 4: Manual config optimization -- enable TensorRT acceleration
Add TensorRT execution accelerator for an ONNX model:
# Add optimization block to config.pbtxt
optimization {
execution_accelerators {
gpu_execution_accelerator : [ {
name : "tensorrt"
parameters { key: "precision_mode" value: "FP16" }
parameters { key: "max_workspace_size_bytes" value: "1073741824" }
}]
}
}
Related Pages
- Implements: Principle:Triton_inference_server_Server_Config_Optimization
- Heuristic:Triton_inference_server_Server_Dynamic_Batching_Tuning
- Heuristic:Triton_inference_server_Server_Model_Instance_Scaling