Implementation:Triton inference server Server Fill Template

Metadata

Field	Value
Type	Implementation
Workflow	LLM_Deployment_With_TRT_LLM
Repo	Triton_inference_server_Server
Source	docs/getting_started/llm.md:L150-260
Domains	MLOps, NLP, Model_Serving
Knowledge_Sources	TRT-LLM Docs (https://nvidia.github.io/TensorRT-LLM/), Triton Server repo (https://github.com/triton-inference-server/server)
External_dep	tensorrtllm_backend repo (https://github.com/triton-inference-server/tensorrtllm_backend)
implements	Principle:Triton_inference_server_Server_TRT_LLM_Model_Repository_Setup
2026-02-13 17:00 GMT

Overview

fill_template.py is the template-population utility from the tensorrtllm_backend repository, used here to fill in the config.pbtxt files of an LLM ensemble. This page covers the fill_template.py invocations needed to configure all four ensemble components.

Description

The fill_template.py script reads a skeleton config.pbtxt file, replaces its placeholders with deployment-specific values, and, when --in_place is given, writes the result back to the same file. It must be run once per model directory in the ensemble repository: preprocessing, tensorrt_llm, postprocessing, and ensemble.
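The substitution step described above can be illustrated with a few lines of Python. This is a minimal sketch, not the script's actual source: it assumes placeholders use the `${name}` syntax (which the verification grep on this page implies) and that substitutions arrive as a `key:value,key:value` string. The skeleton fragment and the `fill` helper are hypothetical.

```python
from string import Template

# Hypothetical skeleton fragment; real config.pbtxt files are much larger.
SKELETON = '''max_batch_size: ${triton_max_batch_size}
parameters {
  key: "tokenizer_dir"
  value: { string_value: "${tokenizer_dir}" }
}
'''


def fill(text: str, substitutions: str) -> str:
    """Replace ${key} placeholders using a 'key:value,key:value' string."""
    mapping = {}
    for pair in substitutions.split(","):
        # Split on the first colon only, so values may contain colons.
        key, _, value = pair.partition(":")
        mapping[key] = value
    # safe_substitute leaves unknown placeholders untouched instead of raising.
    return Template(text).safe_substitute(mapping)


filled = fill(
    SKELETON,
    "triton_max_batch_size:8,tokenizer_dir:/opt/Phi-3-mini-4k-instruct",
)
print(filled)
```

Because the sketch uses `safe_substitute`, a placeholder with no matching key stays in the output unchanged rather than causing an error, which is why a follow-up check for leftover `${` markers is useful.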

The script is part of the tensorrtllm_backend repository, which provides the skeleton model repository structure and the TRT-LLM backend plugin for Triton.

Usage

Run from the root of the tensorrtllm_backend repository after cloning it and building the TRT engine. Each component requires a separate invocation with component-specific parameters.

Code Reference

Source Location

Item	Value
File	docs/getting_started/llm.md
Lines	L150-260
Repo	https://github.com/triton-inference-server/server
Script	tools/fill_template.py (in tensorrtllm_backend repo)

Signature

# General form
python3 tools/fill_template.py \
    --in_place \
    all_models/inflight_batcher_llm/<component>/config.pbtxt \
    <key>:<value>,<key>:<value>,...

Import / Verification

# Check for unfilled template placeholders (a literal "${" left in a config)
grep -F '${' all_models/inflight_batcher_llm/*/config.pbtxt
# No output means every placeholder was filled; each match names a parameter that was not supplied
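The same check can be done in Python, which has the advantage of reporting the names of any leftover placeholders per file. This is a sketch under the assumption that placeholders have the form `${name}`; the helper names `unfilled_placeholders` and `scan_repo` are hypothetical and not part of tensorrtllm_backend.

```python
import re
from pathlib import Path

# Matches ${name} and captures the placeholder name.
PLACEHOLDER = re.compile(r"\$\{(\w+)\}")


def unfilled_placeholders(text: str) -> list[str]:
    """Return the sorted, de-duplicated placeholder names still present."""
    return sorted(set(PLACEHOLDER.findall(text)))


def scan_repo(root: str) -> dict[str, list[str]]:
    """Map each <component>/config.pbtxt under root to its leftover placeholders."""
    report = {}
    for path in sorted(Path(root).glob("*/config.pbtxt")):
        names = unfilled_placeholders(path.read_text())
        if names:
            report[str(path)] = names
    return report


if __name__ == "__main__":
    for path, names in scan_repo("all_models/inflight_batcher_llm").items():
        print(f"{path}: {', '.join(names)}")
```

Run it from the root of the tensorrtllm_backend checkout, the same working directory used for the fill_template.py commands.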

I/O Contract

Inputs

Name	Type	Description
Skeleton model repo	Directory	`all_models/inflight_batcher_llm/` from tensorrtllm_backend
Compiled TRT engine	Directory	Engine directory from trtllm-build (e.g., `./phi-engine`)
HuggingFace tokenizer	Directory	Tokenizer directory (e.g., `./Phi-3-mini-4k-instruct`)
`--in_place`	Flag	Modifies the config.pbtxt file in place

Outputs

Name	Type	Description
Configured preprocessing/config.pbtxt	Protobuf text	Tokenization configuration with tokenizer type and path
Configured tensorrt_llm/config.pbtxt	Protobuf text	Engine configuration with batching, KV cache, and GPU memory settings
Configured postprocessing/config.pbtxt	Protobuf text	Detokenization configuration with tokenizer path
Configured ensemble/config.pbtxt	Protobuf text	Ensemble pipeline coordination configuration

Usage Examples

Configure preprocessing model

python3 tools/fill_template.py \
    --in_place \
    all_models/inflight_batcher_llm/preprocessing/config.pbtxt \
    triton_max_batch_size:8,\
tokenizer_type:auto,\
tokenizer_dir:/opt/Phi-3-mini-4k-instruct

Configure TensorRT-LLM model

python3 tools/fill_template.py \
    --in_place \
    all_models/inflight_batcher_llm/tensorrt_llm/config.pbtxt \
    triton_max_batch_size:8,\
decoupled_mode:true,\
engine_dir:/opt/phi-engine,\
batching_strategy:inflight_fused_batching,\
kv_cache_free_gpu_mem_fraction:0.9

Configure postprocessing model

python3 tools/fill_template.py \
    --in_place \
    all_models/inflight_batcher_llm/postprocessing/config.pbtxt \
    triton_max_batch_size:8,\
tokenizer_type:auto,\
tokenizer_dir:/opt/Phi-3-mini-4k-instruct

Configure ensemble model

python3 tools/fill_template.py \
    --in_place \
    all_models/inflight_batcher_llm/ensemble/config.pbtxt \
    triton_max_batch_size:8
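The four invocations above differ only in the component name and the substitution list, so they can be driven from a single table. The following is a sketch of one way to do that, using only the paths and values shown on this page; the `COMPONENTS` table and the `build_command` and `fill_all` helpers are hypothetical. It defaults to a dry run that prints the commands instead of executing them.

```python
import subprocess

MODEL_ROOT = "all_models/inflight_batcher_llm"
TOKENIZER = {
    "tokenizer_type": "auto",
    "tokenizer_dir": "/opt/Phi-3-mini-4k-instruct",
}

# One entry per ensemble component, mirroring the shell examples.
COMPONENTS = {
    "preprocessing": {"triton_max_batch_size": 8, **TOKENIZER},
    "tensorrt_llm": {
        "triton_max_batch_size": 8,
        "decoupled_mode": "true",
        "engine_dir": "/opt/phi-engine",
        "batching_strategy": "inflight_fused_batching",
        "kv_cache_free_gpu_mem_fraction": 0.9,
    },
    "postprocessing": {"triton_max_batch_size": 8, **TOKENIZER},
    "ensemble": {"triton_max_batch_size": 8},
}


def build_command(component: str, params: dict) -> list[str]:
    """Build the fill_template.py argv for one component."""
    substitutions = ",".join(f"{key}:{value}" for key, value in params.items())
    return [
        "python3",
        "tools/fill_template.py",
        "--in_place",
        f"{MODEL_ROOT}/{component}/config.pbtxt",
        substitutions,
    ]


def fill_all(dry_run: bool = True) -> None:
    """Print (dry run) or execute the command for every component."""
    for component, params in COMPONENTS.items():
        cmd = build_command(component, params)
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=True stops at the first failing invocation.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    fill_all(dry_run=True)
```

Passing the command as an argument list avoids the shell line-continuation handling that the hand-written examples rely on.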

Key Parameters

Parameter	Component	Description	Example Value
`triton_max_batch_size`	All	Maximum batch size for Triton scheduling	`8`
`tokenizer_type`	Preprocessing, Postprocessing	Tokenizer implementation type	`auto`
`tokenizer_dir`	Preprocessing, Postprocessing	Path to HuggingFace tokenizer	`/opt/Phi-3-mini-4k-instruct`
`decoupled_mode`	TensorRT-LLM	Enable streaming output	`true`
`engine_dir`	TensorRT-LLM	Path to compiled engine directory	`/opt/phi-engine`
`batching_strategy`	TensorRT-LLM	Batching mode	`inflight_fused_batching`
`kv_cache_free_gpu_mem_fraction`	TensorRT-LLM	Fraction of free GPU memory for KV cache	`0.9`
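Before running fill_template.py it can help to sanity-check the values in the table above, since the script itself only substitutes text. The sketch below is illustrative: the `check_params` helper is hypothetical, and the bounds it enforces (batch size of at least 1, a memory fraction in the range (0, 1], directories that exist) are assumptions of this example rather than documented constraints.

```python
from pathlib import Path


def check_params(params: dict) -> list[str]:
    """Return a list of human-readable problems; empty means no issue found."""
    problems = []

    batch = params.get("triton_max_batch_size")
    if batch is not None and int(batch) < 1:
        problems.append(f"triton_max_batch_size must be >= 1, got {batch}")

    fraction = params.get("kv_cache_free_gpu_mem_fraction")
    if fraction is not None and not (0.0 < float(fraction) <= 1.0):
        problems.append(
            f"kv_cache_free_gpu_mem_fraction must be in (0, 1], got {fraction}"
        )

    # Both paths must point at existing directories on the serving host.
    for key in ("tokenizer_dir", "engine_dir"):
        path = params.get(key)
        if path is not None and not Path(path).is_dir():
            problems.append(f"{key} is not an existing directory: {path}")

    return problems


if __name__ == "__main__":
    example = {
        "triton_max_batch_size": 8,
        "engine_dir": "/opt/phi-engine",
        "kv_cache_free_gpu_mem_fraction": 0.9,
    }
    for problem in check_params(example):
        print(problem)
```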

Related Pages

Principle:Triton_inference_server_Server_TRT_LLM_Model_Repository_Setup
Implementation:Triton_inference_server_Server_TRT_LLM_Run — Prerequisite: engine validation
Implementation:Triton_inference_server_Server_Trtllm_Build — Provides the engine directory
Implementation:Triton_inference_server_Server_Launch_Triton_Server_Script — Next step: server launch
Environment:Triton_inference_server_Server_TRT_LLM_Deployment
