Overview
fill_template.py is the template-population utility in tensorrtllm_backend for configuring the config.pbtxt files of an LLM ensemble. This page covers the exact fill_template.py invocations needed to configure all four ensemble components.
Description
The fill_template.py script reads a skeleton config.pbtxt file, replaces its ${...} placeholders with deployment-specific parameters, and, when --in_place is given, writes the result back to the same file. It must be run separately for each of the four ensemble components (preprocessing, tensorrt_llm, postprocessing, ensemble).
The script is part of the tensorrtllm_backend repository, which provides the skeleton model repository structure and the TRT-LLM backend plugin for Triton.
Usage
Run from the root of the tensorrtllm_backend repository after cloning it and building the TRT engine. Each component requires a separate invocation with component-specific parameters.
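Because every invocation uses paths relative to the repository root, it can help to confirm the working directory first. The following is a minimal sketch; `check_repo_root` is a hypothetical helper (not part of tensorrtllm_backend) that only checks for the script and the four skeleton configs named on this page.

```shell
# Hypothetical helper: confirm a directory looks like the tensorrtllm_backend root.
# It checks only for the files this page refers to.
check_repo_root() {
    root="${1:-.}"
    if [ ! -f "$root/tools/fill_template.py" ]; then
        echo "missing tools/fill_template.py in $root" >&2
        return 1
    fi
    for component in preprocessing tensorrt_llm postprocessing ensemble; do
        cfg="$root/all_models/inflight_batcher_llm/$component/config.pbtxt"
        if [ ! -f "$cfg" ]; then
            echo "missing $cfg" >&2
            return 1
        fi
    done
    return 0
}

# Report success only when the current directory has the expected layout.
if check_repo_root . 2>/dev/null; then
    echo "repository layout looks complete"
fi
```

Calling `check_repo_root /path/to/tensorrtllm_backend` before the first fill_template.py run gives an early, readable error instead of a Python "file not found" traceback.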
Code Reference
Source Location
Signature
# General form
python3 tools/fill_template.py \
--in_place \
all_models/inflight_batcher_llm/<component>/config.pbtxt \
<key>:<value>,<key>:<value>,...
Import / Verification
# Verify configs were populated (no remaining template placeholders)
# -F treats '${' as a literal string rather than a regular expression
grep -F '${' all_models/inflight_batcher_llm/*/config.pbtxt
# Should return no results if all templates are filled
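When the check above does report matches, a list of the distinct placeholder names is usually easier to act on than raw matching lines. A minimal sketch, assuming placeholders have the form `${name}` with names made of letters, digits, and underscores; `list_placeholders` is a hypothetical helper, not part of the repository.

```shell
# Hypothetical helper: print each distinct ${name} token found in the given files.
# Prints nothing when no placeholders remain.
list_placeholders() {
    grep -hoE '\$\{[A-Za-z_][A-Za-z0-9_]*\}' "$@" 2>/dev/null | sort -u
}

# Only scan when run from the repository root.
if [ -d all_models/inflight_batcher_llm ]; then
    list_placeholders all_models/inflight_batcher_llm/*/config.pbtxt
fi
```

Each name printed corresponds to a key that was not supplied in the substitution string for some component.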
I/O Contract
Inputs
| Name | Type | Description |
|---|---|---|
| Skeleton model repo | Directory | all_models/inflight_batcher_llm/ from tensorrtllm_backend |
| Compiled TRT engine | Directory | Engine directory from trtllm-build (e.g., ./phi-engine) |
| HuggingFace tokenizer | Directory | Tokenizer directory (e.g., ./Phi-3-mini-4k-instruct) |
| --in_place | Flag | Modifies the config.pbtxt file in place |
Outputs
| Name | Type | Description |
|---|---|---|
| Configured preprocessing/config.pbtxt | Protobuf text | Tokenization configuration with tokenizer type and path |
| Configured tensorrt_llm/config.pbtxt | Protobuf text | Engine configuration with batching, KV cache, and GPU memory settings |
| Configured postprocessing/config.pbtxt | Protobuf text | Detokenization configuration with tokenizer path |
| Configured ensemble/config.pbtxt | Protobuf text | Ensemble pipeline coordination configuration |
Usage Examples
Configure preprocessing model
python3 tools/fill_template.py \
--in_place \
all_models/inflight_batcher_llm/preprocessing/config.pbtxt \
triton_max_batch_size:8,\
tokenizer_type:auto,\
tokenizer_dir:/opt/Phi-3-mini-4k-instruct
Configure TensorRT-LLM model
python3 tools/fill_template.py \
--in_place \
all_models/inflight_batcher_llm/tensorrt_llm/config.pbtxt \
triton_max_batch_size:8,\
decoupled_mode:true,\
engine_dir:/opt/phi-engine,\
batching_strategy:inflight_fused_batching,\
kv_cache_free_gpu_mem_fraction:0.9
Configure postprocessing model
python3 tools/fill_template.py \
--in_place \
all_models/inflight_batcher_llm/postprocessing/config.pbtxt \
triton_max_batch_size:8,\
tokenizer_type:auto,\
tokenizer_dir:/opt/Phi-3-mini-4k-instruct
Configure ensemble model
python3 tools/fill_template.py \
--in_place \
all_models/inflight_batcher_llm/ensemble/config.pbtxt \
triton_max_batch_size:8
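The four invocations above repeat the same batch size and tokenizer path, so the values can drift apart when edited by hand. The sketch below keeps them in one place; `params_for` and the variable names are hypothetical, and the values are copied from the examples on this page. Outside the repository root the loop only prints the commands it would run.

```shell
# Deployment values copied from the examples on this page; adjust as needed.
MAX_BATCH=8
TOKENIZER_DIR=/opt/Phi-3-mini-4k-instruct
ENGINE_DIR=/opt/phi-engine

# Hypothetical helper: print the substitution string for one component.
params_for() {
    case "$1" in
        preprocessing|postprocessing)
            echo "triton_max_batch_size:${MAX_BATCH},tokenizer_type:auto,tokenizer_dir:${TOKENIZER_DIR}" ;;
        tensorrt_llm)
            echo "triton_max_batch_size:${MAX_BATCH},decoupled_mode:true,engine_dir:${ENGINE_DIR},batching_strategy:inflight_fused_batching,kv_cache_free_gpu_mem_fraction:0.9" ;;
        ensemble)
            echo "triton_max_batch_size:${MAX_BATCH}" ;;
        *)
            echo "unknown component: $1" >&2
            return 1 ;;
    esac
}

for component in preprocessing tensorrt_llm postprocessing ensemble; do
    cfg="all_models/inflight_batcher_llm/${component}/config.pbtxt"
    if [ -f tools/fill_template.py ] && [ -f "$cfg" ]; then
        python3 tools/fill_template.py --in_place "$cfg" "$(params_for "$component")"
    else
        # Dry run outside the repository: show the command instead of running it.
        echo "python3 tools/fill_template.py --in_place $cfg $(params_for "$component")"
    fi
done
```

Keeping the values in variables means a change to the batch size or tokenizer directory is applied consistently to every component.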
Key Parameters
| Parameter | Component | Description | Example Value |
|---|---|---|---|
| triton_max_batch_size | All | Maximum batch size for Triton scheduling | 8 |
| tokenizer_type | Preprocessing, Postprocessing | Tokenizer implementation type | auto |
| tokenizer_dir | Preprocessing, Postprocessing | Path to HuggingFace tokenizer | /opt/Phi-3-mini-4k-instruct |
| decoupled_mode | TensorRT-LLM | Enable streaming output | true |
| engine_dir | TensorRT-LLM | Path to compiled engine directory | /opt/phi-engine |
| batching_strategy | TensorRT-LLM | Batching mode | inflight_fused_batching |
| kv_cache_free_gpu_mem_fraction | TensorRT-LLM | Fraction of free GPU memory for KV cache | 0.9 |