Implementation:Triton inference server Server Fill Template

From Leeroopedia

Metadata

Field | Value
Type | Implementation
Workflow | LLM_Deployment_With_TRT_LLM
Repo | Triton_inference_server_Server
Source | docs/getting_started/llm.md:L150-260
Domains | MLOps, NLP, Model_Serving
Knowledge_Sources | TRT-LLM Docs (https://nvidia.github.io/TensorRT-LLM/); Triton Server repo (https://github.com/triton-inference-server/server)
External_dep | tensorrtllm_backend repo (https://github.com/triton-inference-server/tensorrtllm_backend)
implements | Principle:Triton_inference_server_Server_TRT_LLM_Model_Repository_Setup
2026-02-13 17:00 GMT

Overview

fill_template.py is the template-population utility from the tensorrtllm_backend repository, used to fill in the config.pbtxt files of an LLM ensemble. This page covers the fill_template.py invocations needed to configure all four model directories: preprocessing, tensorrt_llm, postprocessing, and ensemble.

Description

The fill_template.py script reads a skeleton config.pbtxt file, replaces its ${...} placeholders with deployment-specific values, and, when --in_place is passed, writes the result back to the same file. It must be run once for each of the four model directories (preprocessing, tensorrt_llm, postprocessing, ensemble).

The script is part of the tensorrtllm_backend repository, which provides the skeleton model repository structure and the TRT-LLM backend plugin for Triton.
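
The read, substitute, write-back cycle described above can be sketched in a few lines of Python. This is a minimal illustration, not the source of fill_template.py: it assumes ${key}-style placeholders, uses the standard library's string.Template, and the function name fill_config and the two-line skeleton are hypothetical.

```python
import tempfile
from pathlib import Path
from string import Template


def fill_config(path: Path, substitutions: str, in_place: bool = False) -> str:
    """Substitute ${key} placeholders in a file from a 'key:value,key:value' string."""
    text = path.read_text()
    values = {}
    for pair in substitutions.split(","):
        key, value = pair.split(":", 1)  # split on the first colon only
        if "${" + key + "}" not in text:
            raise KeyError(f"placeholder '{key}' not found in {path}")
        values[key] = value
    # safe_substitute leaves placeholders without a supplied value untouched
    filled = Template(text).safe_substitute(values)
    if in_place:
        path.write_text(filled)
    return filled


# Hypothetical two-line skeleton, not the real tensorrtllm_backend template
skeleton = (
    "max_batch_size: ${triton_max_batch_size}\n"
    'string_value: "${tokenizer_dir}"\n'
)

with tempfile.TemporaryDirectory() as tmp:
    cfg = Path(tmp) / "config.pbtxt"
    cfg.write_text(skeleton)
    result = fill_config(
        cfg,
        "triton_max_batch_size:8,tokenizer_dir:/opt/Phi-3-mini-4k-instruct",
        in_place=True,
    )
    on_disk = cfg.read_text()

print(result)
```

Rejecting keys that do not appear in the file catches typos in parameter names early, while leaving unsupplied placeholders untouched allows a config to be filled over several passes.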

Usage

Run from the root of the tensorrtllm_backend repository after cloning it and building the TRT engine. Each component requires a separate invocation with component-specific parameters.

Code Reference

Source Location

Item | Value
File | docs/getting_started/llm.md
Lines | L150-260
Repo | https://github.com/triton-inference-server/server
Script | tools/fill_template.py (in tensorrtllm_backend repo)

Signature

# General form
python3 tools/fill_template.py \
    --in_place \
    all_models/inflight_batcher_llm/<component>/config.pbtxt \
    <key>:<value>,<key>:<value>,...

Import / Verification

# Verify configs were populated (no remaining template placeholders)
grep -r '${' all_models/inflight_batcher_llm/*/config.pbtxt
# Should return no results if all templates are filled
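
The same check can be done programmatically, for example inside a deployment script. The sketch below assumes placeholders have the form ${name}; the helper unfilled_placeholders and the sample fragments are hypothetical and only illustrate the idea.

```python
import re
from typing import List

# Matches ${name} where name is a Python-style identifier
PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def unfilled_placeholders(config_text: str) -> List[str]:
    """Return the names of placeholders still present, in order, without repeats."""
    seen: List[str] = []
    for name in PLACEHOLDER.findall(config_text):
        if name not in seen:
            seen.append(name)
    return seen


# Hypothetical config fragments for illustration
filled = "max_batch_size: 8\n"
partial = (
    "max_batch_size: 8\n"
    'string_value: "${engine_dir}"\n'
    'string_value: "${engine_dir}"\n'
)

print(unfilled_placeholders(filled))   # []
print(unfilled_placeholders(partial))  # ['engine_dir']
```

To apply it to real files, read each config.pbtxt under all_models/inflight_batcher_llm/ and pass its text to the helper; a non-empty result names the parameters that still need values.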

I/O Contract

Inputs

Name | Type | Description
Skeleton model repo | Directory | all_models/inflight_batcher_llm/ from tensorrtllm_backend
Compiled TRT engine | Directory | Engine directory from trtllm-build (e.g., ./phi-engine)
HuggingFace tokenizer | Directory | Tokenizer directory (e.g., ./Phi-3-mini-4k-instruct)
--in_place | Flag | Modifies the config.pbtxt file in place

Outputs

Name | Type | Description
Configured preprocessing/config.pbtxt | Protobuf text | Tokenization configuration with tokenizer type and path
Configured tensorrt_llm/config.pbtxt | Protobuf text | Engine configuration with batching, KV cache, and GPU memory settings
Configured postprocessing/config.pbtxt | Protobuf text | Detokenization configuration with tokenizer path
Configured ensemble/config.pbtxt | Protobuf text | Ensemble pipeline coordination configuration

Usage Examples

Configure preprocessing model

python3 tools/fill_template.py \
    --in_place \
    all_models/inflight_batcher_llm/preprocessing/config.pbtxt \
    triton_max_batch_size:8,\
tokenizer_type:auto,\
tokenizer_dir:/opt/Phi-3-mini-4k-instruct

Configure TensorRT-LLM model

python3 tools/fill_template.py \
    --in_place \
    all_models/inflight_batcher_llm/tensorrt_llm/config.pbtxt \
    triton_max_batch_size:8,\
decoupled_mode:true,\
engine_dir:/opt/phi-engine,\
batching_strategy:inflight_fused_batching,\
kv_cache_free_gpu_mem_fraction:0.9

Configure postprocessing model

python3 tools/fill_template.py \
    --in_place \
    all_models/inflight_batcher_llm/postprocessing/config.pbtxt \
    triton_max_batch_size:8,\
tokenizer_type:auto,\
tokenizer_dir:/opt/Phi-3-mini-4k-instruct

Configure ensemble model

python3 tools/fill_template.py \
    --in_place \
    all_models/inflight_batcher_llm/ensemble/config.pbtxt \
    triton_max_batch_size:8
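
Because the four invocations differ only in the component name and the key:value list, they can be generated from one table of parameters. The following sketch builds the argument lists without running them; the names MODEL_ROOT, COMPONENTS, and build_command are hypothetical, and the values are copied from the examples above.

```python
import shlex
from typing import Dict, List

MODEL_ROOT = "all_models/inflight_batcher_llm"

# Parameter values taken from the usage examples on this page
COMMON = {"triton_max_batch_size": "8"}
TOKENIZER = {"tokenizer_type": "auto", "tokenizer_dir": "/opt/Phi-3-mini-4k-instruct"}
COMPONENTS: Dict[str, Dict[str, str]] = {
    "preprocessing": {**COMMON, **TOKENIZER},
    "tensorrt_llm": {
        **COMMON,
        "decoupled_mode": "true",
        "engine_dir": "/opt/phi-engine",
        "batching_strategy": "inflight_fused_batching",
        "kv_cache_free_gpu_mem_fraction": "0.9",
    },
    "postprocessing": {**COMMON, **TOKENIZER},
    "ensemble": dict(COMMON),
}


def build_command(component: str, params: Dict[str, str]) -> List[str]:
    """Build the fill_template.py argument list for one component."""
    substitutions = ",".join(f"{key}:{value}" for key, value in params.items())
    return [
        "python3",
        "tools/fill_template.py",
        "--in_place",
        f"{MODEL_ROOT}/{component}/config.pbtxt",
        substitutions,
    ]


commands = {name: build_command(name, params) for name, params in COMPONENTS.items()}
for cmd in commands.values():
    print(shlex.join(cmd))  # shell-quoted form, suitable for logging or review
```

To execute the commands rather than print them, each list could be passed to subprocess.run(cmd, check=True) from the root of the tensorrtllm_backend checkout.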

Key Parameters

Parameter | Component | Description | Example Value
triton_max_batch_size | All | Maximum batch size for Triton scheduling | 8
tokenizer_type | Preprocessing, Postprocessing | Tokenizer implementation type | auto
tokenizer_dir | Preprocessing, Postprocessing | Path to HuggingFace tokenizer | /opt/Phi-3-mini-4k-instruct
decoupled_mode | TensorRT-LLM | Enable streaming output | true
engine_dir | TensorRT-LLM | Path to compiled engine directory | /opt/phi-engine
batching_strategy | TensorRT-LLM | Batching mode | inflight_fused_batching
kv_cache_free_gpu_mem_fraction | TensorRT-LLM | Fraction of free GPU memory for KV cache | 0.9
