Principle:Triton inference server Server TRT LLM Model Repository Setup
Metadata
| Field | Value |
|---|---|
| Type | Principle |
| Principle_type | External Tool Doc |
| Workflow | LLM_Deployment_With_TRT_LLM |
| Repo | Triton_inference_server_Server |
| Source | docs/getting_started/llm.md:L150-260 |
| Domains | MLOps, NLP, Model_Serving |
| Knowledge_Sources | TRT-LLM Docs (https://nvidia.github.io/TensorRT-LLM/), Triton Server repo (https://github.com/triton-inference-server/server) |
| implemented_by | Implementation:Triton_inference_server_Server_Fill_Template |
| 2026-02-13 17:00 GMT |
Overview
This principle covers configuring a multi-component model repository so that Triton can serve an LLM as an ensemble with preprocessing, inference, and postprocessing stages.
Description
LLM deployment on Triton requires an ensemble pipeline with separate config.pbtxt files for preprocessing (tokenization), the TensorRT-LLM engine, postprocessing (detokenization), and an ensemble coordinator. The fill_template.py tool populates template config files with deployment-specific values.
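The template-filling step can be pictured with a short sketch. This is not the actual fill_template.py; it only illustrates placeholder substitution, assuming templates mark values as `${name}`. The template fragment, the paths, and the function name below are hypothetical.

```python
from string import Template

# Hypothetical template fragment; a real config.pbtxt template has many more fields.
TEMPLATE_TEXT = """\
name: "preprocessing"
max_batch_size: ${triton_max_batch_size}
parameters {
  key: "tokenizer_dir"
  value: { string_value: "${tokenizer_dir}" }
}
"""


def fill_template(template_text: str, values: dict) -> str:
    """Replace ${name} placeholders with deployment-specific values.

    safe_substitute leaves unknown placeholders untouched instead of raising,
    so a partially filled template can be filled again in a later pass.
    """
    return Template(template_text).safe_substitute(values)


filled = fill_template(
    TEMPLATE_TEXT,
    {"triton_max_batch_size": "8", "tokenizer_dir": "/models/tokenizer"},
)
print(filled)
```

Leaving unknown placeholders in place is a design choice for this sketch: it makes a missed value easy to spot by searching the output for `${`.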
The ensemble pipeline consists of four components:
- Preprocessing model — Handles tokenization of input text into token IDs. Configured with tokenizer type and path, and maps text inputs to the tensor format expected by the TRT-LLM engine
- TensorRT-LLM model — The core inference engine that runs the compiled TRT engine. Configured with engine directory path, batching strategy, KV cache settings, and GPU memory management parameters
- Postprocessing model — Handles detokenization of output token IDs back into text. Configured with the same tokenizer as preprocessing
- Ensemble model — Coordinates the data flow between preprocessing, inference, and postprocessing. Defines the input/output tensor mappings and execution order
Each component has its own config.pbtxt file in the model repository directory structure:
    all_models/inflight_batcher_llm/
        preprocessing/
            config.pbtxt
            1/
                model.py
        tensorrt_llm/
            config.pbtxt
            1/
        postprocessing/
            config.pbtxt
            1/
                model.py
        ensemble/
            config.pbtxt
            1/
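Before launching the server it can help to confirm that the layout is complete. The following is a minimal sketch of such a check, assuming the four component directories listed above, a version directory named `1`, and a `model.py` inside the version directory of the preprocessing and postprocessing components. The helper name `missing_entries` is hypothetical and not part of Triton.

```python
import tempfile
from pathlib import Path

COMPONENTS = ("preprocessing", "tensorrt_llm", "postprocessing", "ensemble")
# Components assumed to ship a Python model file in their version directory.
PYTHON_COMPONENTS = ("preprocessing", "postprocessing")


def missing_entries(repo_root: Path) -> list:
    """Return relative paths that the expected layout needs but the repo lacks."""
    missing = []
    for name in COMPONENTS:
        if not (repo_root / name / "config.pbtxt").is_file():
            missing.append(f"{name}/config.pbtxt")
        if not (repo_root / name / "1").is_dir():
            missing.append(f"{name}/1/")
        elif name in PYTHON_COMPONENTS and not (repo_root / name / "1" / "model.py").is_file():
            missing.append(f"{name}/1/model.py")
    return missing


# Demonstration on a throwaway skeleton rather than a real repository.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for name in COMPONENTS:
        (root / name / "1").mkdir(parents=True)
        (root / name / "config.pbtxt").write_text("")
    for name in PYTHON_COMPONENTS:
        (root / name / "1" / "model.py").write_text("")
    complete = missing_entries(root)

    (root / "ensemble" / "config.pbtxt").unlink()
    incomplete = missing_entries(root)

print(complete)
print(incomplete)
```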
Usage
This principle is applied after engine validation and before server launch. The model repository is the primary artifact consumed by Triton Inference Server at startup.
Workflow context:
- Precedes: Principle:Triton_inference_server_Server_TRT_LLM_Server_Launch
- Depends on: Principle:Triton_inference_server_Server_Engine_Validation, Principle:Triton_inference_server_Server_TensorRT_Engine_Build
Theoretical Basis
Ensemble pipeline pattern:
preprocessing → engine → postprocessing
The ensemble config coordinates this flow, and each component has its own configuration parameters.
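The data flow can be mimicked with three plain functions and a coordinator. This is a toy illustration only: the vocabulary is made up, and the `engine` function is a stand-in that repeats the last input token instead of running a TensorRT engine.

```python
# Toy vocabulary; a real deployment loads the tokenizer files from tokenizer_dir.
VOCAB = {"hello": 1, "world": 2, "again": 3}
INVERSE_VOCAB = {token_id: word for word, token_id in VOCAB.items()}


def preprocess(text: str) -> list:
    """Tokenization stage: text -> token IDs."""
    return [VOCAB[word] for word in text.lower().split()]


def engine(token_ids: list, max_new_tokens: int = 2) -> list:
    """Stand-in for the inference stage: token IDs -> output token IDs.

    It simply repeats the last input token; a real engine decodes autoregressively.
    """
    return [token_ids[-1]] * max_new_tokens


def postprocess(token_ids: list) -> str:
    """Detokenization stage: token IDs -> text."""
    return " ".join(INVERSE_VOCAB[token_id] for token_id in token_ids)


def ensemble(text: str) -> str:
    """Coordinator: feeds each stage's output into the next stage's input."""
    return postprocess(engine(preprocess(text)))


result = ensemble("Hello world")
print(result)
```

Because each stage only depends on the tensor format of its neighbor, any one of them can be swapped out without touching the others, which is the point of the pattern.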
The ensemble pattern provides several advantages for LLM serving:
- Separation of concerns — Tokenization, inference, and detokenization are independently configurable and upgradeable
- Batching optimization — The TRT-LLM backend can apply inflight batching (continuous batching) to maximize GPU utilization across multiple concurrent requests
- KV cache management — The kv_cache_free_gpu_mem_fraction parameter controls how much GPU memory is reserved for the KV cache, balancing between batch size capacity and context length
- Decoupled mode — Enables streaming token-by-token output, where the model sends partial responses as tokens are generated rather than waiting for the full sequence
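A back-of-the-envelope calculation shows how the KV cache fraction trades off against sequence capacity. All model dimensions and memory figures below are hypothetical, and the estimate ignores details such as block granularity and other runtime allocations, so treat it as a rough sketch rather than how the backend accounts for memory.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int, dtype_bytes: int) -> int:
    """Bytes of KV cache one token occupies: a key and a value vector per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def kv_cache_token_capacity(free_gpu_bytes: int, fraction: float, bytes_per_token: int) -> int:
    """Tokens that fit when `fraction` of the free GPU memory goes to the KV cache."""
    return int(free_gpu_bytes * fraction) // bytes_per_token


# Hypothetical model: 32 layers, 8 KV heads, head dimension 128, 2-byte (fp16) values.
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128, dtype_bytes=2)

# Hypothetical GPU state: 40 GiB free after loading the engine, fraction set to 0.5.
free_bytes = 40 * 1024**3
capacity = kv_cache_token_capacity(free_bytes, 0.5, per_token)

# With a hypothetical 4096-token limit per sequence, this many sequences fit at once.
max_full_length_sequences = capacity // 4096

print(per_token, capacity, max_full_length_sequences)
```

Under these assumptions, raising the fraction increases the token capacity linearly, which can be spent either on more concurrent sequences or on longer contexts.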
Key configuration parameters:
| Parameter | Component | Description |
|---|---|---|
| triton_max_batch_size | All | Maximum batch size for Triton scheduling |
| tokenizer_type | Preprocessing | Tokenizer implementation type (e.g., auto) |
| tokenizer_dir | Preprocessing/Postprocessing | Path to HuggingFace tokenizer files |
| decoupled_mode | TensorRT-LLM | Enable streaming output (true/false) |
| engine_dir | TensorRT-LLM | Path to compiled engine directory |
| batching_strategy | TensorRT-LLM | Batching mode (inflight_fused_batching) |
| kv_cache_free_gpu_mem_fraction | TensorRT-LLM | Fraction of free GPU memory for KV cache |
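One way to keep these parameters organized is to group them per component and validate them before filling the templates. The sketch below uses the parameter names from the table; the values, the paths, the helper names, and the comma-separated `key:value` serialization are assumptions made for illustration, not a documented interface.

```python
# Hypothetical values; only the parameter names come from the table.
COMPONENT_PARAMS = {
    "preprocessing": {
        "triton_max_batch_size": 8,
        "tokenizer_type": "auto",
        "tokenizer_dir": "/models/tokenizer",
    },
    "tensorrt_llm": {
        "triton_max_batch_size": 8,
        "decoupled_mode": "true",
        "engine_dir": "/models/engine",
        "batching_strategy": "inflight_fused_batching",
        "kv_cache_free_gpu_mem_fraction": 0.5,
    },
    "postprocessing": {
        "triton_max_batch_size": 8,
        "tokenizer_dir": "/models/tokenizer",
    },
    "ensemble": {
        "triton_max_batch_size": 8,
    },
}


def validate(params: dict) -> None:
    """Raise ValueError for values that are clearly out of range."""
    if "triton_max_batch_size" in params and int(params["triton_max_batch_size"]) < 0:
        raise ValueError("triton_max_batch_size must be a non-negative integer")
    if "decoupled_mode" in params and params["decoupled_mode"] not in ("true", "false"):
        raise ValueError("decoupled_mode must be 'true' or 'false'")
    if "kv_cache_free_gpu_mem_fraction" in params:
        fraction = float(params["kv_cache_free_gpu_mem_fraction"])
        if not 0.0 < fraction <= 1.0:
            raise ValueError("kv_cache_free_gpu_mem_fraction must be in (0, 1]")


def to_arg_string(params: dict) -> str:
    """Serialize validated parameters as comma-separated key:value pairs."""
    validate(params)
    return ",".join(f"{key}:{value}" for key, value in params.items())


for component, params in COMPONENT_PARAMS.items():
    print(component, to_arg_string(params))
```

Keeping shared values such as the batch size and tokenizer path identical across components avoids mismatches between the stages of the ensemble.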
Related Pages
- Implementation:Triton_inference_server_Server_Fill_Template
- Principle:Triton_inference_server_Server_Engine_Validation — Prerequisite step
- Principle:Triton_inference_server_Server_TRT_LLM_Server_Launch — Next step: server launch