Principle:Triton inference server Server TRT LLM Model Repository Setup

Metadata

Field	Value
Type	Principle
Principle_type	External Tool Doc
Workflow	LLM_Deployment_With_TRT_LLM
Repo	Triton_inference_server_Server
Source	docs/getting_started/llm.md:L150-260
Domains	MLOps, NLP, Model_Serving
Knowledge_Sources	TRT-LLM Docs|https://nvidia.github.io/TensorRT-LLM/, source::Repo|Triton Server|https://github.com/triton-inference-server/server
implemented_by	Implementation:Triton_inference_server_Server_Fill_Template
2026-02-13 17:00 GMT

Overview

This principle covers configuring a multi-component model repository so that Triton can serve an LLM as an ensemble of preprocessing, inference, and postprocessing stages.

Description

LLM deployment on Triton requires an ensemble pipeline with separate config.pbtxt files for preprocessing (tokenization), the TensorRT-LLM engine, postprocessing (detokenization), and an ensemble coordinator. The fill_template.py tool populates template config files with deployment-specific values.
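
The substitution step can be sketched as follows. This is a minimal illustration, not the actual fill_template.py source: it assumes the templates use ${name}-style placeholders, and the config fragment, the "engine_dir" key, and the values passed in are hypothetical.

```python
from string import Template

# Hypothetical template fragment; the shipped config.pbtxt templates are
# longer and may use different keys.
TEMPLATE = """\
name: "tensorrt_llm"
max_batch_size: ${triton_max_batch_size}
parameters: {
  key: "engine_dir"
  value: { string_value: "${engine_dir}" }
}
"""


def fill_template(template_text: str, values: dict[str, str]) -> str:
    """Replace ${name} placeholders with deployment-specific values.

    Placeholders without a matching key are left untouched, so a partially
    filled file still shows what remains to be set.
    """
    return Template(template_text).safe_substitute(values)


if __name__ == "__main__":
    # Placeholder values for illustration only.
    filled = fill_template(
        TEMPLATE,
        {"triton_max_batch_size": "4", "engine_dir": "/path/to/engines"},
    )
    print(filled)
```

Leaving unknown placeholders in place (rather than raising an error) is a design choice for this sketch: it makes it easy to grep a filled file for any remaining ${...} before launching the server.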

The ensemble pipeline consists of four components:

Preprocessing model: Handles tokenization of input text into token IDs. Configured with the tokenizer type and path, it maps text inputs to the tensor format expected by the TRT-LLM engine.
TensorRT-LLM model: The core inference engine, which runs the compiled TensorRT engine. Configured with the engine directory path, batching strategy, KV cache settings, and GPU memory management parameters.
Postprocessing model: Handles detokenization of output token IDs back into text. Configured with the same tokenizer as preprocessing.
Ensemble model: Coordinates the data flow between preprocessing, inference, and postprocessing. Defines the input/output tensor mappings and the execution order.

Each component has its own config.pbtxt file in the model repository directory structure:

all_models/inflight_batcher_llm/
  preprocessing/
    config.pbtxt
    1/
      model.py
  tensorrt_llm/
    config.pbtxt
    1/
  postprocessing/
    config.pbtxt
    1/
      model.py
  ensemble/
    config.pbtxt
    1/
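
A quick layout check before launching the server can catch missing files early. The sketch below assumes exactly the tree shown above; the helper name missing_entries is hypothetical and the check is not part of Triton.

```python
from pathlib import Path

# Component directories from the tree above.
COMPONENTS = ("preprocessing", "tensorrt_llm", "postprocessing", "ensemble")
# Components whose version directory holds a Python model file.
PYTHON_COMPONENTS = ("preprocessing", "postprocessing")


def missing_entries(repo_root: str) -> list[str]:
    """Return the expected files and directories that are absent."""
    root = Path(repo_root)
    missing = []
    for name in COMPONENTS:
        # Every component needs a config file and a version directory "1".
        if not (root / name / "config.pbtxt").is_file():
            missing.append(f"{name}/config.pbtxt")
        if not (root / name / "1").is_dir():
            missing.append(f"{name}/1/")
    for name in PYTHON_COMPONENTS:
        if not (root / name / "1" / "model.py").is_file():
            missing.append(f"{name}/1/model.py")
    return missing


if __name__ == "__main__":
    for entry in missing_entries("all_models/inflight_batcher_llm"):
        print("missing:", entry)
```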

Usage

This principle is applied after engine validation and before server launch. The model repository is the primary artifact consumed by Triton Inference Server at startup.

Workflow context:

Precedes: Principle:Triton_inference_server_Server_TRT_LLM_Server_Launch
Depends on: Principle:Triton_inference_server_Server_Engine_Validation, Principle:Triton_inference_server_Server_TensorRT_Engine_Build

Theoretical Basis

Ensemble pipeline pattern:

preprocessing → engine → postprocessing

The ensemble config coordinates this flow, while each component keeps its own distinct configuration parameters.

The ensemble pattern, combined with the TRT-LLM backend, provides several advantages for LLM serving:

Separation of concerns: Tokenization, inference, and detokenization are independently configurable and upgradeable.
Batching optimization: The TRT-LLM backend can apply inflight batching (continuous batching) to maximize GPU utilization across multiple concurrent requests.
KV cache management: The kv_cache_free_gpu_mem_fraction parameter sets the fraction of free GPU memory reserved for the KV cache. A larger value leaves room for more concurrent sequences and longer contexts, at the cost of memory available for anything else on the GPU.
Decoupled mode: Enables streaming token-by-token output, where the model sends partial responses as tokens are generated rather than waiting for the full sequence.
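
As a rough illustration of the KV cache setting, the arithmetic below assumes the budget is simply the configured fraction multiplied by the GPU memory still free once the engine is loaded. The function name and the figures are hypothetical, and the backend's real accounting may differ.

```python
GIB = 1024 ** 3


def kv_cache_budget_bytes(total_bytes: int, used_bytes: int, fraction: float) -> int:
    """Estimate the KV cache budget as a fraction of free GPU memory.

    Assumption: free memory is total minus what is already in use
    (engine weights, activations, other processes).
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    if used_bytes > total_bytes:
        raise ValueError("used_bytes cannot exceed total_bytes")
    free_bytes = total_bytes - used_bytes
    return int(free_bytes * fraction)


if __name__ == "__main__":
    # Hypothetical GPU: 80 GiB total, 20 GiB already in use.
    budget = kv_cache_budget_bytes(80 * GIB, 20 * GIB, 0.5)
    print(f"KV cache budget: {budget / GIB:.1f} GiB")
```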

Key configuration parameters:

Parameter	Component	Description
`triton_max_batch_size`	All	Maximum batch size for Triton scheduling
`tokenizer_type`	Preprocessing	Tokenizer implementation type (e.g., `auto`)
`tokenizer_dir`	Preprocessing/Postprocessing	Path to HuggingFace tokenizer files
`decoupled_mode`	TensorRT-LLM	Enable streaming output (`true`/`false`)
`engine_dir`	TensorRT-LLM	Path to compiled engine directory
`batching_strategy`	TensorRT-LLM	Batching mode (`inflight_fused_batching`)
`kv_cache_free_gpu_mem_fraction`	TensorRT-LLM	Fraction of free GPU memory for KV cache
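
The table can be turned into per-component substitution values. The sketch below assumes fill_template.py accepts its values as a single comma-separated list of key:value pairs, which should be verified against the tool's own help text. All paths and numbers are placeholders, and the helper names are hypothetical.

```python
# Value shared by every component, per the "All" row of the table.
COMMON = {"triton_max_batch_size": "4"}

# Component-specific values; paths and numbers are placeholders.
PER_COMPONENT = {
    "preprocessing": {
        "tokenizer_type": "auto",
        "tokenizer_dir": "/path/to/tokenizer",
    },
    "tensorrt_llm": {
        "decoupled_mode": "true",
        "engine_dir": "/path/to/engines",
        "batching_strategy": "inflight_fused_batching",
        "kv_cache_free_gpu_mem_fraction": "0.9",
    },
    "postprocessing": {"tokenizer_dir": "/path/to/tokenizer"},
    "ensemble": {},
}


def values_for(component: str) -> dict[str, str]:
    """Merge the shared values with the component's own values."""
    return {**COMMON, **PER_COMPONENT[component]}


def substitution_arg(values: dict[str, str]) -> str:
    """Join values as 'key:value,key:value' (assumed argument format)."""
    for key, value in values.items():
        # Commas separate pairs and the first colon separates key from value,
        # so these characters would make the joined string ambiguous.
        if ":" in key or "," in key or "," in value:
            raise ValueError(f"cannot encode {key!r}: {value!r}")
    return ",".join(f"{key}:{value}" for key, value in values.items())


if __name__ == "__main__":
    for component in PER_COMPONENT:
        print(component, "->", substitution_arg(values_for(component)))
```

Keeping the shared and per-component values in separate mappings makes it harder to fill one config with a batch size that the other three do not use.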

Related Pages

Implementation:Triton_inference_server_Server_Fill_Template
Principle:Triton_inference_server_Server_Engine_Validation — Prerequisite step
Principle:Triton_inference_server_Server_TRT_LLM_Server_Launch — Next step: server launch
