Principle:Triton inference server Server TRT LLM Model Repository Setup

From Leeroopedia

Metadata

Field              Value
Type               Principle
Principle_type     External Tool Doc
Workflow           LLM_Deployment_With_TRT_LLM
Repo               Triton_inference_server_Server
Source             docs/getting_started/llm.md:L150-260
Domains            MLOps, NLP, Model_Serving
Knowledge_Sources  TRT-LLM Docs (https://nvidia.github.io/TensorRT-LLM/); Triton Server repo (https://github.com/triton-inference-server/server)
implemented_by     Implementation:Triton_inference_server_Server_Fill_Template
2026-02-13 17:00 GMT

Overview

This principle describes how to configure a multi-component model repository for LLM ensemble serving, with preprocessing, inference, and postprocessing stages.

Description

LLM deployment on Triton requires an ensemble pipeline with separate config.pbtxt files for preprocessing (tokenization), the TensorRT-LLM engine, postprocessing (detokenization), and an ensemble coordinator. The fill_template.py tool populates template config files with deployment-specific values.
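The substitution step can be illustrated with a minimal Python sketch. It assumes the templates mark deployment-specific values with `${name}` placeholders and uses the standard library's `string.Template`; the template fragment and the tokenizer path are hypothetical, and this is not the source of fill_template.py itself.

```python
from string import Template

# Hypothetical, heavily shortened template fragment (assumption: real
# config.pbtxt templates are longer and may differ in layout).
TEMPLATE_TEXT = """\
name: "preprocessing"
max_batch_size: ${triton_max_batch_size}
parameters {
  key: "tokenizer_dir"
  value: { string_value: "${tokenizer_dir}" }
}
"""


def fill(template_text: str, values: dict[str, str]) -> str:
    """Replace ${name} placeholders; unknown placeholders are left untouched."""
    return Template(template_text).safe_substitute(values)


filled = fill(
    TEMPLATE_TEXT,
    {
        "triton_max_batch_size": "64",
        "tokenizer_dir": "/models/tokenizer",  # hypothetical path
    },
)
print(filled)
```

Using `safe_substitute` rather than `substitute` means a placeholder with no supplied value stays visible in the output instead of raising an error, which makes unfilled fields easy to spot before the server is launched.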

The ensemble pipeline consists of four components:

  • Preprocessing model — Handles tokenization of input text into token IDs. Configured with tokenizer type and path, and maps text inputs to the tensor format expected by the TRT-LLM engine
  • TensorRT-LLM model — The core inference engine that runs the compiled TRT engine. Configured with engine directory path, batching strategy, KV cache settings, and GPU memory management parameters
  • Postprocessing model — Handles detokenization of output token IDs back into text. Configured with the same tokenizer as preprocessing
  • Ensemble model — Coordinates the data flow between preprocessing, inference, and postprocessing. Defines the input/output tensor mappings and execution order

Each component has its own config.pbtxt file in the model repository directory structure:

all_models/inflight_batcher_llm/
  preprocessing/
    config.pbtxt
    1/
      model.py
  tensorrt_llm/
    config.pbtxt
    1/
  postprocessing/
    config.pbtxt
    1/
      model.py
  ensemble/
    config.pbtxt
    1/
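
Before launching the server it can help to confirm that the layout above is complete. The following is a small sketch of such a check, assuming the four component names and the version directory `1/` shown in the tree; the helper name `check_repository` is hypothetical and not part of Triton.

```python
from pathlib import Path

# Component directories taken from the layout shown above.
COMPONENTS = ("preprocessing", "tensorrt_llm", "postprocessing", "ensemble")
# Stages that ship a model.py in the layout shown above.
PYTHON_STAGES = ("preprocessing", "postprocessing")


def check_repository(repo_root: Path) -> list[str]:
    """Return a list of problems; an empty list means the layout matches."""
    problems = []
    for name in COMPONENTS:
        model_dir = repo_root / name
        if not (model_dir / "config.pbtxt").is_file():
            problems.append(f"{name}: missing config.pbtxt")
        if not (model_dir / "1").is_dir():
            problems.append(f"{name}: missing version directory 1/")
    for name in PYTHON_STAGES:
        if not (repo_root / name / "1" / "model.py").is_file():
            problems.append(f"{name}: missing 1/model.py")
    return problems
```

A call such as `check_repository(Path("all_models/inflight_batcher_llm"))` returns the missing pieces as plain strings, so the result can be printed or used to abort a deployment script.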

Usage

This principle is applied after engine validation and before server launch. The model repository is the primary artifact consumed by Triton Inference Server at startup.

Theoretical Basis

Ensemble pipeline pattern:

preprocessing → engine → postprocessing

The ensemble config coordinates these three stages, and each component has its own distinct configuration parameters.
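
The data flow can be mimicked in plain Python to show what the ensemble coordinator does: it runs each stage in order and routes named tensors between them through input and output maps. This is a toy sketch only. The three stage functions, the two-word vocabulary, and all tensor names are illustrative assumptions, and real stages run inside Triton rather than as local functions.

```python
# Toy stand-ins for the three models (hypothetical).
VOCAB = {"hello": 1, "world": 2}
INVERSE_VOCAB = {token_id: word for word, token_id in VOCAB.items()}


def preprocessing(inputs):
    # Tokenization: text -> token IDs.
    return {"INPUT_ID": [VOCAB[word] for word in inputs["QUERY"].split()]}


def engine(inputs):
    # Stand-in for generation: return the prompt tokens in reverse order.
    return {"output_ids": list(reversed(inputs["input_ids"]))}


def postprocessing(inputs):
    # Detokenization: token IDs -> text.
    return {"OUTPUT": " ".join(INVERSE_VOCAB[i] for i in inputs["TOKENS_BATCH"])}


# Each step: (model, input_map, output_map).
# input_map:  model input name  -> ensemble tensor name
# output_map: model output name -> ensemble tensor name
STEPS = [
    (preprocessing, {"QUERY": "text_input"}, {"INPUT_ID": "_INPUT_ID"}),
    (engine, {"input_ids": "_INPUT_ID"}, {"output_ids": "_TOKENS_BATCH"}),
    (postprocessing, {"TOKENS_BATCH": "_TOKENS_BATCH"}, {"OUTPUT": "text_output"}),
]


def run_ensemble(steps, request):
    """Run the steps in order, passing tensors through a shared pool."""
    pool = dict(request)
    for model, input_map, output_map in steps:
        model_inputs = {name: pool[src] for name, src in input_map.items()}
        model_outputs = model(model_inputs)
        for name, dst in output_map.items():
            pool[dst] = model_outputs[name]
    return pool


result = run_ensemble(STEPS, {"text_input": "hello world"})
print(result["text_output"])  # prints "world hello"
```

Because the stages only communicate through named tensors, any one of them can be swapped out as long as its input and output names still match the maps.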

The ensemble pattern provides several advantages for LLM serving:

  • Separation of concerns: Tokenization, inference, and detokenization are independently configurable and upgradeable
  • Batching optimization: The TRT-LLM backend can apply inflight batching (continuous batching) to maximize GPU utilization across multiple concurrent requests
  • KV cache management: The kv_cache_free_gpu_mem_fraction parameter sets the fraction of free GPU memory reserved for the KV cache; a larger cache leaves room for more concurrent sequences or longer contexts
  • Decoupled mode: Enables streaming token-by-token output, where the model sends partial responses as tokens are generated rather than waiting for the full sequence
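
To get a feel for what a given kv_cache_free_gpu_mem_fraction means in practice, the rough arithmetic below estimates how many tokens of KV cache fit in the reserved memory. It is a back-of-the-envelope sketch: the model dimensions are hypothetical, the per-token formula assumes one key and one value vector per layer and KV head, and it ignores block granularity and any other overhead the backend may add.

```python
def kv_cache_token_capacity(
    free_gpu_bytes: int,
    fraction: float,
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    bytes_per_element: int = 2,  # assumption: 16-bit cache entries
) -> int:
    """Estimate how many tokens fit in the KV cache (rough upper bound)."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    cache_bytes = int(free_gpu_bytes * fraction)
    # One key and one value vector per layer and KV head for every token.
    bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_element
    return cache_bytes // bytes_per_token


# Hypothetical model: 32 layers, 8 KV heads, head dimension 128.
# Hypothetical GPU state: 16 GiB free after the engine is loaded.
capacity = kv_cache_token_capacity(
    free_gpu_bytes=16 * 2**30,
    fraction=0.5,
    num_layers=32,
    num_kv_heads=8,
    head_dim=128,
)
print(capacity)  # prints 65536
```

That token budget is shared by all in-flight requests, which is why the same setting can serve either many short sequences or fewer long ones.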

Key configuration parameters:

Parameter                       Component                     Description
triton_max_batch_size           All                           Maximum batch size for Triton scheduling
tokenizer_type                  Preprocessing                 Tokenizer implementation type (e.g., auto)
tokenizer_dir                   Preprocessing/Postprocessing  Path to HuggingFace tokenizer files
decoupled_mode                  TensorRT-LLM                  Enable streaming output (true/false)
engine_dir                      TensorRT-LLM                  Path to compiled engine directory
batching_strategy               TensorRT-LLM                  Batching mode (inflight_fused_batching)
kv_cache_free_gpu_mem_fraction  TensorRT-LLM                  Fraction of free GPU memory for KV cache
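
When scripting a deployment, the TensorRT-LLM parameters listed above can be kept in a dictionary and joined into one substitution string. The sketch below assumes a comma-separated `key:value` format for passing values to the template tool; that format, the helper name `build_substitutions`, and the example values are assumptions for illustration and should be checked against the fill_template.py version in use.

```python
def build_substitutions(values: dict[str, object]) -> str:
    """Join parameters into 'key:value,key:value' (assumed tool input format)."""
    parts = []
    for key, value in values.items():
        # Render booleans as lowercase true/false, as in the table above.
        text = str(value).lower() if isinstance(value, bool) else str(value)
        if ":" in key or "," in key or "," in text:
            raise ValueError(f"unsupported character in {key!r}: {text!r}")
        parts.append(f"{key}:{text}")
    return ",".join(parts)


# Hypothetical values for the TensorRT-LLM component.
tensorrt_llm_values = {
    "triton_max_batch_size": 64,
    "decoupled_mode": True,
    "engine_dir": "/engines/model",  # hypothetical path
    "batching_strategy": "inflight_fused_batching",
    "kv_cache_free_gpu_mem_fraction": 0.9,
}
print(build_substitutions(tensorrt_llm_values))
```

Keeping one dictionary per component keeps shared values such as triton_max_batch_size consistent across all four config files.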
