Workflow: Triton Inference Server LLM Deployment with TensorRT-LLM
| Knowledge Sources | |
|---|---|
| Domains | LLMs, Model_Serving, Inference, TensorRT |
| Last Updated | 2026-02-13 17:00 GMT |
Overview
End-to-end process for building a TensorRT-LLM engine from a Hugging Face model, deploying it on Triton Inference Server with the TensorRT-LLM backend, and benchmarking performance with GenAI-Perf.
Description
This workflow covers the complete pipeline for deploying Large Language Models (LLMs) for high-performance production inference serving. It begins with installing TensorRT-LLM and converting model weights from Hugging Face Transformers format to the TensorRT-LLM checkpoint format, then building optimized TensorRT engines with configurable precision, batch sizes, and sequence lengths. The built engines are placed into the TensorRT-LLM backend model repository with properly configured preprocessing, inference, and postprocessing ensemble steps. Finally, the workflow includes benchmarking with GenAI-Perf to measure throughput and latency under realistic workloads.
Usage
Execute this workflow when you need to deploy an LLM (such as Phi-3, Llama, GPT, or similar) for production inference serving with high throughput and low latency on NVIDIA GPUs. This is the recommended path for serving LLMs that require optimized performance through TensorRT acceleration and features like in-flight batching, KV cache management, and streaming responses.
Execution Steps
Step 1: Set up the TensorRT-LLM environment
Launch a CUDA development Docker container and install TensorRT-LLM with its dependencies. This provides the tools needed to convert model weights and build TensorRT engines. The environment requires Python 3.10 and access to NVIDIA GPUs.
Key considerations:
- Use a compatible CUDA container version that matches the TensorRT-LLM release
- Install from PyPI using the NVIDIA extra index URL
- Verify installation by importing the tensorrt_llm module
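A minimal sketch of the setup, assuming a Linux host with Docker and the NVIDIA Container Toolkit; the CUDA image tag is an assumption and should be matched to your TensorRT-LLM release per NVIDIA's support matrix:

```shell
# Launch a CUDA development container (tag is an example, not a requirement).
docker run --rm -it --gpus all --ipc=host \
  nvcr.io/nvidia/cuda:12.4.1-devel-ubuntu22.04

# Inside the container: install TensorRT-LLM from PyPI via the NVIDIA index.
pip3 install tensorrt_llm --extra-index-url https://pypi.nvidia.com

# Verify the installation by importing the module.
python3 -c "import tensorrt_llm; print(tensorrt_llm.__version__)"
```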
Step 2: Download the model weights
Retrieve the pre-trained model weights from Hugging Face using git-lfs. This downloads the full model including tokenizer files, which are needed both for engine building and for preprocessing/postprocessing in the Triton deployment.
Key considerations:
- Ensure git-lfs is installed for downloading large model files
- Some models may require Hugging Face authentication
- The tokenizer files from this download are needed later for the Triton ensemble
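A sketch of the download, using Phi-3-mini (one of the models named in the Usage section) as the example; gated models would additionally require `huggingface-cli login` first:

```shell
# git-lfs must be present, or the large weight shards will not download.
apt-get update && apt-get install -y git-lfs
git lfs install

# Clone the full model repo, including tokenizer files needed later by Triton.
git clone https://huggingface.co/microsoft/Phi-3-mini-4k-instruct
```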
Step 3: Convert weights to TensorRT-LLM format
Run the model-specific conversion script to transform Hugging Face Transformers weights into the TensorRT-LLM checkpoint format. This step handles weight reorganization, precision conversion, and any model-architecture-specific transformations required by the TensorRT-LLM engine builder.
Key considerations:
- Use the correct conversion script for your model architecture
- Specify the target precision (float16, bfloat16, or int8/int4 for quantization)
- Output is a checkpoint directory used as input for the engine build step
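A hedged sketch of the conversion, continuing the Phi-3 example; the conversion scripts live under `examples/<architecture>/` in the TensorRT-LLM repository, so the `phi` path and the output directory name below are assumptions for this model family:

```shell
# Fetch the conversion scripts that ship with TensorRT-LLM.
git clone https://github.com/NVIDIA/TensorRT-LLM.git

# Convert HF Transformers weights into a TensorRT-LLM checkpoint in float16.
python3 TensorRT-LLM/examples/phi/convert_checkpoint.py \
  --model_dir ./Phi-3-mini-4k-instruct \
  --output_dir ./phi3_checkpoint \
  --dtype float16
```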
Step 4: Build the TensorRT engine
Use trtllm-build to compile the checkpoint into optimized TensorRT engine files. Configure maximum batch size, input length, output sequence length, and tensor/pipeline parallelism settings. Enable performance plugins (e.g., GEMM plugin) to improve runtime throughput.
Key considerations:
- Engine build is hardware-specific (must match deployment GPU)
- max_batch_size, max_input_len, and max_seq_len affect memory allocation
- tp_size and pp_size control multi-GPU parallelism
- Enable gemm_plugin for improved performance
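The build step above can be sketched as a single `trtllm-build` invocation; the size limits are illustrative values, not recommendations, and must fit the deployment GPU's memory:

```shell
# Compile the checkpoint into a TensorRT engine. Sizing flags control
# memory allocation; gemm_plugin enables the optimized GEMM kernels.
trtllm-build \
  --checkpoint_dir ./phi3_checkpoint \
  --output_dir ./phi3_engine \
  --gemm_plugin float16 \
  --max_batch_size 8 \
  --max_input_len 2048 \
  --max_seq_len 4096
```

Note that tensor/pipeline parallelism (tp_size, pp_size) is typically specified at the checkpoint-conversion step, so a multi-GPU build starts back at Step 3.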
Step 5: Validate the engine locally
Run a quick inference test using the TensorRT-LLM runtime to verify the engine produces correct outputs before deploying to Triton. This catches any conversion or build issues early.
Key considerations:
- Provide the tokenizer directory for proper text encoding/decoding
- Check output coherence and formatting
- Optionally run a summarization benchmark for quantitative validation
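A quick local smoke test, assuming the TensorRT-LLM repo clone and directory names from the earlier steps; `run.py` and `summarize.py` ship in the repo's examples directory:

```shell
# Generate a short completion directly through the TensorRT-LLM runtime.
python3 TensorRT-LLM/examples/run.py \
  --engine_dir ./phi3_engine \
  --tokenizer_dir ./Phi-3-mini-4k-instruct \
  --input_text "Explain KV caching in one sentence." \
  --max_output_len 64

# Optional quantitative check: summarization benchmark against the engine.
python3 TensorRT-LLM/examples/summarize.py \
  --engine_dir ./phi3_engine \
  --hf_model_dir ./Phi-3-mini-4k-instruct \
  --test_trt_llm
```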
Step 6: Set up the Triton model repository
Clone the tensorrtllm_backend repository and copy the built engine files into the model repository structure. The repository uses an ensemble pattern with four components: preprocessing (tokenization), tensorrt_llm (engine execution), postprocessing (detokenization), and an ensemble coordinator that chains them together.
Key considerations:
- Copy engine files to the tensorrt_llm/1/ directory
- Update config.pbtxt files for all four components (ensemble, preprocessing, postprocessing, tensorrt_llm)
- Set tokenizer paths, batch sizes, batching strategy, and KV cache parameters
- Remove the BLS directory if not using Business Logic Scripting
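A sketch of the repository setup using the backend repo's `inflight_batcher_llm` model layout and its `fill_template.py` helper; the exact template parameter names vary between releases, so treat the key/value pairs below as illustrative and check the comments in each `config.pbtxt` template:

```shell
git clone https://github.com/triton-inference-server/tensorrtllm_backend.git
cd tensorrtllm_backend

# Copy the built engine into the inference component's version directory.
cp ../phi3_engine/* all_models/inflight_batcher_llm/tensorrt_llm/1/

# Fill in the config.pbtxt placeholders for two of the four components;
# the ensemble and postprocessing configs need the same treatment.
python3 tools/fill_template.py \
  -i all_models/inflight_batcher_llm/preprocessing/config.pbtxt \
  tokenizer_dir:/models/tokenizer,triton_max_batch_size:8
python3 tools/fill_template.py \
  -i all_models/inflight_batcher_llm/tensorrt_llm/config.pbtxt \
  engine_dir:/models/tensorrt_llm/1,triton_max_batch_size:8,batching_strategy:inflight_fused_batching,kv_cache_free_gpu_mem_fraction:0.9
```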
Step 7: Launch Triton with the TensorRT-LLM backend
Start the Triton Inference Server using the TensorRT-LLM-specific Docker image with the configured model repository mounted. The server loads the ensemble pipeline and exposes HTTP and gRPC endpoints for text generation requests.
Key considerations:
- Use the trtllm-python-py3 variant of the Triton container
- Mount both the model repository and tokenizer directory
- Set --shm-size appropriately for large models
- Use the launch_triton_server.py script for multi-GPU configurations
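A launch sketch; the container tag is an assumption and must be a release whose bundled TensorRT-LLM version matches the one used to build the engine, and the mount paths continue the directory names from earlier steps:

```shell
# Start an interactive Triton container with the model repo and tokenizer
# mounted; expose HTTP (8000), gRPC (8001), and metrics (8002) ports.
docker run --rm -it --gpus all --net host --shm-size=2g \
  -v $(pwd)/tensorrtllm_backend:/tensorrtllm_backend \
  -v $(pwd)/Phi-3-mini-4k-instruct:/models/tokenizer \
  nvcr.io/nvidia/tritonserver:24.07-trtllm-python-py3

# Inside the container: launch via the helper script (handles multi-GPU
# via --world_size; use 1 for a single-GPU engine).
python3 /tensorrtllm_backend/scripts/launch_triton_server.py \
  --world_size 1 \
  --model_repo /tensorrtllm_backend/all_models/inflight_batcher_llm
```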
Step 8: Send generation requests
Send text generation requests to the deployed ensemble model using the generate endpoint. Requests include the input text and generation parameters (max_tokens, temperature, stop_words). The server returns generated text through the preprocessing-inference-postprocessing pipeline.
Key considerations:
- Use the /v2/models/ensemble/generate endpoint for HTTP requests
- Configure generation parameters per request (max_tokens, temperature, top_k, top_p)
- Streaming responses are supported via Server-Sent Events
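A request sketch against the ensemble's generate endpoint; the field names follow the backend's ensemble input tensors, and the prompt and sampling values are placeholders:

```shell
# Non-streaming request via HTTP; use /generate_stream for SSE streaming.
curl -s -X POST localhost:8000/v2/models/ensemble/generate -d '{
  "text_input": "What is in-flight batching?",
  "max_tokens": 64,
  "temperature": 0.7,
  "stop_words": ["</s>"]
}'
```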
Step 9: Benchmark with GenAI-Perf
Use GenAI-Perf from the Triton SDK container to measure throughput and latency under controlled workloads. Configure input/output token lengths, concurrency levels, and tokenizer settings to simulate realistic production traffic patterns.
Key considerations:
- Run from the Triton SDK container which includes GenAI-Perf
- Set synthetic input/output token lengths to match expected workload
- Adjust concurrency to test different load levels
- Use the tokenizer for accurate token-level metrics
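A benchmarking sketch from the SDK container; the container tag, token lengths, and concurrency are example values, and the exact GenAI-Perf flags should be checked against your installed version (`genai-perf profile --help`):

```shell
# The SDK container bundles GenAI-Perf; host networking reaches the server.
docker run --rm -it --net host nvcr.io/nvidia/tritonserver:24.07-py3-sdk

# Inside the container: profile the ensemble with synthetic traffic shaped
# to the expected workload (200-token prompts, 100-token outputs).
genai-perf profile -m ensemble \
  --service-kind triton \
  --backend tensorrtllm \
  --synthetic-input-tokens-mean 200 \
  --output-tokens-mean 100 \
  --concurrency 4 \
  --tokenizer microsoft/Phi-3-mini-4k-instruct
```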