Workflow: Triton Inference Server LLM Deployment with TensorRT-LLM
| Knowledge Sources | |
|---|---|
| Domains | LLMs, Model_Serving, Inference, TensorRT |
| Last Updated | 2026-02-13 17:00 GMT |
Overview
End-to-end process for building a TensorRT-LLM engine from a Hugging Face model, deploying it on Triton Inference Server with the TensorRT-LLM backend, and benchmarking performance with GenAI-Perf.
Description
This workflow covers the complete pipeline for deploying Large Language Models (LLMs) for high-performance production inference serving. It begins with installing TensorRT-LLM and converting model weights from Hugging Face Transformers format to the TensorRT-LLM checkpoint format, then building optimized TensorRT engines with configurable precision, batch sizes, and sequence lengths. The built engines are placed into the TensorRT-LLM backend model repository with properly configured preprocessing, inference, and postprocessing ensemble steps. Finally, the workflow includes benchmarking with GenAI-Perf to measure throughput and latency under realistic workloads.
Usage
Execute this workflow when you need to deploy an LLM (such as Phi-3, Llama, GPT, or similar) for production inference serving with high throughput and low latency on NVIDIA GPUs. This is the recommended path for serving LLMs that require optimized performance through TensorRT acceleration and features like in-flight batching, KV cache management, and streaming responses.
Execution Steps
Step 1: Set up the TensorRT-LLM environment
Launch a CUDA development Docker container and install TensorRT-LLM with its dependencies. This provides the tools needed to convert model weights and build TensorRT engines. The environment requires Python 3.10 and access to NVIDIA GPUs.
Key considerations:
- Use a compatible CUDA container version that matches the TensorRT-LLM release
- Install from PyPI using the NVIDIA extra index URL
- Verify installation by importing the tensorrt_llm module
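A minimal sketch of the setup, assuming a Linux host with Docker and the NVIDIA Container Toolkit; the CUDA image tag is an assumption and should be matched to your TensorRT-LLM release per NVIDIA's support matrix:

```shell
# Launch a CUDA development container (tag is an example, not a requirement).
docker run --rm -it --gpus all --ipc=host \
  nvcr.io/nvidia/cuda:12.4.1-devel-ubuntu22.04

# Inside the container: install TensorRT-LLM from PyPI via the NVIDIA index.
pip3 install tensorrt_llm --extra-index-url https://pypi.nvidia.com

# Verify the installation by importing the module.
python3 -c "import tensorrt_llm; print(tensorrt_llm.__version__)"
```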
Step 2: Download the model weights
Retrieve the pre-trained model weights from Hugging Face using git-lfs. This downloads the full model including tokenizer files, which are needed both for engine building and for preprocessing/postprocessing in the Triton deployment.
Key considerations:
- Ensure git-lfs is installed for downloading large model files
- Some models may require Hugging Face authentication
- The tokenizer files from this download are needed later for the Triton ensemble
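A sketch of the download, using Phi-3-mini (one of the models named in the Usage section) as the example; gated models would additionally require `huggingface-cli login` first:

```shell
# git-lfs must be present, or the large weight shards will not download.
apt-get update && apt-get install -y git-lfs
git lfs install

# Clone the full model repo, including tokenizer files needed later by Triton.
git clone https://huggingface.co/microsoft/Phi-3-mini-4k-instruct
```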
Step 3: Convert weights to TensorRT-LLM format
Run the model-specific conversion script to transform Hugging Face Transformers weights into the TensorRT-LLM checkpoint format. This step handles weight reorganization, precision conversion, and any model-architecture-specific transformations required by the TensorRT-LLM engine builder.
Key considerations:
- Use the correct conversion script for your model architecture
- Specify the target precision (float16, bfloat16, or int8/int4 for quantization)
- Output is a checkpoint directory used as input for the engine build step
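A hedged sketch of the conversion, continuing the Phi-3 example; the conversion scripts live under `examples/<architecture>/` in the TensorRT-LLM repository, so the `phi` path and the output directory name below are assumptions for this model family:

```shell
# Fetch the conversion scripts that ship with TensorRT-LLM.
git clone https://github.com/NVIDIA/TensorRT-LLM.git

# Convert HF Transformers weights into a TensorRT-LLM checkpoint in float16.
python3 TensorRT-LLM/examples/phi/convert_checkpoint.py \
  --model_dir ./Phi-3-mini-4k-instruct \
  --output_dir ./phi3_checkpoint \
  --dtype float16
```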
Step 4: Build the TensorRT engine
Use trtllm-build to compile the checkpoint into optimized TensorRT engine files. Configure maximum batch size, input length, output sequence length, and tensor/pipeline parallelism settings. Enable performance plugins (e.g., GEMM plugin) to improve runtime throughput.
Key considerations:
- Engine build is hardware-specific (must match deployment GPU)
- max_batch_size, max_input_len, and max_seq_len affect memory allocation
- tp_size and pp_size control multi-GPU parallelism
- Enable gemm_plugin for improved performance
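The build step above can be sketched as a single `trtllm-build` invocation; the size limits are illustrative values, not recommendations, and must fit the deployment GPU's memory:

```shell
# Compile the checkpoint into a TensorRT engine. Sizing flags control
# memory allocation; gemm_plugin enables the optimized GEMM kernels.
trtllm-build \
  --checkpoint_dir ./phi3_checkpoint \
  --output_dir ./phi3_engine \
  --gemm_plugin float16 \
  --max_batch_size 8 \
  --max_input_len 2048 \
  --max_seq_len 4096
```

Note that tensor/pipeline parallelism (tp_size, pp_size) is typically specified at the checkpoint-conversion step, so a multi-GPU build starts back at Step 3.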
Step 5: Validate the engine locally
Run a quick inference test using the TensorRT-LLM runtime to verify the engine produces correct outputs before deploying to Triton. This catches any conversion or build issues early.
Key considerations:
- Provide the tokenizer directory for proper text encoding/decoding
- Check output coherence and formatting
- Optionally run a summarization benchmark for quantitative validation
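A quick local smoke test, assuming the TensorRT-LLM repo clone and directory names from the earlier steps; `run.py` and `summarize.py` ship in the repo's examples directory:

```shell
# Generate a short completion directly through the TensorRT-LLM runtime.
python3 TensorRT-LLM/examples/run.py \
  --engine_dir ./phi3_engine \
  --tokenizer_dir ./Phi-3-mini-4k-instruct \
  --input_text "Explain KV caching in one sentence." \
  --max_output_len 64

# Optional quantitative check: summarization benchmark against the engine.
python3 TensorRT-LLM/examples/summarize.py \
  --engine_dir ./phi3_engine \
  --hf_model_dir ./Phi-3-mini-4k-instruct \
  --test_trt_llm
```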
Step 6: Set up the Triton model repository
Clone the tensorrtllm_backend repository and copy the built engine files into the model repository structure. The repository uses an ensemble pattern with four components: preprocessing (tokenization), tensorrt_llm (engine execution), postprocessing (detokenization), and an ensemble coordinator that chains them together.
Key considerations:
- Copy engine files to the tensorrt_llm/1/ directory
- Update config.pbtxt files for all four components (ensemble, preprocessing, postprocessing, tensorrt_llm)
- Set tokenizer paths, batch sizes, batching strategy, and KV cache parameters
- Remove the BLS directory if not using Business Logic Scripting
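A sketch of the repository setup using the backend repo's `inflight_batcher_llm` model layout and its `fill_template.py` helper; the exact template parameter names vary between releases, so treat the key/value pairs below as illustrative and check the comments in each `config.pbtxt` template:

```shell
git clone https://github.com/triton-inference-server/tensorrtllm_backend.git
cd tensorrtllm_backend

# Copy the built engine into the inference component's version directory.
cp ../phi3_engine/* all_models/inflight_batcher_llm/tensorrt_llm/1/

# Fill in the config.pbtxt placeholders for two of the four components;
# the ensemble and postprocessing configs need the same treatment.
python3 tools/fill_template.py \
  -i all_models/inflight_batcher_llm/preprocessing/config.pbtxt \
  tokenizer_dir:/models/tokenizer,triton_max_batch_size:8
python3 tools/fill_template.py \
  -i all_models/inflight_batcher_llm/tensorrt_llm/config.pbtxt \
  engine_dir:/models/tensorrt_llm/1,triton_max_batch_size:8,batching_strategy:inflight_fused_batching,kv_cache_free_gpu_mem_fraction:0.9
```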
Step 7: Launch Triton with the TensorRT-LLM backend
Start the Triton Inference Server using the TensorRT-LLM-specific Docker image with the configured model repository mounted. The server loads the ensemble pipeline and exposes HTTP and gRPC endpoints for text generation requests.
Key considerations:
- Use the trtllm-python-py3 variant of the Triton container
- Mount both the model repository and tokenizer directory
- Set --shm-size appropriately for large models
- Use the launch_triton_server.py script for multi-GPU configurations
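A launch sketch; the container tag is an assumption and must be a release whose bundled TensorRT-LLM version matches the one used to build the engine, and the mount paths continue the directory names from earlier steps:

```shell
# Start an interactive Triton container with the model repo and tokenizer
# mounted; expose HTTP (8000), gRPC (8001), and metrics (8002) ports.
docker run --rm -it --gpus all --net host --shm-size=2g \
  -v $(pwd)/tensorrtllm_backend:/tensorrtllm_backend \
  -v $(pwd)/Phi-3-mini-4k-instruct:/models/tokenizer \
  nvcr.io/nvidia/tritonserver:24.07-trtllm-python-py3

# Inside the container: launch via the helper script (handles multi-GPU
# via --world_size; use 1 for a single-GPU engine).
python3 /tensorrtllm_backend/scripts/launch_triton_server.py \
  --world_size 1 \
  --model_repo /tensorrtllm_backend/all_models/inflight_batcher_llm
```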
Step 8: Send generation requests
Send text generation requests to the deployed ensemble model using the generate endpoint. Requests include the input text and generation parameters (max_tokens, temperature, stop_words). The server returns generated text through the preprocessing-inference-postprocessing pipeline.
Key considerations:
- Use the /v2/models/ensemble/generate endpoint for HTTP requests
- Configure generation parameters per request (max_tokens, temperature, top_k, top_p)
- Streaming responses are supported via Server-Sent Events
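A request sketch against the ensemble's generate endpoint; the field names follow the backend's ensemble input tensors, and the prompt and sampling values are placeholders:

```shell
# Non-streaming request via HTTP; use /generate_stream for SSE streaming.
curl -s -X POST localhost:8000/v2/models/ensemble/generate -d '{
  "text_input": "What is in-flight batching?",
  "max_tokens": 64,
  "temperature": 0.7,
  "stop_words": ["</s>"]
}'
```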
Step 9: Benchmark with GenAI-Perf
Use GenAI-Perf from the Triton SDK container to measure throughput and latency under controlled workloads. Configure input/output token lengths, concurrency levels, and tokenizer settings to simulate realistic production traffic patterns.
Key considerations:
- Run from the Triton SDK container which includes GenAI-Perf
- Set synthetic input/output token lengths to match expected workload
- Adjust concurrency to test different load levels
- Use the tokenizer for accurate token-level metrics
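A benchmarking sketch from the SDK container; the container tag, token lengths, and concurrency are example values, and the exact GenAI-Perf flags should be checked against your installed version (`genai-perf profile --help`):

```shell
# The SDK container bundles GenAI-Perf; host networking reaches the server.
docker run --rm -it --net host nvcr.io/nvidia/tritonserver:24.07-py3-sdk

# Inside the container: profile the ensemble with synthetic traffic shaped
# to the expected workload (200-token prompts, 100-token outputs).
genai-perf profile -m ensemble \
  --service-kind triton \
  --backend tensorrtllm \
  --synthetic-input-tokens-mean 200 \
  --output-tokens-mean 100 \
  --concurrency 4 \
  --tokenizer microsoft/Phi-3-mini-4k-instruct
```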