
Workflow:Triton Inference Server LLM Deployment With TRT-LLM

From Leeroopedia
Knowledge Sources
Domains LLMs, Model_Serving, Inference, TensorRT
Last Updated 2026-02-13 17:00 GMT

Overview

End-to-end process for building a TensorRT-LLM engine from a Hugging Face model, deploying it on Triton Inference Server with the TensorRT-LLM backend, and benchmarking performance with GenAI-Perf.

Description

This workflow covers the complete pipeline for deploying Large Language Models (LLMs) for high-performance production inference serving. It begins with installing TensorRT-LLM and converting model weights from Hugging Face Transformers format to TensorRT-LLM checkpoint format, then building optimized TensorRT engines with configurable precision, batch sizes, and sequence lengths. The built engines are placed into the TensorRT-LLM backend model repository with properly configured preprocessing, inference, and postprocessing ensemble steps. Finally, the workflow includes benchmarking with GenAI-Perf to measure throughput and latency under realistic workloads.

Usage

Execute this workflow when you need to deploy an LLM (such as Phi-3, Llama, GPT, or similar) for production inference serving with high throughput and low latency on NVIDIA GPUs. This is the recommended path for serving LLMs that require optimized performance through TensorRT acceleration and features like in-flight batching, KV cache management, and streaming responses.

Execution Steps

Step 1: Set up the TensorRT-LLM environment

Launch a CUDA development Docker container and install TensorRT-LLM with its dependencies. This provides the tools needed to convert model weights and build TensorRT engines. The environment requires Python 3.10 and access to NVIDIA GPUs.

Key considerations:

  • Use a compatible CUDA container version that matches the TensorRT-LLM release
  • Install from PyPI using the NVIDIA extra index URL
  • Verify installation by importing the tensorrt_llm module
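A minimal setup sketch, assuming a CUDA 12.x development image and a recent TensorRT-LLM release on NVIDIA's PyPI index (the container tag shown is illustrative; match it to the CUDA version your TensorRT-LLM release requires):

```shell
# Launch a CUDA development container with GPU access.
docker run --rm -it --gpus all --ipc=host \
  nvcr.io/nvidia/cuda:12.4.1-devel-ubuntu22.04 bash

# Inside the container: install TensorRT-LLM from NVIDIA's extra index.
pip3 install tensorrt_llm --extra-index-url https://pypi.nvidia.com

# Verify the installation by importing the module and printing its version.
python3 -c "import tensorrt_llm; print(tensorrt_llm.__version__)"
```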

Step 2: Download the model weights

Retrieve the pre-trained model weights from Hugging Face using git-lfs. This downloads the full model including tokenizer files, which are needed both for engine building and for preprocessing/postprocessing in the Triton deployment.

Key considerations:

  • Ensure git-lfs is installed for downloading large model files
  • Some models may require Hugging Face authentication
  • The tokenizer files from this download are needed later for the Triton ensemble
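As an example, downloading a Phi-3 model with git-lfs might look like the following (the model repository is illustrative; substitute the model you intend to serve):

```shell
# Install and initialize git-lfs so large weight files are fetched.
apt-get update && apt-get install -y git-lfs
git lfs install

# Clone the model repository, including tokenizer files.
git clone https://huggingface.co/microsoft/Phi-3-mini-4k-instruct

# Gated models additionally require authentication first:
#   huggingface-cli login
```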

Step 3: Convert weights to TensorRT-LLM format

Run the model-specific conversion script to transform Hugging Face Transformers weights into the TensorRT-LLM checkpoint format. This step handles weight reorganization, precision conversion, and any model-architecture-specific transformations required by the TensorRT-LLM engine builder.

Key considerations:

  • Use the correct conversion script for your model architecture
  • Specify the target precision (float16, bfloat16, or int8/int4 for quantization)
  • Output is a checkpoint directory used as input for the engine build step
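A conversion sketch, assuming the `convert_checkpoint.py` script from the matching model directory in the TensorRT-LLM examples tree (paths and model name are illustrative):

```shell
# Convert Hugging Face weights to a TensorRT-LLM checkpoint in float16.
python3 examples/phi/convert_checkpoint.py \
  --model_dir ./Phi-3-mini-4k-instruct \
  --output_dir ./phi3_checkpoint \
  --dtype float16
```

The `--dtype` flag selects the target precision; quantized formats typically go through a separate quantization path rather than this flag alone.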

Step 4: Build the TensorRT engine

Use trtllm-build to compile the checkpoint into optimized TensorRT engine files. Configure maximum batch size, input length, output sequence length, and tensor/pipeline parallelism settings. Enable performance plugins (e.g., GEMM plugin) to improve runtime throughput.

Key considerations:

  • Engine build is hardware-specific (must match deployment GPU)
  • max_batch_size, max_input_len, and max_seq_len affect memory allocation
  • tp_size and pp_size control multi-GPU parallelism
  • Enable gemm_plugin for improved performance
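An example `trtllm-build` invocation under the assumptions above (the size limits are placeholders; tune them to your workload and GPU memory):

```shell
# Compile the checkpoint into a TensorRT engine with the GEMM plugin enabled.
trtllm-build \
  --checkpoint_dir ./phi3_checkpoint \
  --output_dir ./phi3_engine \
  --gemm_plugin float16 \
  --max_batch_size 8 \
  --max_input_len 1024 \
  --max_seq_len 2048
```

Because the engine is compiled for the GPU it is built on, run this step on hardware matching the deployment target.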

Step 5: Validate the engine locally

Run a quick inference test using the TensorRT-LLM runtime to verify the engine produces correct outputs before deploying to Triton. This catches any conversion or build issues early.

Key considerations:

  • Provide the tokenizer directory for proper text encoding/decoding
  • Check output coherence and formatting
  • Optionally run a summarization benchmark for quantitative validation
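A quick smoke test using the `run.py` helper from the TensorRT-LLM examples directory (paths and prompt are illustrative):

```shell
# Run a single prompt through the built engine to verify output quality.
python3 examples/run.py \
  --engine_dir ./phi3_engine \
  --tokenizer_dir ./Phi-3-mini-4k-instruct \
  --max_output_len 64 \
  --input_text "Explain the KV cache in one sentence."
```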

Step 6: Set up the Triton model repository

Clone the tensorrtllm_backend repository and copy the built engine files into the model repository structure. The repository uses an ensemble pattern with four components: preprocessing (tokenization), tensorrt_llm (engine execution), postprocessing (detokenization), and an ensemble coordinator that chains them together.

Key considerations:

  • Copy engine files to the tensorrt_llm/1/ directory
  • Update config.pbtxt files for all four components (ensemble, preprocessing, postprocessing, tensorrt_llm)
  • Set tokenizer paths, batch sizes, batching strategy, and KV cache parameters
  • Remove the BLS directory if not using Business Logic Scripting
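A repository-staging sketch, assuming the `inflight_batcher_llm` layout and the `fill_template.py` helper shipped with tensorrtllm_backend (parameter names and values shown are examples; the helper must be run once per component config):

```shell
# Clone the backend repo and stage the built engine files.
git clone https://github.com/triton-inference-server/tensorrtllm_backend.git
cd tensorrtllm_backend
cp ../phi3_engine/* all_models/inflight_batcher_llm/tensorrt_llm/1/

# Fill in one of the config.pbtxt templates (repeat for each component).
python3 tools/fill_template.py -i \
  all_models/inflight_batcher_llm/preprocessing/config.pbtxt \
  tokenizer_dir:/models/Phi-3-mini-4k-instruct,triton_max_batch_size:8,preprocessing_instance_count:1
```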

Step 7: Launch Triton with the TRT-LLM backend

Start the Triton Inference Server using the TensorRT-LLM-specific Docker image with the configured model repository mounted. The server loads the ensemble pipeline and exposes HTTP and gRPC endpoints for text generation requests.

Key considerations:

  • Use the trtllm-python-py3 variant of the Triton container
  • Mount both the model repository and tokenizer directory
  • Set --shm-size appropriately for large models
  • Use the launch_triton_server.py script for multi-GPU configurations
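A launch sketch under the assumptions above (the container tag is illustrative; use the `trtllm-python-py3` release that matches your TensorRT-LLM version, and adjust mount paths to your layout):

```shell
# Start Triton with the model repository and tokenizer directory mounted.
docker run --rm -it --gpus all --net host --shm-size=2g \
  -v $(pwd)/tensorrtllm_backend:/tensorrtllm_backend \
  -v $(pwd)/Phi-3-mini-4k-instruct:/models/Phi-3-mini-4k-instruct \
  nvcr.io/nvidia/tritonserver:24.07-trtllm-python-py3 \
  python3 /tensorrtllm_backend/scripts/launch_triton_server.py \
    --world_size 1 \
    --model_repo /tensorrtllm_backend/all_models/inflight_batcher_llm
```

`--world_size` should match the tensor-parallel degree (`tp_size`) the engine was built with.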

Step 8: Send generation requests

Send text generation requests to the deployed ensemble model using the generate endpoint. Requests include the input text and generation parameters (max_tokens, temperature, stop_words). The server returns generated text through the preprocessing-inference-postprocessing pipeline.

Key considerations:

  • Use the /v2/models/ensemble/generate endpoint for HTTP requests
  • Configure generation parameters per request (max_tokens, temperature, top_k, top_p)
  • Streaming responses are supported via Server-Sent Events
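For example, a non-streaming HTTP request to the ensemble model (prompt and parameter values are illustrative):

```shell
# POST a generation request to the ensemble's generate endpoint.
curl -s -X POST localhost:8000/v2/models/ensemble/generate \
  -d '{
    "text_input": "What is Triton Inference Server?",
    "max_tokens": 100,
    "temperature": 0.7,
    "stop_words": [],
    "bad_words": []
  }'
```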

Step 9: Benchmark with GenAI-Perf

Use GenAI-Perf from the Triton SDK container to measure throughput and latency under controlled workloads. Configure input/output token lengths, concurrency levels, and tokenizer settings to simulate realistic production traffic patterns.

Key considerations:

  • Run from the Triton SDK container which includes GenAI-Perf
  • Set synthetic input/output token lengths to match expected workload
  • Adjust concurrency to test different load levels
  • Use the tokenizer for accurate token-level metrics
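A benchmarking sketch from the SDK container (container tag, tokenizer name, and flag values are illustrative; flag names follow recent GenAI-Perf releases):

```shell
# Profile the deployed ensemble with synthetic prompts at fixed concurrency.
docker run --rm -it --net host \
  nvcr.io/nvidia/tritonserver:24.07-py3-sdk \
  genai-perf profile \
    -m ensemble \
    --service-kind triton \
    --backend tensorrtllm \
    --synthetic-input-tokens-mean 200 \
    --output-tokens-mean 100 \
    --concurrency 4 \
    --tokenizer microsoft/Phi-3-mini-4k-instruct
```

Sweeping `--concurrency` across several values gives a throughput/latency curve rather than a single operating point.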

Execution Diagram

GitHub URL

Workflow Repository