Environment:Triton inference server Server TRT LLM Deployment
| Knowledge Sources | |
|---|---|
| Domains | LLMs, Infrastructure |
| Last Updated | 2026-02-13 17:00 GMT |
Overview
GPU environment with TensorRT-LLM, Python, and Git LFS for deploying large language models on Triton Inference Server.
Description
This environment provides the complete runtime stack for deploying LLMs via the TensorRT-LLM backend on Triton. It extends the base GPU CUDA runtime with TensorRT-LLM libraries, HuggingFace model download tools, and engine build utilities. It relies on a set of CUDA path variables (TRITON_CUDACRT_PATH, TRITON_CUDART_PATH, TRITON_CUOBJDUMP_PATH, TRITON_CUPTI_PATH, TRITON_NVDISASM_PATH), which the container build sets automatically when the TensorRT-LLM or vLLM backend is included.
Usage
Use this environment for the LLM Deployment with TRT-LLM workflow. It is required for weight conversion, TensorRT engine building, fill-template configuration, and serving LLMs via the Triton generate endpoint. It also applies when using the vLLM backend.
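As an illustration of the final serving step, the sketch below builds a request for Triton's HTTP generate endpoint (`POST /v2/models/<model>/generate`). The model name `ensemble`, the address `localhost:8000`, and the `text_input` / `max_tokens` field names are assumptions based on a typical TRT-LLM model repository; adjust them to match the deployed model's configuration.

```python
import json
import urllib.request


def build_generate_request(prompt, max_tokens=64, model="ensemble",
                           base_url="http://localhost:8000"):
    """Build (but do not send) a POST request for the Triton generate endpoint.

    The model name, base URL, and input field names are assumptions; they
    must match the model repository that is actually being served.
    """
    payload = {"text_input": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        url=f"{base_url}/v2/models/{model}/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, **kwargs):
    """Send the request and return the decoded JSON response.

    Requires a running Triton server reachable at the configured base URL.
    """
    request = build_generate_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.load(response)
```

Separating request construction from sending keeps the payload easy to inspect before a server is available; `generate("What is Triton?", max_tokens=32)` would then issue the call.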
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| OS | Ubuntu 22.04 LTS (container base) | Via NGC Triton container |
| Hardware | NVIDIA GPU with 16GB+ VRAM | A100 40GB/80GB or H100 recommended for production LLM serving |
| CUDA | CUDA 12.x | Bundled in NGC container |
| Disk | 100GB+ SSD | Must hold both the downloaded model weights and the built TensorRT engines |
| RAM | 32GB+ | Engine build is memory-intensive |
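The hardware rows above can be checked before starting a long engine build. The following is a minimal preflight sketch, not part of the Triton tooling: it parses the output of `nvidia-smi --query-gpu=name,memory.total --format=csv,noheader,nounits` and reports GPUs below the VRAM floor. The helper names and the 15000 MiB threshold are assumptions (nominal 16GB cards usually report slightly less than 16384 MiB).

```python
import shutil
import subprocess

# Assumed thresholds derived from the requirements table; the VRAM floor
# leaves slack because nominal 16GB cards report a little under 16384 MiB.
MIN_VRAM_MIB = 15_000
MIN_DISK_GIB = 100


def parse_gpu_memory(csv_text):
    """Parse 'name, memory_in_MiB' lines into (name, mib) tuples."""
    gpus = []
    for line in csv_text.strip().splitlines():
        name, mem = line.rsplit(",", 1)
        gpus.append((name.strip(), int(mem.strip())))
    return gpus


def query_gpus():
    """Ask nvidia-smi for GPU names and total memory (needs an NVIDIA driver)."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total",
         "--format=csv,noheader,nounits"],
        check=True, capture_output=True, text=True,
    )
    return parse_gpu_memory(result.stdout)


def undersized_gpus(gpus, min_mib=MIN_VRAM_MIB):
    """Return the names of GPUs whose total memory is below the floor."""
    return [name for name, mib in gpus if mib < min_mib]


def free_disk_gib(path="/"):
    """Free space in GiB on the filesystem that contains `path`."""
    return shutil.disk_usage(path).free / 1024**3
```

On a GPU host, `undersized_gpus(query_gpus())` lists any device that falls short, and `free_disk_gib("/")` can be compared against `MIN_DISK_GIB`.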
Dependencies
Python Packages
- tensorrt_llm (TensorRT-LLM library, installed via pip)
- transformers (HuggingFace model loading)
- torch (PyTorch, bundled in container)
- sentencepiece (tokenizer support)
System Packages
- git-lfs (for downloading large model files from HuggingFace)
- CUDA toolkit (bundled)
- cuDNN (bundled)
- TensorRT runtime (bundled)
Environment Variables (set automatically in container)
- TRITON_CUDACRT_PATH: /usr/local/cuda/include
- TRITON_CUDART_PATH: /usr/local/cuda/include
- TRITON_CUOBJDUMP_PATH: /usr/local/cuda/bin/cuobjdump
- TRITON_CUPTI_PATH: /usr/local/cuda/include
- TRITON_NVDISASM_PATH: /usr/local/cuda/bin/nvdisasm
- TCMALLOC_RELEASE_RATE: 200
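Because these variables are set by the container build rather than by the user, a quick sanity check can help when running in a custom image. The sketch below compares the current environment with the TRITON_* values listed above; the function name `check_cuda_paths` is hypothetical, and a mismatch is not necessarily an error if CUDA is installed elsewhere.

```python
import os

# Expected values copied from the list above.
EXPECTED_CUDA_PATHS = {
    "TRITON_CUDACRT_PATH": "/usr/local/cuda/include",
    "TRITON_CUDART_PATH": "/usr/local/cuda/include",
    "TRITON_CUOBJDUMP_PATH": "/usr/local/cuda/bin/cuobjdump",
    "TRITON_CUPTI_PATH": "/usr/local/cuda/include",
    "TRITON_NVDISASM_PATH": "/usr/local/cuda/bin/nvdisasm",
}


def check_cuda_paths(env=None):
    """Return {variable: problem} for variables that are unset or differ."""
    env = os.environ if env is None else env
    problems = {}
    for name, expected in EXPECTED_CUDA_PATHS.items():
        actual = env.get(name)
        if actual is None:
            problems[name] = "unset"
        elif actual != expected:
            problems[name] = f"expected {expected}, got {actual}"
    return problems
```

An empty result from `check_cuda_paths()` means the environment matches the values documented here.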
Credentials
- HF_TOKEN: HuggingFace API token (required for gated model downloads)
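A missing token typically surfaces only partway into a download, so it can be worth failing early. The sketch below validates HF_TOKEN before handing it to `huggingface_hub.snapshot_download`; it assumes the `huggingface_hub` package is available (it is normally installed alongside `transformers`), and the helper names and any repository ID passed in are illustrative.

```python
import os


def require_hf_token(env=None):
    """Return the HuggingFace token, or raise if HF_TOKEN is missing or empty."""
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN", "").strip()
    if not token:
        raise RuntimeError(
            "HF_TOKEN is not set; gated model downloads will fail "
            "with 401 Unauthorized"
        )
    return token


def download_model(repo_id, local_dir):
    """Download a model snapshot into local_dir using the validated token."""
    # Imported lazily so the token check works without huggingface_hub.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=repo_id,
        local_dir=local_dir,
        token=require_hf_token(),
    )
```

This is an alternative to cloning with Git LFS; either route needs a valid read token for gated models.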
Quick Install
# Use the official Triton TRT-LLM container
docker pull nvcr.io/nvidia/tritonserver:26.01-trtllm-python-py3
# Or install TensorRT-LLM in a Triton container
pip install tensorrt_llm
# Install git-lfs for model downloads
apt-get update && apt-get install -y git-lfs
git lfs install
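After installation, a short check can confirm that both pieces are visible from Python. This is a minimal sketch using only the standard library; the function name `check_install` is hypothetical, and it only tests that the package and binary can be found, not that they work.

```python
import importlib.util
import shutil


def check_install():
    """Report whether tensorrt_llm is importable and git-lfs is on PATH."""
    return {
        # find_spec locates the package without actually importing it.
        "tensorrt_llm": importlib.util.find_spec("tensorrt_llm") is not None,
        # git invokes the LFS extension through the git-lfs executable.
        "git-lfs": shutil.which("git-lfs") is not None,
    }
```

A `False` entry corresponds to the "tensorrt_llm not found" or "git-lfs not found" errors listed under Common Errors.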
Code Evidence
CUDA path environment variables set for TRT-LLM/vLLM backends from build.py:1508-1512:
# When tensorrtllm or vllm backends included:
# TRITON_CUDACRT_PATH: /usr/local/cuda/include
# TRITON_CUDART_PATH: /usr/local/cuda/include
# TRITON_CUOBJDUMP_PATH: /usr/local/cuda/bin/cuobjdump
# TRITON_CUPTI_PATH: /usr/local/cuda/include
# TRITON_NVDISASM_PATH: /usr/local/cuda/bin/nvdisasm
vLLM version from build.py:81:
"vllm_version": "0.13.0",
tcmalloc release rate from Dockerfile.sdk:140-246:
ENV TCMALLOC_RELEASE_RATE=200
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| git-lfs not found | Git LFS not installed | apt-get update && apt-get install -y git-lfs && git lfs install |
| CUDA out of memory during engine build | Insufficient GPU VRAM for engine compilation | Use a GPU with more VRAM, or reduce model precision/size |
| HuggingFace 401 Unauthorized | Missing or invalid HF token for gated models | Set HF_TOKEN environment variable with a valid read token |
| tensorrt_llm not found | TRT-LLM not installed | pip install tensorrt_llm, or use the TRT-LLM-specific container image |
Compatibility Notes
- Multi-GPU: TRT-LLM supports tensor parallelism across multiple GPUs. The launch_triton_server.py script configures MPI for multi-GPU serving.
- Engine portability: TensorRT engines are not portable across GPU architectures. An engine built on A100 will not run on H100. Rebuild engines for each target GPU.
- vLLM alternative: The vLLM backend (version 0.13.0) provides an alternative to TRT-LLM that does not require an engine build step, but may have different performance characteristics.
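To make the engine portability note concrete, one option is to record the build GPU's compute capability next to the engine and compare it before serving. The sketch below is a hypothetical convention, not a TensorRT-LLM feature: the manifest file name `build_gpu.json` and both helper names are assumptions. A100 reports compute capability 8.0 and H100 reports 9.0.

```python
import json
from pathlib import Path

# Hypothetical sidecar file; TensorRT-LLM does not create or read it.
MANIFEST_NAME = "build_gpu.json"


def record_build_gpu(engine_dir, compute_cap):
    """Write the build GPU's compute capability next to the engine files."""
    path = Path(engine_dir) / MANIFEST_NAME
    path.write_text(json.dumps({"compute_cap": compute_cap}))
    return path


def engine_usable_on(engine_dir, compute_cap):
    """True only if a manifest exists and matches the target GPU."""
    path = Path(engine_dir) / MANIFEST_NAME
    if not path.exists():
        # Unknown provenance: treat as not usable and rebuild.
        return False
    return json.loads(path.read_text()).get("compute_cap") == compute_cap
```

On recent drivers the current value can be read with `nvidia-smi --query-gpu=compute_cap --format=csv,noheader`; if `engine_usable_on` returns `False`, rebuild the engine on the target GPU.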
Related Pages
- Implementation:Triton_inference_server_Server_Pip_Install_Tensorrt_LLM
- Implementation:Triton_inference_server_Server_Git_LFS_Clone
- Implementation:Triton_inference_server_Server_Convert_Checkpoint
- Implementation:Triton_inference_server_Server_Trtllm_Build
- Implementation:Triton_inference_server_Server_TRT_LLM_Run
- Implementation:Triton_inference_server_Server_Fill_Template
- Implementation:Triton_inference_server_Server_Launch_Triton_Server_Script
- Implementation:Triton_inference_server_Server_HTTP_Generate_Endpoint
- Implementation:Triton_inference_server_Server_GenAI_Perf