Environment:Triton inference server Server TRT LLM Deployment

From Leeroopedia
Knowledge Sources
Domains: LLMs, Infrastructure
Last Updated: 2026-02-13 17:00 GMT

Overview

GPU environment with TensorRT-LLM, Python, and Git LFS for deploying large language models on Triton Inference Server.

Description

This environment provides the complete runtime stack for deploying LLMs via the TensorRT-LLM backend on Triton. It extends the base GPU CUDA runtime with TensorRT-LLM libraries, HuggingFace model download tools, and engine build utilities. The environment requires specific CUDA path variables (TRITON_CUDACRT_PATH, TRITON_CUDART_PATH, etc.) that are automatically set when the TensorRT-LLM or vLLM backends are included in the container build.

Usage

Use this environment for the LLM Deployment with TRT-LLM workflow. It is required for weight conversion, TensorRT engine building, fill-template configuration, and serving LLMs via the Triton generate endpoint. It also applies when using the vLLM backend.
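As an illustration of the serving step, the sketch below builds a request for Triton's HTTP generate endpoint using only the Python standard library. The model name "ensemble", the input fields "text_input" and "max_tokens", and the address localhost:8000 are assumptions based on common TRT-LLM backend examples; they depend on how the model repository is configured and are not confirmed by this page.

```python
import json
import urllib.request


def build_generate_request(prompt, max_tokens=64, model="ensemble",
                           base_url="http://localhost:8000"):
    """Build a POST request for /v2/models/<model>/generate.

    The model name, field names, and base URL are assumptions; adjust
    them to match the deployed model repository.
    """
    url = f"{base_url}/v2/models/{model}/generate"
    payload = {"text_input": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, **kwargs):
    """Send the request and return the decoded JSON response body."""
    request = build_generate_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.load(response)
```

Calling generate("What is Triton?", max_tokens=32) against a running server would return the response as a dictionary. Separating request construction from sending makes the payload easy to inspect without a live server.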

System Requirements

| Category | Requirement | Notes |
|---|---|---|
| OS | Ubuntu 22.04 LTS (container base) | Via NGC Triton container |
| Hardware | NVIDIA GPU with 16GB+ VRAM | A100 40GB/80GB or H100 recommended for production LLM serving |
| CUDA | CUDA 12.x | Bundled in NGC container |
| Disk | 100GB+ SSD | Model weights + TensorRT engines can be very large |
| RAM | 32GB+ | Engine build is memory-intensive |
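The disk and RAM rows can be checked before starting an engine build. The following is a minimal sketch using only the standard library; the function names are hypothetical, the thresholds are copied from the requirements listed here, and the RAM lookup assumes a Linux host. GPU VRAM is not checked.

```python
import os
import shutil

# Thresholds copied from the requirements above.
MIN_DISK_GB = 100
MIN_RAM_GB = 32
GB = 10**9


def free_disk_gb(path="."):
    """Free space on the filesystem containing `path`, in GB."""
    return shutil.disk_usage(path).free / GB


def total_ram_gb():
    """Total physical RAM in GB (assumes Linux sysconf names)."""
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / GB


def preflight(disk_gb, ram_gb):
    """Return a list of human-readable problems; empty means OK."""
    problems = []
    if disk_gb < MIN_DISK_GB:
        problems.append(f"disk: {disk_gb:.0f} GB free, need {MIN_DISK_GB} GB")
    if ram_gb < MIN_RAM_GB:
        problems.append(f"ram: {ram_gb:.0f} GB total, need {MIN_RAM_GB} GB")
    return problems
```

A typical call is preflight(free_disk_gb("/"), total_ram_gb()), run inside the container on the volume that will hold the model weights and engines.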

Dependencies

Python Packages

  • tensorrt_llm (TensorRT-LLM library, installed via pip)
  • transformers (HuggingFace model loading)
  • torch (PyTorch, bundled in container)
  • sentencepiece (tokenizer support)

System Packages

  • git-lfs (for downloading large model files from HuggingFace)
  • CUDA toolkit (bundled)
  • cuDNN (bundled)
  • TensorRT runtime (bundled)
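A quick way to confirm that the listed packages are present is to probe for them without importing them, since importing tensorrt_llm or torch can be slow. This is a sketch under the assumption that the import names match the package names listed above; the helper name is hypothetical.

```python
import importlib.util
import shutil

# Import names and binaries taken from the dependency lists above.
PYTHON_PACKAGES = ["tensorrt_llm", "transformers", "torch", "sentencepiece"]
SYSTEM_BINARIES = ["git-lfs"]


def missing_dependencies(packages=PYTHON_PACKAGES, binaries=SYSTEM_BINARIES):
    """Return names of Python packages and binaries that cannot be found."""
    # find_spec locates a top-level module without executing it.
    missing = [p for p in packages if importlib.util.find_spec(p) is None]
    # shutil.which searches PATH the same way a shell would.
    missing += [b for b in binaries if shutil.which(b) is None]
    return missing
```

An empty list from missing_dependencies() suggests the environment is complete; it does not verify versions or that the CUDA libraries load correctly.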

Environment Variables (set automatically in container)

  • TRITON_CUDACRT_PATH: /usr/local/cuda/include
  • TRITON_CUDART_PATH: /usr/local/cuda/include
  • TRITON_CUOBJDUMP_PATH: /usr/local/cuda/bin/cuobjdump
  • TRITON_CUPTI_PATH: /usr/local/cuda/include
  • TRITON_NVDISASM_PATH: /usr/local/cuda/bin/nvdisasm
  • TCMALLOC_RELEASE_RATE: 200
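When debugging a custom container build, it can help to compare the running environment against the values listed above. The sketch below does that comparison; the expected values are copied from this page, and the function name is hypothetical.

```python
import os

# Expected values copied from the list above.
EXPECTED_ENV = {
    "TRITON_CUDACRT_PATH": "/usr/local/cuda/include",
    "TRITON_CUDART_PATH": "/usr/local/cuda/include",
    "TRITON_CUOBJDUMP_PATH": "/usr/local/cuda/bin/cuobjdump",
    "TRITON_CUPTI_PATH": "/usr/local/cuda/include",
    "TRITON_NVDISASM_PATH": "/usr/local/cuda/bin/nvdisasm",
    "TCMALLOC_RELEASE_RATE": "200",
}


def env_mismatches(environ=None):
    """Map each differing variable to (expected, actual); actual is None if unset."""
    environ = os.environ if environ is None else environ
    return {
        name: (expected, environ.get(name))
        for name, expected in EXPECTED_ENV.items()
        if environ.get(name) != expected
    }
```

Calling env_mismatches() with no arguments inspects the current process environment; an empty dictionary means every variable matches.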

Credentials

  • HF_TOKEN: HuggingFace API token (required for gated model downloads)
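Failing early when the token is missing is usually clearer than a 401 partway through a large download. The sketch below reads HF_TOKEN and builds the bearer Authorization header that the Hugging Face Hub accepts; the helper name is hypothetical, and the check only confirms the variable is set, not that the token is valid.

```python
import os


def hf_auth_headers(environ=None):
    """Return an Authorization header built from HF_TOKEN, or raise if unset."""
    environ = os.environ if environ is None else environ
    token = environ.get("HF_TOKEN", "").strip()
    if not token:
        raise RuntimeError(
            "HF_TOKEN is not set; gated model downloads will fail with 401"
        )
    return {"Authorization": f"Bearer {token}"}
```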

Quick Install

# Use the official Triton TRT-LLM container
docker pull nvcr.io/nvidia/tritonserver:26.01-trtllm-python-py3

# Or install TensorRT-LLM in a Triton container
pip install tensorrt_llm

# Install git-lfs for model downloads (refresh the package index first)
apt-get update && apt-get install -y git-lfs
git lfs install

Code Evidence

CUDA path environment variables set for TRT-LLM/vLLM backends from build.py:1508-1512:

# When tensorrtllm or vllm backends included:
# TRITON_CUDACRT_PATH: /usr/local/cuda/include
# TRITON_CUDART_PATH: /usr/local/cuda/include
# TRITON_CUOBJDUMP_PATH: /usr/local/cuda/bin/cuobjdump
# TRITON_CUPTI_PATH: /usr/local/cuda/include
# TRITON_NVDISASM_PATH: /usr/local/cuda/bin/nvdisasm

vLLM version from build.py:81:

"vllm_version": "0.13.0",

tcmalloc release rate from Dockerfile.sdk:140-246:

ENV TCMALLOC_RELEASE_RATE=200

Common Errors

| Error Message | Cause | Solution |
|---|---|---|
| git-lfs not found | Git LFS not installed | apt-get install git-lfs && git lfs install |
| CUDA out of memory during engine build | Insufficient GPU VRAM for engine compilation | Use a GPU with more VRAM, or reduce model precision/size |
| HuggingFace 401 Unauthorized | Missing or invalid HF token for gated models | Set HF_TOKEN environment variable with a valid read token |
| tensorrt_llm not found | TRT-LLM not installed | pip install tensorrt_llm or use the TRT-LLM specific container image |

Compatibility Notes

  • Multi-GPU: TRT-LLM supports tensor parallelism across multiple GPUs. The launch_triton_server.py script configures MPI for multi-GPU serving.
  • Engine portability: TensorRT engines are not portable across GPU architectures. An engine built on A100 will not run on H100. Rebuild engines for each target GPU.
  • vLLM alternative: The vLLM backend (version 0.13.0) provides an alternative to TRT-LLM that does not require an engine build step, but may have different performance characteristics.
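One way to guard against the engine portability issue is to record the build GPU's compute capability next to the engine and compare it at load time. The sketch below illustrates the idea; the build-info record is a hypothetical convention, not a TensorRT-LLM feature. On a machine with PyTorch, the capability tuple can be obtained from torch.cuda.get_device_capability().

```python
import json


def make_build_info(compute_capability):
    """Create a JSON-serializable record of the GPU used for the build.

    This record format is hypothetical; store it wherever suits the
    deployment, for example in a file beside the engine.
    """
    major, minor = compute_capability
    return {"compute_capability": [major, minor]}


def engine_matches_gpu(build_info, compute_capability):
    """True if the engine was built for the same compute capability."""
    return tuple(build_info["compute_capability"]) == tuple(compute_capability)


# Compute capabilities: A100 is 8.0, H100 is 9.0.
A100 = (8, 0)
H100 = (9, 0)
```

With this convention, an engine recorded as built on A100 would be rejected on H100 before Triton attempts to load it, prompting a rebuild for the target GPU.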
