Principle:Triton inference server Server TRT LLM Server Launch
Metadata
| Field | Value |
|---|---|
| Type | Principle |
| Principle_type | External Tool Doc |
| Workflow | LLM_Deployment_With_TRT_LLM |
| Repo | Triton_inference_server_Server |
| Source | docs/getting_started/llm.md:L277-288 |
| Domains | MLOps, NLP, GPU_Computing |
| Knowledge_Sources | TRT-LLM Docs|https://nvidia.github.io/TensorRT-LLM/, source::Repo|Triton Server|https://github.com/triton-inference-server/server |
| implemented_by | Implementation:Triton_inference_server_Server_Launch_Triton_Server_Script |
| 2026-02-13 17:00 GMT |
Overview
Launching Triton Inference Server with the TensorRT-LLM backend inside the TRT-LLM-specific NGC container, with multi-GPU support.
Description
TRT-LLM models require the specialized tritonserver:*-trtllm-python-py3 container image and a launch script that coordinates multiple GPUs via MPI. Unlike a standard Triton launch, this one must also account for GPU memory management and a world_size setting.
The launch process involves:
- Container selection: Using the TRT-LLM-specific Triton container image from NVIDIA NGC, which includes the TRT-LLM backend plugin, MPI runtime, and CUDA libraries
- Docker configuration: Setting appropriate Docker flags for GPU access (--gpus all), network mode (--network host), and shared memory (--shm-size)
- Volume mounting: Mounting the model repository and engine directories into the container at known paths
- MPI coordination: The launch script spawns Triton processes across GPUs using MPI, with world_size controlling the total number of GPU processes
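The Docker configuration and volume mounting steps can be sketched as a small Python helper that assembles the `docker run` argument list. This is a minimal sketch, not the project's launch procedure: the function name `build_docker_run`, the container paths `/models` and `/engines`, and the `2g` shared-memory size are assumptions chosen for illustration, and the image tag is left for the caller to supply.

```python
import shlex


def build_docker_run(image, model_repo, engine_dir, shm_size="2g",
                     container_repo="/models", container_engines="/engines"):
    """Assemble a `docker run` argv for the TRT-LLM Triton container.

    The container-side paths and the default shm_size are illustrative
    assumptions; adjust them to match your model repository layout.
    """
    return [
        "docker", "run", "--rm", "-it",
        "--gpus", "all",                  # expose every GPU to the container
        "--network", "host",              # share the host network namespace
        f"--shm-size={shm_size}",         # raise shared memory above Docker's default
        "-v", f"{model_repo}:{container_repo}",      # model repository mount
        "-v", f"{engine_dir}:{container_engines}",   # built engine mount
        image,
    ]


if __name__ == "__main__":
    # The version part of the tag is deliberately left as a placeholder.
    cmd = build_docker_run(
        "nvcr.io/nvidia/tritonserver:<version>-trtllm-python-py3",
        "/path/to/model_repo",
        "/path/to/engines",
    )
    print(shlex.join(cmd))
```

Returning a list rather than a single string keeps the command safe to pass to `subprocess.run` without shell quoting concerns.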
The TRT-LLM Triton container differs from the standard Triton container in several ways:
- Includes the TensorRT-LLM backend shared library
- Pre-installs MPI for multi-GPU coordination
- Includes Python dependencies for preprocessing/postprocessing model scripts
- Ships with CUDA and cuDNN versions matched to TRT-LLM requirements
Usage
This principle is applied after the model repository is fully configured. It is the step that brings the serving system online and makes it available for client requests.
Workflow context:
- Precedes: Principle:Triton_inference_server_Server_Generate_API, Principle:Triton_inference_server_Server_LLM_Benchmarking
- Depends on: Principle:Triton_inference_server_Server_TRT_LLM_Model_Repository_Setup
Theoretical Basis
Multi-GPU inference coordination:
MPI-based process spawning with tensor parallelism and shared GPU memory management
Key concepts:
- world_size: Total number of GPU processes. Must match tp_size * pp_size from the engine build step. Each process loads its portion of the sharded model
- MPI process management: The launch script uses mpirun to spawn one Triton process per GPU, with each process loading the corresponding engine shard
- Shared memory (--shm-size): Docker shared memory must be increased from the default (64MB) to accommodate inter-process communication between MPI ranks and the TRT-LLM runtime's internal buffers
- Network host mode: Using --network host simplifies multi-GPU communication and exposes Triton's HTTP (8000), gRPC (8001), and metrics (8002) ports directly on the host
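The relationship between world_size and the engine build parameters, and the one-process-per-GPU spawning, can be illustrated with a short Python sketch. The helper names are hypothetical, and `build_mpirun` is a simplified stand-in for what a launch script might assemble (one `-n 1 tritonserver` segment per rank, joined with `:`), not a copy of the actual script, which may pass additional flags.

```python
def expected_world_size(tp_size, pp_size):
    """world_size is the product of tensor- and pipeline-parallel sizes."""
    if tp_size < 1 or pp_size < 1:
        raise ValueError("tp_size and pp_size must be positive integers")
    return tp_size * pp_size


def check_world_size(world_size, tp_size, pp_size):
    """Raise if world_size disagrees with the values used at engine build time."""
    expected = expected_world_size(tp_size, pp_size)
    if world_size != expected:
        raise ValueError(
            f"world_size={world_size} does not match "
            f"tp_size * pp_size = {tp_size} * {pp_size} = {expected}"
        )
    return world_size


def build_mpirun(world_size, model_repo):
    """Build an illustrative mpirun argv with one tritonserver process per rank."""
    cmd = ["mpirun"]
    for rank in range(world_size):
        if rank > 0:
            cmd.append(":")  # separates per-rank program segments
        cmd += ["-n", "1", "tritonserver", f"--model-repository={model_repo}"]
    return cmd


if __name__ == "__main__":
    # Example: an engine built with tp_size=2 and pp_size=2 needs 4 processes.
    n = check_world_size(4, tp_size=2, pp_size=2)
    print(" ".join(build_mpirun(n, "/models")))
```

Validating world_size before launching avoids starting a set of processes that cannot load a complete set of engine shards.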
The server is ready when the ensemble and each of its component models report READY status:
+-------------------+---------+--------+
| Model             | Version | Status |
+-------------------+---------+--------+
| ensemble          | 1       | READY  |
| postprocessing    | 1       | READY  |
| preprocessing     | 1       | READY  |
| tensorrt_llm      | 1       | READY  |
+-------------------+---------+--------+
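Readiness can also be checked programmatically instead of reading the startup log. The sketch below polls Triton's HTTP model-ready endpoint (`/v2/models/<name>/ready`) for each of the four models listed above. It assumes the server is reachable at `http://localhost:8000`; the function names, retry count, and delay are illustrative choices.

```python
import time
import urllib.error
import urllib.request

# Model names taken from the readiness table above.
MODELS = ("ensemble", "preprocessing", "postprocessing", "tensorrt_llm")


def model_is_ready(name, base_url="http://localhost:8000", timeout=2.0):
    """Return True if GET /v2/models/<name>/ready answers with HTTP 200."""
    url = f"{base_url}/v2/models/{name}/ready"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx response: not ready yet.
        return False


def wait_until_ready(models=MODELS, probe=model_is_ready,
                     attempts=60, delay=5.0, sleep=time.sleep):
    """Poll until every model is ready; return False if attempts run out."""
    for _ in range(attempts):
        if all(probe(m) for m in models):
            return True
        sleep(delay)
    return False


if __name__ == "__main__":
    print("ready" if wait_until_ready() else "timed out")
```

The `probe` and `sleep` parameters are injectable so the polling loop can be exercised without a running server.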
Related Pages
- Implementation:Triton_inference_server_Server_Launch_Triton_Server_Script
- Principle:Triton_inference_server_Server_TRT_LLM_Model_Repository_Setup — Prerequisite step
- Principle:Triton_inference_server_Server_Generate_API — Client-facing API after launch
- Principle:Triton_inference_server_Server_LLM_Benchmarking — Performance testing after launch