Principle:Triton inference server Server TRT LLM Server Launch

Metadata

Field	Value
Type	Principle
Principle_type	External Tool Doc
Workflow	LLM_Deployment_With_TRT_LLM
Repo	Triton_inference_server_Server
Source	docs/getting_started/llm.md:L277-288
Domains	MLOps, NLP, GPU_Computing
Knowledge_Sources	TRT-LLM Docs (https://nvidia.github.io/TensorRT-LLM/), Repo: Triton Server (https://github.com/triton-inference-server/server)
implemented_by	Implementation:Triton_inference_server_Server_Launch_Triton_Server_Script
2026-02-13 17:00 GMT

Overview

Launching Triton Inference Server with the TensorRT-LLM backend inside the TRT-LLM-specific NGC container, with multi-GPU support.

Description

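TRT-LLM models require the specialized tritonserver:*-trtllm-python-py3 container image and a launch script that coordinates multiple GPUs via MPI. This differs from a standard Triton launch in two ways: the container must be started with enough shared memory and GPU access for several cooperating processes, and the launch script must be given a world_size that matches the parallelism the engine was built with.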

The launch process involves:

Container selection — Using the TRT-LLM-specific Triton container image from NVIDIA NGC, which includes the TRT-LLM backend plugin, MPI runtime, and CUDA libraries
Docker configuration — Setting appropriate Docker flags for GPU access (--gpus all), network mode (--network host), and shared memory (--shm-size)
Volume mounting — Mounting the model repository and engine directories into the container at known paths
MPI coordination — The launch script spawns Triton processes across GPUs using MPI, with world_size controlling the total number of GPU processes
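
The container-related steps above can be sketched as a small Python helper that assembles a docker run command line. This is a minimal sketch rather than the documented invocation: the helper name build_docker_run, the in-container mount points /models and /engines, and the 2g shared-memory default are assumptions made for illustration, and the image tag is left as a placeholder because the source only specifies the tritonserver:*-trtllm-python-py3 pattern.

```python
import shlex


def build_docker_run(image, model_repo, engine_dir, shm_size="2g"):
    """Return the argv for `docker run` with the flags described above.

    The container paths /models and /engines are hypothetical mount
    points chosen for this sketch; use whatever paths your model
    repository configuration expects.
    """
    return [
        "docker", "run", "--rm", "-it",
        "--gpus", "all",               # expose all host GPUs to the container
        "--network", "host",           # share the host network namespace
        f"--shm-size={shm_size}",      # raise shared memory above Docker's 64MB default
        "-v", f"{model_repo}:/models",   # model repository (assumed mount point)
        "-v", f"{engine_dir}:/engines",  # built TRT-LLM engines (assumed mount point)
        image,
    ]


if __name__ == "__main__":
    argv = build_docker_run(
        # Placeholder tag: substitute a real release for <xx.yy>.
        image="nvcr.io/nvidia/tritonserver:<xx.yy>-trtllm-python-py3",
        model_repo="/path/to/model_repo",
        engine_dir="/path/to/engines",
    )
    # Print a shell-quoted command that can be reviewed before running it.
    print(shlex.join(argv))
```

Building the command as a list and printing it, instead of executing it directly, makes it easy to check the flags before anything is started.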

The TRT-LLM Triton container differs from the standard Triton container in several ways:

Includes the TensorRT-LLM backend shared library
Pre-installs MPI for multi-GPU coordination
Includes Python dependencies for preprocessing/postprocessing model scripts
Ships with CUDA and cuDNN versions matched to TRT-LLM requirements

Usage

This principle is applied after the model repository is fully configured. It is the step that brings the serving system online and makes it available for client requests.

Workflow context:

Precedes: Principle:Triton_inference_server_Server_Generate_API, Principle:Triton_inference_server_Server_LLM_Benchmarking
Depends on: Principle:Triton_inference_server_Server_TRT_LLM_Model_Repository_Setup

Theoretical Basis

Multi-GPU inference coordination:

MPI-based process spawning with tensor parallelism and shared GPU memory management

Key concepts:

world_size — Total number of GPU processes. Must match tp_size * pp_size from the engine build step. Each process loads its portion of the sharded model
MPI process management — The launch script uses mpirun to spawn one Triton process per GPU, with each process loading the corresponding engine shard
Shared memory (--shm-size) — Docker shared memory must be increased from the default (64MB) to accommodate inter-process communication between MPI ranks and the TRT-LLM runtime's internal buffers
Network host mode — Using --network host simplifies multi-GPU communication and exposes Triton's HTTP (8000), gRPC (8001), and metrics (8002) ports directly on the host
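
The relationship between world_size and the engine's parallelism, and the one-process-per-GPU layout, can be illustrated as follows. This is a simplified sketch under stated assumptions, not the actual launch script: the function names are hypothetical, the colon-separated mpirun form is the standard Open MPI way to start several single-rank programs in one job, and a real launch script may pass additional tritonserver options that are omitted here.

```python
def expected_world_size(tp_size, pp_size):
    """world_size must equal tensor-parallel size times pipeline-parallel size."""
    if tp_size < 1 or pp_size < 1:
        raise ValueError("tp_size and pp_size must be positive integers")
    return tp_size * pp_size


def build_mpirun(world_size, model_repo, tp_size, pp_size):
    """Return an mpirun argv that starts one tritonserver process per rank.

    Raises ValueError when world_size does not match the parallelism the
    engine was built with, since each rank loads exactly one engine shard.
    """
    expected = expected_world_size(tp_size, pp_size)
    if world_size != expected:
        raise ValueError(
            f"world_size={world_size} does not match "
            f"tp_size * pp_size = {expected}"
        )
    # --allow-run-as-root is needed because containers typically run as root.
    argv = ["mpirun", "--allow-run-as-root"]
    for rank in range(world_size):
        if rank > 0:
            argv.append(":")  # separates the per-rank program specifications
        argv += ["-n", "1", "tritonserver", f"--model-repository={model_repo}"]
    return argv


if __name__ == "__main__":
    # Example: an engine built with tp_size=2 and pp_size=1 needs world_size=2.
    print(" ".join(build_mpirun(2, "/models", tp_size=2, pp_size=1)))
```

Validating world_size before spawning anything turns a mismatch with the engine build into an immediate error message instead of a failure partway through model loading.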

The server is ready when every model in the repository, including the ensemble and its component models, reports READY status in the startup log:

+-------------------+---------+--------+
| Model             | Version | Status |
+-------------------+---------+--------+
| ensemble          | 1       | READY  |
| postprocessing    | 1       | READY  |
| preprocessing     | 1       | READY  |
| tensorrt_llm      | 1       | READY  |
+-------------------+---------+--------+
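
Readiness can also be checked programmatically instead of by reading the log. The sketch below polls Triton's HTTP health endpoint /v2/health/ready on port 8000 until it answers with status 200. The helper names, the retry count, and the polling interval are assumptions chosen for illustration, and the URL assumes the server is reachable on localhost, as it would be with --network host.

```python
import time
import urllib.error
import urllib.request


def http_ready(url="http://localhost:8000/v2/health/ready", timeout_s=2.0):
    """Return True if the readiness endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx answer: not ready yet.
        return False


def wait_until_ready(probe=http_ready, attempts=30, interval_s=2.0,
                     sleep=time.sleep):
    """Call probe() up to `attempts` times, pausing between tries.

    Returns True as soon as probe() succeeds, or False if every attempt
    fails. `probe` and `sleep` are injectable so the loop can be tested
    without a running server.
    """
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            sleep(interval_s)
    return False


if __name__ == "__main__":
    print("ready" if wait_until_ready() else "server did not become ready")
```

Large engines can take a while to load, so the attempt count and interval usually need to be tuned to the model size.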

Related Pages

Implementation:Triton_inference_server_Server_Launch_Triton_Server_Script
Principle:Triton_inference_server_Server_TRT_LLM_Model_Repository_Setup — Prerequisite step
Principle:Triton_inference_server_Server_Generate_API — Client-facing API after launch
Principle:Triton_inference_server_Server_LLM_Benchmarking — Performance testing after launch
