Principle:Triton inference server Server TRT LLM Server Launch

From Leeroopedia

Metadata

| Field | Value |
| Type | Principle |
| Principle_type | External Tool Doc |
| Workflow | LLM_Deployment_With_TRT_LLM |
| Repo | Triton_inference_server_Server |
| Source | docs/getting_started/llm.md:L277-288 |
| Domains | MLOps, NLP, GPU_Computing |
| Knowledge_Sources | TRT-LLM Docs (https://nvidia.github.io/TensorRT-LLM/); Triton Server repo (https://github.com/triton-inference-server/server) |
| implemented_by | Implementation:Triton_inference_server_Server_Launch_Triton_Server_Script |
2026-02-13 17:00 GMT

Overview

The process of launching Triton Inference Server with the TensorRT-LLM (TRT-LLM) backend inside a specialized NGC container, with multi-GPU support.

Description

TRT-LLM models require the specialized tritonserver:*-trtllm-python-py3 container image and a launch script that handles multi-GPU coordination via MPI. This differs from a standard Triton launch, which starts a single tritonserver process: a TRT-LLM launch must also account for GPU memory management and a world_size setting that matches the engine build.

The launch process involves:

  • Container selection — Using the TRT-LLM-specific Triton container image from NVIDIA NGC, which includes the TRT-LLM backend plugin, MPI runtime, and CUDA libraries
  • Docker configuration — Setting appropriate Docker flags for GPU access (--gpus all), network mode (--network host), and shared memory (--shm-size)
  • Volume mounting — Mounting the model repository and engine directories into the container at known paths
  • MPI coordination — The launch script spawns Triton processes across GPUs using MPI, with world_size controlling the total number of GPU processes
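The container-related steps above can be sketched as a small Python helper that assembles the `docker run` argument list. This is an illustrative sketch, not a command taken from the source: the `<xx.yy>` release tag, the `nvcr.io/nvidia/` registry prefix, the `2g` shared-memory size, the host paths, and the in-container mount points `/models` and `/engines` are all assumptions to be replaced with values for your setup.

```python
import shlex


def build_docker_run(image, model_repo, engine_dir, shm_size="2g"):
    """Assemble a `docker run` argv for a TRT-LLM Triton container.

    All default values here are illustrative assumptions.
    """
    return [
        "docker", "run", "--rm", "-it",
        "--gpus", "all",                 # expose all GPUs to the container
        "--network", "host",             # share the host network namespace
        f"--shm-size={shm_size}",        # raise shared memory above Docker's default
        "-v", f"{model_repo}:/models",   # hypothetical mount point for the model repository
        "-v", f"{engine_dir}:/engines",  # hypothetical mount point for the built engines
        image,
    ]


# <xx.yy> is a placeholder for a real NGC release tag; the paths are hypothetical.
cmd = build_docker_run(
    image="nvcr.io/nvidia/tritonserver:<xx.yy>-trtllm-python-py3",
    model_repo="/path/to/model_repo",
    engine_dir="/path/to/engines",
)
print(shlex.join(cmd))
```

Building the command as a list rather than a single string keeps each flag inspectable and avoids quoting mistakes when the paths contain spaces.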

The TRT-LLM Triton container differs from the standard Triton container in several ways:

  • Includes the TensorRT-LLM backend shared library
  • Pre-installs MPI for multi-GPU coordination
  • Includes Python dependencies for preprocessing/postprocessing model scripts
  • Ships with CUDA and cuDNN versions matched to TRT-LLM requirements

Usage

This principle is applied after the model repository is fully configured. It is the step that brings the serving system online and makes it available for client requests.

Theoretical Basis

Multi-GPU inference coordination:

MPI-based process spawning with tensor parallelism and shared GPU memory management

Key concepts:

  • world_size — Total number of GPU processes. Must match tp_size * pp_size from the engine build step. Each process loads its portion of the sharded model
  • MPI process management — The launch script uses mpirun to spawn one Triton process per GPU, with each process loading the corresponding engine shard
  • Shared memory (--shm-size) — Docker shared memory must be increased from the default (64MB) to accommodate inter-process communication between MPI ranks and the TRT-LLM runtime's internal buffers
  • Network host mode — Using --network host simplifies multi-GPU communication and exposes Triton's HTTP (8000), gRPC (8001), and metrics (8002) ports directly on the host
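To make the relationship between world_size and the MPI launch concrete, the sketch below computes world_size from the parallelism settings and builds an `mpirun` command with one single-process segment per rank, separated by colons. It is a simplified illustration and not the actual launch script; the function names are hypothetical, and the `--allow-run-as-root` flag assumes Open MPI running as root inside the container.

```python
def expected_world_size(tp_size, pp_size):
    """world_size must equal tensor-parallel size times pipeline-parallel size."""
    return tp_size * pp_size


def build_mpirun_cmd(world_size, model_repo, tritonserver="tritonserver"):
    """Build an mpirun argv that starts one Triton process per rank.

    Simplified sketch: a real launch script may pass more server flags.
    """
    if world_size < 1:
        raise ValueError("world_size must be at least 1")
    # --allow-run-as-root is an Open MPI option (assumption: Open MPI, root user).
    cmd = ["mpirun", "--allow-run-as-root"]
    for rank in range(world_size):
        if rank > 0:
            cmd.append(":")  # colon separates per-rank segments of the command
        cmd += ["-n", "1", tritonserver, f"--model-repository={model_repo}"]
    return cmd


# Example: an engine built with tp_size=2 and pp_size=2 needs four processes.
world_size = expected_world_size(tp_size=2, pp_size=2)
mpirun_cmd = build_mpirun_cmd(world_size, "/models")  # "/models" is a hypothetical path
print(" ".join(mpirun_cmd))
```

If world_size does not match the product used at engine build time, the number of processes will not line up with the number of engine shards, so checking the product before launching is a cheap safeguard.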

The server is ready when every model in the repository (the ensemble and each of its component models) reports READY status:

+-------------------+---------+--------+
| Model             | Version | Status |
+-------------------+---------+--------+
| ensemble          | 1       | READY  |
| postprocessing    | 1       | READY  |
| preprocessing     | 1       | READY  |
| tensorrt_llm      | 1       | READY  |
+-------------------+---------+--------+
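Readiness can also be checked programmatically instead of reading the startup log. The sketch below polls Triton's HTTP health endpoints (`/v2/health/ready` for the server and `/v2/models/<name>/ready` per model) on port 8000. The helper names, retry counts, and delays are assumptions chosen for illustration, and the probe function is injectable so the loop can be exercised without a running server.

```python
import time
import urllib.error
import urllib.request


def ready_url(host="localhost", port=8000, model=None):
    """Return the readiness URL for the server, or for one model if given."""
    base = f"http://{host}:{port}/v2"
    return f"{base}/models/{model}/ready" if model else f"{base}/health/ready"


def http_probe(url, timeout=2.0):
    """Return True if the endpoint answers HTTP 200, False on any failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_ready(models, probe=http_probe, attempts=30, delay=2.0):
    """Poll until the server and every listed model report ready."""
    urls = [ready_url()] + [ready_url(model=m) for m in models]
    for _ in range(attempts):
        if all(probe(u) for u in urls):
            return True
        time.sleep(delay)
    return False
```

A call such as `wait_until_ready(["ensemble", "preprocessing", "postprocessing", "tensorrt_llm"])` would return True once all four models from the table above are being served, or False if the attempts run out first.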
