Implementation:Triton inference server Server Launch Triton Server Script

Metadata

Field	Value
Type	Implementation
Workflow	LLM_Deployment_With_TRT_LLM
Repo	Triton_inference_server_Server
Source	docs/getting_started/llm.md:L277-288
Domains	MLOps, NLP, GPU_Computing
Knowledge_Sources	TRT-LLM Docs (https://nvidia.github.io/TensorRT-LLM/), Triton Server repo (https://github.com/triton-inference-server/server)
implements	Principle:Triton_inference_server_Server_TRT_LLM_Server_Launch
2026-02-13 17:00 GMT

Overview

A concrete, TRT-LLM-specific procedure for launching Triton with Docker and the tensorrtllm_backend launch script. It covers Docker container setup, volume mounting, and invocation of the launch_triton_server.py script.

Description

The launch procedure has two phases:

1. Docker container start: start an interactive container from the TRT-LLM Triton image with appropriate GPU, network, and shared memory settings, mounting the model repository.
2. Server launch: inside the container, run the launch_triton_server.py script, which handles MPI process spawning for multi-GPU support.

The launch_triton_server.py script is provided by the tensorrtllm_backend repository and wraps the standard tritonserver binary with MPI coordination logic.
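
To make the MPI coordination more concrete, the sketch below prints the general shape of a multi-rank launch: one tritonserver process per rank, joined with Open MPI's colon-separated syntax. It is an illustration under assumptions, not the script's actual output. The helper name `sketch_mpi_command` is hypothetical, and the real script is expected to pass additional options per rank.

```shell
# Hypothetical helper: print an mpirun command line with one tritonserver
# process per rank. Illustrative only; launch_triton_server.py builds its
# own command and may add further flags.
sketch_mpi_command() (
    world_size="$1"
    model_repo="$2"
    cmd="mpirun --allow-run-as-root"
    rank=0
    while [ "$rank" -lt "$world_size" ]; do
        # Open MPI separates per-rank application contexts with ':'
        if [ "$rank" -gt 0 ]; then
            cmd="$cmd :"
        fi
        cmd="$cmd -n 1 tritonserver --model-repository=$model_repo"
        rank=$((rank + 1))
    done
    printf '%s\n' "$cmd"
)
```

Calling `sketch_mpi_command 2 /opt/all_models/inflight_batcher_llm` prints a single command line with two `-n 1 tritonserver ...` segments separated by ` : `, which is why `--world_size` has to equal the number of ranks the engine was built for.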

Usage

Run this procedure after the model repository has been fully configured with fill_template.py. GPU access inside Docker requires the NVIDIA Container Toolkit (the successor to the older nvidia-docker2 package).
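
One way to confirm that the repository is "fully configured" before starting the container is to look for leftover template placeholders. The sketch below assumes that unfilled values still appear as `${...}` in config.pbtxt files. The helper name `find_unfilled_configs` is hypothetical and is not part of tensorrtllm_backend.

```shell
# Hypothetical preflight check: report config.pbtxt files under a model
# repository that still contain a literal "${" (an unfilled placeholder).
# Exit status is 0 when none are found, non-zero otherwise.
find_unfilled_configs() (
    model_repo="$1"
    # grep -l prints only matching file names; -F treats "${" as a fixed string
    unfilled="$(find "$model_repo" -name config.pbtxt -exec grep -lF '${' {} + || true)"
    if [ -n "$unfilled" ]; then
        printf 'Unfilled placeholders in:\n%s\n' "$unfilled" >&2
    fi
    # Succeed only when no file matched
    [ -z "$unfilled" ]
)
```

For example, `find_unfilled_configs ./all_models/inflight_batcher_llm && echo "repository looks configured"` prints the message only when no placeholder remains.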

Code Reference

Source Location

Item	Value
File	docs/getting_started/llm.md
Lines	L277-288
Repo	https://github.com/triton-inference-server/server
Launch script	scripts/launch_triton_server.py (in tensorrtllm_backend repo)

Signature

# Step 1: Start Docker container
docker run -it --rm \
    --gpus all \
    --network host \
    --shm-size=1g \
    -v /path/to/all_models:/opt/all_models \
    -v /path/to/engines:/opt/engines \
    -v /path/to/tokenizer:/opt/tokenizer \
    nvcr.io/nvidia/tritonserver:24.07-trtllm-python-py3

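# Step 2: Inside the container, launch Triton
# (the relative path assumes the working directory contains the
# tensorrtllm_backend scripts/ directory, e.g. via an additional mount)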
python3 scripts/launch_triton_server.py \
    --model_repo /opt/all_models/inflight_batcher_llm \
    --world_size 1

Import / Verification

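# Verify the server is running: print only the HTTP status code
curl -s -o /dev/null -w "%{http_code}\n" localhost:8000/v2/health/ready
# Should print 200 once the server is ready

# List the models in the repository and their state (model repository extension)
curl -s -X POST localhost:8000/v2/repository/index | python3 -m json.tool
# Should list ensemble, preprocessing, tensorrt_llm, postprocessing models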
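
Engine loading can take a while, so a single readiness probe right after launch may fail. The following is a minimal polling sketch that assumes the default HTTP port. The function names, attempt count, and interval are hypothetical choices, not values from the Triton documentation.

```shell
# Print the HTTP status code of a GET request (000 if the connection fails).
probe_status() {
    curl -s -o /dev/null -w '%{http_code}' "$1" || true
}

# Poll /v2/health/ready until it returns 200 or the attempts run out.
# Arguments: base URL, max attempts, seconds between attempts.
wait_for_ready() (
    base_url="${1:-http://localhost:8000}"
    attempts="${2:-60}"
    interval="${3:-5}"
    i=0
    status=""
    while [ "$i" -lt "$attempts" ]; do
        status="$(probe_status "$base_url/v2/health/ready")"
        if [ "$status" = "200" ]; then
            break
        fi
        i=$((i + 1))
        sleep "$interval"
    done
    # Succeed only if the last probe saw HTTP 200
    [ "$status" = "200" ]
)
```

A typical call would be `wait_for_ready http://localhost:8000 60 5 && echo "server ready"`, which waits up to roughly five minutes.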

I/O Contract

Inputs

Name	Type	Description
`--model_repo`	Directory path	Path to configured model repository inside the container
`--world_size`	Integer	Number of GPU processes (must match `tp_size * pp_size` from engine build)
Docker `--gpus all`	Docker flag	Grants GPU access to the container
Docker `--network host`	Docker flag	Shares the host network stack, so Triton's ports (8000, 8001, 8002) are reachable on the host without `-p` mappings
Docker `--shm-size=1g`	Docker flag	Sets shared memory size for MPI inter-process communication
NGC container image	Docker image	`nvcr.io/nvidia/tritonserver:24.07-trtllm-python-py3`
Configured model repository	Directory	Model repo with populated config.pbtxt files from fill_template.py
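
The `--world_size` constraint in the table above is simple arithmetic, sketched below together with a GPU count check. Both helper names are hypothetical. The GPU count is passed as an argument so the check does not depend on any particular tool being installed.

```shell
# Compute the world size from the parallelism used at engine build time.
world_size() (
    tp_size="$1"
    pp_size="$2"
    echo $((tp_size * pp_size))
)

# Check that enough GPUs are visible for the requested world size.
# Arguments: required process count, number of visible GPUs.
check_gpu_count() (
    required="$1"
    gpu_count="$2"
    if [ "$gpu_count" -lt "$required" ]; then
        echo "need $required GPUs, found $gpu_count" >&2
    fi
    [ "$gpu_count" -ge "$required" ]
)
```

On a host with the NVIDIA driver installed, the visible GPU count could be supplied as `"$(nvidia-smi -L | wc -l)"`, for example `check_gpu_count "$(world_size 2 1)" "$(nvidia-smi -L | wc -l)"`. This assumes one line of `nvidia-smi -L` output per GPU, which may not hold when MIG devices are listed.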

Outputs

Name	Type	Description
HTTP endpoint	Service	Triton HTTP API at `http://localhost:8000`
gRPC endpoint	Service	Triton gRPC API at `localhost:8001`
Metrics endpoint	Service	Prometheus metrics at `http://localhost:8002/metrics`
Model READY status	Log output	All four models (preprocessing, tensorrt_llm, postprocessing, and the ensemble that chains them) report READY
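
Besides reading the log, the READY state of each model can be probed over HTTP through the per-model readiness endpoint `/v2/models/<name>/ready`. The sketch below assumes the four model names listed above. The helper names are hypothetical.

```shell
# Print the HTTP status code of a GET request (000 if the connection fails).
model_ready_status() {
    curl -s -o /dev/null -w '%{http_code}' "$1" || true
}

# Probe each named model and report whether it is ready.
# Arguments: base URL followed by one or more model names.
check_models_ready() (
    base_url="$1"
    shift
    not_ready=""
    for model in "$@"; do
        status="$(model_ready_status "$base_url/v2/models/$model/ready")"
        if [ "$status" = "200" ]; then
            echo "$model: READY"
        else
            echo "$model: not ready (HTTP $status)"
            not_ready="$not_ready $model"
        fi
    done
    # Succeed only when every model answered with HTTP 200
    [ -z "$not_ready" ]
)
```

Usage: `check_models_ready http://localhost:8000 preprocessing tensorrt_llm postprocessing ensemble`.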

Usage Examples

Single-GPU launch

# Start the container with model repo mounted
docker run -it --rm \
    --gpus all \
    --network host \
    --shm-size=1g \
    -v $(pwd)/all_models:/opt/all_models \
    -v $(pwd)/phi-engine:/opt/phi-engine \
    -v $(pwd)/Phi-3-mini-4k-instruct:/opt/Phi-3-mini-4k-instruct \
    nvcr.io/nvidia/tritonserver:24.07-trtllm-python-py3

# Inside the container:
python3 scripts/launch_triton_server.py \
    --model_repo /opt/all_models/inflight_batcher_llm \
    --world_size 1

Multi-GPU launch (2-way tensor parallelism)

# Inside the container:
python3 scripts/launch_triton_server.py \
    --model_repo /opt/all_models/inflight_batcher_llm \
    --world_size 2

Verify server readiness

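# Check health endpoint (prints the HTTP status code; 200 means ready)
curl -s -o /dev/null -w "%{http_code}\n" localhost:8000/v2/health/ready

# Check loaded models (model repository index)
curl -s -X POST localhost:8000/v2/repository/index | python3 -m json.tool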

# Check Prometheus metrics
curl -s localhost:8002/metrics | head -20
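
The raw metrics output is verbose, so a small filter can help when checking that requests are being served. The sketch below assumes samples of the form `nv_inference_request_success{model="...",version="..."} <value>` with no spaces inside the label set. The function name `success_counts` is hypothetical.

```shell
# Read Prometheus text from stdin and print "<model> <value>" for each
# nv_inference_request_success sample.
success_counts() {
    awk '
        /^nv_inference_request_success\{/ {
            # Extract the value of the model="..." label
            if (match($0, /model="[^"]*"/)) {
                name = substr($0, RSTART + 7, RLENGTH - 8)
                print name, $NF
            }
        }
    '
}
```

Usage: `curl -s localhost:8002/metrics | success_counts`.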

Key Parameters

Parameter	Description	Example Value
`--model_repo`	Path to model repository inside container	`/opt/all_models/inflight_batcher_llm`
`--world_size`	Total GPU process count	`1` (single GPU), `2` (2-way TP)
`--gpus all`	Docker flag for GPU access	Required for GPU inference
`--network host`	Docker networking mode	Makes ports 8000, 8001, 8002 reachable on the host without `-p` mappings
`--shm-size`	Docker shared memory size	`1g` (recommended minimum)

Related Pages

Principle:Triton_inference_server_Server_TRT_LLM_Server_Launch
Implementation:Triton_inference_server_Server_Fill_Template — Prerequisite: model repository configuration
Implementation:Triton_inference_server_Server_HTTP_Generate_Endpoint — Client API for sending requests
Implementation:Triton_inference_server_Server_GenAI_Perf — Benchmarking the running server
Environment:Triton_inference_server_Server_TRT_LLM_Deployment
