Implementation:Triton inference server Server Launch Triton Server Script

From Leeroopedia

Metadata

| Field | Value |
| --- | --- |
| Type | Implementation |
| Workflow | LLM_Deployment_With_TRT_LLM |
| Repo | Triton_inference_server_Server |
| Source | docs/getting_started/llm.md:L277-288 |
| Domains | MLOps, NLP, GPU_Computing |
| Knowledge_Sources | TRT-LLM Docs (https://nvidia.github.io/TensorRT-LLM/); Triton Server repo (https://github.com/triton-inference-server/server) |
| Implements | Principle:Triton_inference_server_Server_TRT_LLM_Server_Launch |
2026-02-13 17:00 GMT

Overview

A concrete, TRT-LLM-specific procedure for launching Triton Inference Server using Docker and the launch script from the tensorrtllm_backend repository. It covers Docker container setup, volume mounting, and invocation of the launch_triton_server.py script.

Description

The launch procedure has two phases:

  1. Docker container start: start an interactive container from the TRT-LLM Triton image with GPU access, host networking, and an enlarged shared memory segment, mounting the model repository, engines, and tokenizer.
  2. Server launch: inside the container, run the launch_triton_server.py script, which spawns the MPI processes needed for multi-GPU support.

The launch_triton_server.py script is provided by the tensorrtllm_backend repository and wraps the standard tritonserver binary with MPI coordination logic.
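To make the MPI coordination more concrete, the sketch below assembles the general shape of command line such a wrapper could produce: one tritonserver process per rank, joined with mpirun's colon-separated syntax. The function name build_mpi_cmd is hypothetical, and the flags the real launch_triton_server.py passes may differ between releases, so treat this as an illustration rather than the script's actual output.

```shell
# Hypothetical helper: print an mpirun command that starts one tritonserver
# process per rank. This approximates the idea behind the launch script;
# it is NOT copied from launch_triton_server.py.
build_mpi_cmd() {
    local world_size model_repo cmd rank
    world_size=$1
    model_repo=$2
    cmd="mpirun --allow-run-as-root"
    rank=0
    while [ "$rank" -lt "$world_size" ]; do
        # Each rank gets its own tritonserver instance on the same repository.
        cmd="$cmd -n 1 tritonserver --model-repository=$model_repo"
        # mpirun separates per-rank program specifications with " :".
        if [ "$rank" -lt $((world_size - 1)) ]; then
            cmd="$cmd :"
        fi
        rank=$((rank + 1))
    done
    printf '%s\n' "$cmd"
}

# Show the command shape for a two-rank launch.
build_mpi_cmd 2 /opt/all_models/inflight_batcher_llm
```

With a world size of 1 the result is a single tritonserver invocation under mpirun; each additional rank appends another program specification after a colon.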

Usage

Run this procedure after the model repository has been fully configured with fill_template.py. GPU access inside Docker requires the NVIDIA Container Toolkit (the successor to nvidia-docker2).
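A launch against a partially configured repository tends to fail only when the models load, so a quick pre-flight check can save a restart. The sketch below assumes that values not yet filled in still appear as ${...} placeholders in each config.pbtxt; the function name check_model_repo is made up for this example.

```shell
# Hypothetical pre-flight check: report config.pbtxt files that still contain
# "${" placeholders, on the assumption that fill_template.py replaces them.
# Returns 0 if all files look filled in, 1 if placeholders remain,
# and 2 if no config.pbtxt files were found at all.
check_model_repo() {
    local repo cfg status
    repo=$1
    status=0
    for cfg in "$repo"/*/config.pbtxt; do
        # If the glob matched nothing, the literal pattern is left in $cfg.
        if [ ! -f "$cfg" ]; then
            echo "no config.pbtxt files found under $repo" >&2
            return 2
        fi
        # -F treats '${' as a fixed string rather than a regular expression.
        if grep -qF '${' "$cfg"; then
            echo "unfilled placeholder in $cfg" >&2
            status=1
        fi
    done
    return "$status"
}
```

A possible call before launching would be: check_model_repo /opt/all_models/inflight_batcher_llm && echo "model repository looks configured".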

Code Reference

Source Location

| Item | Value |
| --- | --- |
| File | docs/getting_started/llm.md |
| Lines | L277-288 |
| Repo | https://github.com/triton-inference-server/server |
| Launch script | scripts/launch_triton_server.py (in tensorrtllm_backend repo) |

Signature

# Step 1: Start Docker container
docker run -it --rm \
    --gpus all \
    --network host \
    --shm-size=1g \
    -v /path/to/all_models:/opt/all_models \
    -v /path/to/engines:/opt/engines \
    -v /path/to/tokenizer:/opt/tokenizer \
    nvcr.io/nvidia/tritonserver:24.07-trtllm-python-py3

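# Step 2: Inside the container, launch Triton.
# The relative path assumes the current directory contains the scripts/
# directory from the tensorrtllm_backend repo (for example, via a bind mount).
python3 scripts/launch_triton_server.py \
    --model_repo /opt/all_models/inflight_batcher_llm \
    --world_size 1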

Import / Verification

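# Verify the server is running: print only the HTTP status code
curl -s -o /dev/null -w "%{http_code}\n" localhost:8000/v2/health/ready
# Should print 200

# List the models in the repository and their load state
curl -s -X POST localhost:8000/v2/repository/index | python3 -m json.tool
# Should list the ensemble, preprocessing, tensorrt_llm, and postprocessing models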
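Large engines can take a while to load, so a single readiness probe right after launch may fail even though the server is healthy. The following is a minimal polling sketch around the readiness endpoint; the helper names probe_status and wait_for_ready, along with the default attempt count and delay, are assumptions chosen for illustration.

```shell
# Print the HTTP status code of a GET request ("000" if the connection fails).
probe_status() {
    curl -s -o /dev/null -w '%{http_code}' "$1" || true
}

# Hypothetical helper: poll a readiness URL until it answers with 200.
# Usage: wait_for_ready URL [ATTEMPTS] [DELAY_SECONDS]
wait_for_ready() {
    local url attempts delay i
    url=$1
    attempts=${2:-60}
    delay=${3:-5}
    i=0
    while [ "$i" -lt "$attempts" ]; do
        if [ "$(probe_status "$url")" = "200" ]; then
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    # The endpoint never reported ready within the allotted attempts.
    return 1
}
```

For example, wait_for_ready localhost:8000/v2/health/ready 60 5 && echo "Triton is ready" would wait up to roughly five minutes before giving up.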

I/O Contract

Inputs

| Name | Type | Description |
| --- | --- | --- |
| --model_repo | Directory path | Path to the configured model repository inside the container |
| --world_size | Integer | Number of GPU processes (must match tp_size * pp_size from the engine build) |
| Docker --gpus all | Docker flag | Grants GPU access to the container |
| Docker --network host | Docker flag | Uses host networking, so ports 8000, 8001, and 8002 are reachable directly on the host |
| Docker --shm-size=1g | Docker flag | Sets the shared memory size for MPI inter-process communication |
| NGC container image | Docker image | nvcr.io/nvidia/tritonserver:24.07-trtllm-python-py3 |
| Configured model repository | Directory | Model repo with config.pbtxt files populated by fill_template.py |
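The constraint on --world_size is plain arithmetic: the tensor-parallel size multiplied by the pipeline-parallel size chosen at engine build time. As a small worked example (the function name expected_world_size is hypothetical):

```shell
# Hypothetical helper: compute the --world_size value from the tensor-parallel
# (tp) and pipeline-parallel (pp) sizes used when the engine was built.
# pp defaults to 1 when omitted.
expected_world_size() {
    local tp pp
    tp=$1
    pp=${2:-1}
    echo $((tp * pp))
}

expected_world_size 2 1   # 2-way TP, no pipeline parallelism: prints 2
expected_world_size 2 2   # 2-way TP combined with 2-way PP: prints 4
```

Passing a value that does not equal this product means the number of launched processes will not match what the engine was built for.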

Outputs

| Name | Type | Description |
| --- | --- | --- |
| HTTP endpoint | Service | Triton HTTP API at http://localhost:8000 |
| gRPC endpoint | Service | Triton gRPC API at localhost:8001 |
| Metrics endpoint | Service | Prometheus metrics at http://localhost:8002/metrics |
| Model READY status | Log output | All four models (preprocessing, tensorrt_llm, postprocessing, ensemble) report READY |

Usage Examples

Single-GPU launch

# Start the container with model repo mounted
docker run -it --rm \
    --gpus all \
    --network host \
    --shm-size=1g \
    -v $(pwd)/all_models:/opt/all_models \
    -v $(pwd)/phi-engine:/opt/phi-engine \
    -v $(pwd)/Phi-3-mini-4k-instruct:/opt/Phi-3-mini-4k-instruct \
    nvcr.io/nvidia/tritonserver:24.07-trtllm-python-py3

# Inside the container:
python3 scripts/launch_triton_server.py \
    --model_repo /opt/all_models/inflight_batcher_llm \
    --world_size 1

Multi-GPU launch (2-way tensor parallelism)

# Inside the container:
python3 scripts/launch_triton_server.py \
    --model_repo /opt/all_models/inflight_batcher_llm \
    --world_size 2
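Before a multi-GPU launch it can be worth confirming that the container actually sees as many GPUs as the world size requires. The guard below is a sketch: enough_gpus is a hypothetical name, and the GPU count is passed in as an argument so the check itself has no hardware dependency.

```shell
# Hypothetical guard: succeed only if at least WORLD_SIZE GPUs are available.
# Usage: enough_gpus WORLD_SIZE GPU_COUNT
enough_gpus() {
    local world_size gpu_count
    world_size=$1
    gpu_count=$2
    [ "$gpu_count" -ge "$world_size" ]
}

# Example wiring (commented out because it needs nvidia-smi in the container;
# "nvidia-smi -L" prints one line per GPU on a typical non-MIG setup):
# gpus=$(nvidia-smi -L | wc -l)
# enough_gpus 2 "$gpus" || echo "need 2 GPUs, found $gpus" >&2
```

Running the guard first turns a missing-GPU problem into an immediate, readable message instead of an MPI startup failure.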

Verify server readiness

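# Check the health endpoint (prints the HTTP status code; 200 means ready)
curl -s -o /dev/null -w "%{http_code}\n" localhost:8000/v2/health/ready

# Check loaded models
curl -s -X POST localhost:8000/v2/repository/index | python3 -m json.tool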

# Check Prometheus metrics
curl -s localhost:8002/metrics | head -20
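The first twenty lines of the metrics output are mostly HELP and TYPE comments, so filtering for one metric is often more useful. Here is a small filter sketch; metric_samples is a hypothetical helper name, and nv_inference_request_success is used as the example metric.

```shell
# Hypothetical filter: read Prometheus text-format metrics on stdin and keep
# only the sample lines of the named metric, skipping "# HELP" and "# TYPE"
# comments as well as other metrics that merely share the same prefix.
metric_samples() {
    grep "^$1[{ ]"
}
```

A possible invocation is curl -s localhost:8002/metrics | metric_samples nv_inference_request_success, which prints one line per model and version.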

Key Parameters

| Parameter | Description | Example Value / Notes |
| --- | --- | --- |
| --model_repo | Path to the model repository inside the container | /opt/all_models/inflight_batcher_llm |
| --world_size | Total GPU process count | 1 (single GPU), 2 (2-way TP) |
| --gpus all | Docker flag for GPU access | Required for GPU inference |
| --network host | Docker networking mode | Makes ports 8000, 8001, and 8002 reachable on the host |
| --shm-size | Docker shared memory size | 1g (recommended minimum) |
