Overview
This page describes the TRT-LLM-specific procedure for launching Triton Inference Server with Docker and the launch script from the tensorrtllm_backend repository. It covers Docker container setup, volume mounting, and invoking launch_triton_server.py.
Description
The launch procedure has two phases:
- Docker container start: Start an interactive container from the TRT-LLM Triton image with GPU access, host networking, and an enlarged shared memory segment, mounting the model repository, engines, and tokenizer.
- Server launch: Inside the container, run the launch_triton_server.py script, which spawns the MPI processes needed for multi-GPU support.
The launch_triton_server.py script is provided by the tensorrtllm_backend repository and wraps the standard tritonserver binary with MPI coordination logic.
Usage
Run this procedure after the model repository has been fully configured with fill_template.py. GPU access inside Docker requires the NVIDIA Container Toolkit (formerly packaged as nvidia-docker2).
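Before launching, it can help to check what fill_template.py left behind. The sketch below assumes the config.pbtxt templates use ${name}-style placeholders; the function name list_unfilled_placeholders is hypothetical and not part of tensorrtllm_backend. It only lists candidate lines for review. Whether a leftover placeholder is a problem depends on the parameter, since some optional ones may be intended to stay unset.

```shell
# Hypothetical helper, not part of tensorrtllm_backend.
# Prints every config.pbtxt line that still contains a "${" placeholder.
# Returns 0 when none are left, 1 when some remain, 2 on a bad path.
list_unfilled_placeholders() {
  local repo="$1"
  if [ ! -d "$repo" ]; then
    echo "not a directory: $repo" >&2
    return 2
  fi
  local hits
  # -n: show line numbers, -F: match the two characters "${" literally.
  # The extra /dev/null argument makes grep print the file name on each hit.
  hits=$(find "$repo" -name config.pbtxt -exec grep -nF '${' {} /dev/null \;)
  if [ -n "$hits" ]; then
    printf '%s\n' "$hits"
    return 1
  fi
  echo "no placeholders left in $repo"
}

# Example (the path is an assumption; use your own model repository):
# list_unfilled_placeholders /path/to/all_models/inflight_batcher_llm
```

Each hit is printed as file:line:content, so the output can be reviewed line by line before the container is started.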
Code Reference
Signature
# Step 1: Start Docker container
docker run -it --rm \
  --gpus all \
  --network host \
  --shm-size=1g \
  -v /path/to/all_models:/opt/all_models \
  -v /path/to/engines:/opt/engines \
  -v /path/to/tokenizer:/opt/tokenizer \
  nvcr.io/nvidia/tritonserver:24.07-trtllm-python-py3
# Step 2: Inside the container, launch Triton
python3 scripts/launch_triton_server.py \
  --model_repo /opt/all_models/inflight_batcher_llm \
  --world_size 1
Import / Verification
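# Verify the server is running; -w prints only the HTTP status code
curl -s -o /dev/null -w "%{http_code}\n" localhost:8000/v2/health/ready
# Should print 200 when the server is ready
# List the models in the repository and their state
curl -s -X POST localhost:8000/v2/repository/index | python3 -m json.tool
# Should list the ensemble, preprocessing, tensorrt_llm, and postprocessing models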
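Model loading can take a while, so a single readiness check right after launch may fail even when nothing is wrong. A minimal polling sketch follows; the function name wait_for_ready and its default timeout and interval are assumptions chosen for illustration, not values from the source.

```shell
# Hypothetical helper: poll a readiness URL until it returns HTTP 200.
# Usage: wait_for_ready [url] [timeout_seconds] [interval_seconds]
wait_for_ready() {
  local url="${1:-http://localhost:8000/v2/health/ready}"
  local timeout="${2:-300}"
  local interval="${3:-5}"
  local waited=0 code=""
  while [ "$waited" -lt "$timeout" ]; do
    # -o /dev/null discards the body; -w prints only the status code.
    # "|| true" keeps the loop going when the connection is refused.
    code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$url" || true)
    if [ "$code" = "200" ]; then
      # Approximate: counts only the time spent sleeping between probes.
      echo "ready after about ${waited}s"
      return 0
    fi
    sleep "$interval"
    waited=$((waited + interval))
  done
  echo "not ready after ${timeout}s (last status: ${code:-none})" >&2
  return 1
}

# Example: wait up to 10 minutes, probing every 10 seconds
# wait_for_ready http://localhost:8000/v2/health/ready 600 10
```

The function returns a non-zero status on timeout, so it can gate later steps in a script, for example with `wait_for_ready && echo "server is up"`.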
I/O Contract
Inputs
| Name | Type | Description |
| --- | --- | --- |
| --model_repo | Directory path | Path to configured model repository inside the container |
| --world_size | Integer | Number of GPU processes (must match tp_size * pp_size from engine build) |
| Docker --gpus all | Docker flag | Grants GPU access to the container |
| Docker --network host | Docker flag | Uses host networking for port exposure (8000, 8001, 8002) |
| Docker --shm-size=1g | Docker flag | Sets shared memory size for MPI inter-process communication |
| NGC container image | Docker image | nvcr.io/nvidia/tritonserver:24.07-trtllm-python-py3 |
| Configured model repository | Directory | Model repo with populated config.pbtxt files from fill_template.py |
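The --world_size constraint in the table above is simple arithmetic: the product of the tensor-parallel and pipeline-parallel sizes used at engine build time. A small sketch, with a hypothetical helper name and the assumption that each process needs its own GPU:

```shell
# Hypothetical helper: compute --world_size from the parallelism
# settings used when the engine was built (pp_size defaults to 1).
world_size() {
  local tp_size="$1" pp_size="${2:-1}"
  echo $((tp_size * pp_size))
}

# Hypothetical helper: fail when the world size exceeds the number of
# available GPUs (assumes one GPU per process).
check_world_size() {
  local needed="$1" gpus="$2"
  if [ "$needed" -gt "$gpus" ]; then
    echo "world size $needed exceeds $gpus available GPU(s)" >&2
    return 1
  fi
}

# Example: 2-way tensor parallelism, 2-way pipeline parallelism
# world_size 2 2    # prints 4
# check_world_size "$(world_size 2 2)" "$(nvidia-smi -L | wc -l)"
```

The result of world_size is the value to pass as --world_size to launch_triton_server.py.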
Outputs
| Name | Type | Description |
| --- | --- | --- |
| HTTP endpoint | Service | Triton HTTP API at http://localhost:8000 |
| gRPC endpoint | Service | Triton gRPC API at localhost:8001 |
| Metrics endpoint | Service | Prometheus metrics at http://localhost:8002/metrics |
| Model READY status | Log output | All four models (preprocessing, tensorrt_llm, postprocessing, ensemble) report READY |
Usage Examples
Single-GPU launch
# Start the container with the model repo, engine, and tokenizer mounted
# (the paths are quoted so that directories containing spaces still work)
docker run -it --rm \
  --gpus all \
  --network host \
  --shm-size=1g \
  -v "$(pwd)/all_models":/opt/all_models \
  -v "$(pwd)/phi-engine":/opt/phi-engine \
  -v "$(pwd)/Phi-3-mini-4k-instruct":/opt/Phi-3-mini-4k-instruct \
  nvcr.io/nvidia/tritonserver:24.07-trtllm-python-py3
# Inside the container:
python3 scripts/launch_triton_server.py \
  --model_repo /opt/all_models/inflight_batcher_llm \
  --world_size 1
Multi-GPU launch (2-way tensor parallelism)
# Inside the container:
python3 scripts/launch_triton_server.py \
  --model_repo /opt/all_models/inflight_batcher_llm \
  --world_size 2
Verify server readiness
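# Check the health endpoint; prints 200 when the server is ready
curl -s -o /dev/null -w "%{http_code}\n" localhost:8000/v2/health/ready
# List the models in the repository and their state
curl -s -X POST localhost:8000/v2/repository/index | python3 -m json.tool
# Check Prometheus metrics
curl -s localhost:8002/metrics | head -20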
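The first 20 lines of the metrics output are mostly HELP and TYPE comments, so it is often more useful to pull out a single metric. The sketch below filters Prometheus text-format input for one metric name; the helper name metric_samples is hypothetical, and nv_inference_request_success is used only as an example of a metric Triton is expected to expose.

```shell
# Hypothetical helper: print the sample lines for one metric from
# Prometheus text-format input on stdin, skipping "# HELP" / "# TYPE" lines.
metric_samples() {
  local name="$1"
  # A sample line starts with the metric name, followed by "{" (labels)
  # or a space (no labels). "|| true" avoids a failure status on no match.
  grep -E "^${name}(\{| )" || true
}

# Example (requires a running server):
# curl -s localhost:8002/metrics | metric_samples nv_inference_request_success
```

Anchoring the match at the start of the line keeps metrics that merely share a prefix, or mention the name in a comment, out of the result.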
Key Parameters
| Parameter | Description | Example Value |
| --- | --- | --- |
| --model_repo | Path to model repository inside container | /opt/all_models/inflight_batcher_llm |
| --world_size | Total GPU process count | 1 (single GPU), 2 (2-way TP) |
| --gpus all | Docker flag for GPU access | Required for GPU inference |
| --network host | Docker networking mode | Exposes ports 8000, 8001, 8002 |
| --shm-size | Docker shared memory size | 1g (recommended minimum) |