Environment: Triton Inference Server GPU CUDA Runtime
| Knowledge Sources | |
|---|---|
| Domains | Infrastructure, GPU_Computing |
| Last Updated | 2026-02-13 17:00 GMT |
Overview
An NVIDIA GPU environment with the CUDA toolkit for running Triton Inference Server with GPU-accelerated inference. The minimum compute capability is 6.0 when building through build.py (its default) or 7.5 when configuring CMake directly (the CMakeLists.txt default).
Description
This environment provides GPU-accelerated inference for the Triton Inference Server. The server is built on top of the NVIDIA NGC base image and includes the full CUDA toolkit, cuDNN, and TensorRT runtime libraries. GPU support is controlled at build time via the TRITON_ENABLE_GPU CMake flag (ON by default). The minimum CUDA compute capability determines which GPU architectures are supported at runtime.
When GPU support is disabled, the server runs in CPU-only mode with a reduced feature set: GPU metrics and CUDA shared memory are unavailable. Address Sanitizer (ASAN) builds go the other way: they are only permitted when GPU support is disabled.
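The gap between the two default minimums is easy to get wrong when capabilities are compared as strings (for example, "10.0" sorts before "7.5"). A minimal sketch of a numeric comparison follows. The helper names are hypothetical and not part of Triton; the input is assumed to be one capability per line, as recent drivers may report via `nvidia-smi --query-gpu=compute_cap --format=csv,noheader`.

```python
# Sketch: compare GPU compute capabilities against the two default minimums
# named in the text. Helper names are hypothetical, not Triton APIs.

BUILD_PY_DEFAULT_MIN = "6.0"  # build.py --min-compute-capability default
CMAKE_DEFAULT_MIN = "7.5"     # TRITON_MIN_COMPUTE_CAPABILITY default in CMakeLists.txt


def parse_capability(text):
    """Turn a string such as '8.0' into a comparable (major, minor) tuple."""
    major, minor = text.strip().split(".")
    return int(major), int(minor)


def meets_minimum(device_cc, minimum_cc):
    """Return True when the device capability is at least the minimum."""
    return parse_capability(device_cc) >= parse_capability(minimum_cc)


def parse_capability_list(output):
    """Split text with one capability per line, skipping blank lines."""
    return [line.strip() for line in output.splitlines() if line.strip()]


if __name__ == "__main__":
    # Assumed sample input; on a real host this text would come from the driver tools.
    sample = "8.0\n6.1\n"
    for cc in parse_capability_list(sample):
        print(cc,
              "build.py default:", meets_minimum(cc, BUILD_PY_DEFAULT_MIN),
              "CMake default:", meets_minimum(cc, CMAKE_DEFAULT_MIN))
```

Under these assumptions, a device reporting 6.1 satisfies the build.py default but not the CMake default.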
Usage
Use this environment for any inference workload requiring GPU acceleration. This includes all TensorRT, CUDA-based, and GPU-optimized model backends. It is the default runtime for the official Triton container images (nvcr.io/nvidia/tritonserver). CPU-only deployments use a separate container variant (<version>-cpu-only-py3).
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| OS | Ubuntu 22.04 LTS (default container base) | RHEL builds use lib64 library paths |
| Hardware | NVIDIA GPU with compute capability >= 6.0 | Minimum when built with build.py defaults; the CMakeLists.txt default is 7.5. A100 (8.0) or H100 (9.0) recommended for production |
| CUDA | CUDA 12.8+ (current container) | Bundled in NGC base image |
| cuDNN | cuDNN 9.7.1+ | Bundled in NGC base image |
| TensorRT | TensorRT 10.8.0+ | Required for TensorRT backend |
| Disk | 10 GB+ free space | Container image plus model storage |
Dependencies
System Packages
- CUDA Toolkit (bundled in NGC image)
- cuDNN libraries (bundled in NGC image)
- NVIDIA driver compatible with the container's CUDA version
- libcudnn.so.9 (required even for PyTorch CPU-only builds within the GPU container)
Container Images
- GPU build: nvcr.io/nvidia/tritonserver:<version>-py3
- GPU minimal: nvcr.io/nvidia/tritonserver:<version>-py3-min
- CPU-only: nvcr.io/nvidia/tritonserver:<version>-cpu-only-py3
- CPU minimal: ubuntu:22.04
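The tag patterns above differ only in their suffix, so a small helper can assemble the full image reference from a release version. This is an illustrative sketch; the variant labels ("gpu", "gpu-min", "cpu-only") and the function name are hypothetical, while the repository and suffixes are taken from the list above.

```python
# Sketch: build a Triton image reference from a release version and a variant.
# Variant labels are hypothetical shorthand for the tag suffixes listed above.

IMAGE_REPO = "nvcr.io/nvidia/tritonserver"

VARIANT_SUFFIX = {
    "gpu": "py3",
    "gpu-min": "py3-min",
    "cpu-only": "cpu-only-py3",
}


def image_reference(version, variant="gpu"):
    """Return '<repo>:<version>-<suffix>' for a known variant."""
    if variant not in VARIANT_SUFFIX:
        raise ValueError(f"unknown variant: {variant!r}")
    return f"{IMAGE_REPO}:{version}-{VARIANT_SUFFIX[variant]}"


if __name__ == "__main__":
    print(image_reference("26.01"))
    print(image_reference("26.01", "cpu-only"))
```

The CPU minimal base (ubuntu:22.04) is left out because it is a plain Ubuntu image rather than a tritonserver tag.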
Credentials
No credentials are required for the base GPU runtime. Cloud storage backends require additional credentials:
- AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY: For S3 model repository storage
- GOOGLE_APPLICATION_CREDENTIALS: For GCS model repository storage
- AZURE_STORAGE_ACCOUNT / AZURE_STORAGE_KEY: For Azure Blob Storage
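Before launching the server against a cloud model repository, it can help to confirm that the variables listed above are present. The sketch below checks only those variable names; the backend labels and function name are hypothetical, and other credential mechanisms a provider may support are not considered.

```python
# Sketch: report which of the listed credential variables are missing.
# Only the variable names from the list above are checked.
import os

REQUIRED_ENV = {
    "s3": ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY"),
    "gcs": ("GOOGLE_APPLICATION_CREDENTIALS",),
    "azure": ("AZURE_STORAGE_ACCOUNT", "AZURE_STORAGE_KEY"),
}


def missing_credentials(backend, environ=None):
    """Return the required variable names that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_ENV[backend] if not env.get(name)]


if __name__ == "__main__":
    for backend in REQUIRED_ENV:
        missing = missing_credentials(backend)
        status = "ok" if not missing else "missing " + ", ".join(missing)
        print(f"{backend}: {status}")
```

Passing a dictionary as `environ` makes the check usable without touching the real process environment.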
Quick Install
# Pull the official GPU-enabled Triton container
docker pull nvcr.io/nvidia/tritonserver:26.01-py3
# Run Triton with a local model repository
docker run --gpus all --rm -p 8000:8000 -p 8001:8001 -p 8002:8002 \
-v /path/to/model/repository:/models \
nvcr.io/nvidia/tritonserver:26.01-py3 tritonserver --model-repository=/models
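Once the container is running, a readiness probe confirms that the server has finished loading models. The following sketch assumes that port 8000 from the command above is Triton's HTTP endpoint and that it serves the readiness route /v2/health/ready; the function names are hypothetical.

```python
# Sketch: poll the HTTP readiness route of a locally running Triton container.
# Assumes port 8000 is the HTTP endpoint, as mapped in the docker run command.
import urllib.error
import urllib.request


def ready_url(host="localhost", port=8000):
    """Build the readiness URL for the given host and port."""
    return f"http://{host}:{port}/v2/health/ready"


def is_ready(url, timeout=2.0):
    """Return True on HTTP 200; False on any HTTP error or connection failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        return False


if __name__ == "__main__":
    url = ready_url()
    print(url, "ready" if is_ready(url) else "not ready")
```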
Code Evidence
GPU enablement flag from CMakeLists.txt:42:
option(TRITON_ENABLE_GPU "Enable GPU support in server" ON)
Minimum compute capability from CMakeLists.txt:45-46:
set(TRITON_MIN_COMPUTE_CAPABILITY "7.5" CACHE STRING
"The minimum CUDA compute capability supported by Triton" )
Build default minimum compute capability from build.py:2645-2651:
# Default --min-compute-capability: "6.0"
GPU metrics dependency chain from CMakeLists.txt:102-104:
if (TRITON_ENABLE_METRICS_GPU AND NOT TRITON_ENABLE_GPU)
message(FATAL_ERROR "TRITON_ENABLE_METRICS_GPU=ON requires TRITON_ENABLE_GPU=ON")
endif()
ASAN incompatibility from CMakeLists.txt:106-108:
if(TRITON_ENABLE_ASAN AND TRITON_ENABLE_GPU)
message(FATAL_ERROR "TRITON_ENABLE_ASAN=ON requires TRITON_ENABLE_GPU=OFF")
endif()
CUDA conditional compilation from src/shared_memory_manager.h:41-44:
#ifdef TRITON_ENABLE_GPU
#include <cuda.h>
#include <cuda_runtime_api.h>
#endif
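The two CMake guards quoted above can be mirrored in a pre-flight check so that an invalid flag combination is caught before a long build starts. This is a sketch only: the function is hypothetical, and it reproduces just the two conditions and messages shown in the excerpts.

```python
# Sketch: mirror the two FATAL_ERROR guards from the CMakeLists.txt excerpts.
# The function name is hypothetical; messages are copied from the excerpts.


def check_build_flags(enable_gpu, enable_metrics_gpu, enable_asan):
    """Return a list of error messages for invalid flag combinations."""
    errors = []
    if enable_metrics_gpu and not enable_gpu:
        errors.append("TRITON_ENABLE_METRICS_GPU=ON requires TRITON_ENABLE_GPU=ON")
    if enable_asan and enable_gpu:
        errors.append("TRITON_ENABLE_ASAN=ON requires TRITON_ENABLE_GPU=OFF")
    return errors


if __name__ == "__main__":
    # GPU metrics requested on a CPU-only build: one error is reported.
    for message in check_build_flags(enable_gpu=False, enable_metrics_gpu=True, enable_asan=False):
        print("error:", message)
```

An empty list means the combination passes both guards; it does not imply the rest of the build configuration is valid.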
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| CUDA driver version is insufficient | Driver does not support the CUDA version in the container | Update the NVIDIA driver to a version compatible with the container's CUDA version |
| no CUDA-capable device is detected | No GPU visible to the container | Pass --gpus all to docker run, and check that CUDA_VISIBLE_DEVICES, if set, does not hide the GPU |
| TRITON_ENABLE_METRICS_GPU=ON requires TRITON_ENABLE_GPU=ON | Build requested GPU metrics without GPU support | Set TRITON_ENABLE_GPU=ON or disable GPU metrics |
| TRITON_ENABLE_ASAN=ON requires TRITON_ENABLE_GPU=OFF | Address Sanitizer is incompatible with GPU builds | Set TRITON_ENABLE_GPU=OFF when building with ASAN |
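When triaging logs, the table above can be turned into a simple substring lookup. The sketch below is illustrative: the function name is hypothetical, the match strings are shortened forms of the messages in the table, and real log lines may carry extra prefixes or different wording.

```python
# Sketch: map a log line to the suggested fix from the Common Errors table.
# Matching is a case-insensitive substring search; names are hypothetical.

KNOWN_ERRORS = [
    ("CUDA driver version is insufficient",
     "Update the NVIDIA driver to a version compatible with the container CUDA version"),
    ("no CUDA-capable device is detected",
     "Ensure --gpus all is passed to docker run"),
    ("TRITON_ENABLE_METRICS_GPU",
     "Enable TRITON_ENABLE_GPU=ON or disable GPU metrics"),
    ("TRITON_ENABLE_ASAN",
     "Disable GPU support when using ASAN"),
]


def suggest_fix(log_line):
    """Return the first matching suggestion, or None when nothing matches."""
    lowered = log_line.lower()
    for needle, fix in KNOWN_ERRORS:
        if needle.lower() in lowered:
            return fix
    return None


if __name__ == "__main__":
    print(suggest_fix("E0101 failed: no CUDA-capable device is detected"))
```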
Compatibility Notes
- Jetson (JetPack 5.0): GPU and NVDLA execution supported, but CUDA IPC (shared memory) is not supported. GPU metrics, GCS, S3, and Azure storage are also unavailable on Jetson. Python backend does not support GPU Tensors or Async BLS on Jetson.
- Windows: Supported via Windows containers (Dockerfile.win10.min). The device memory tracker is disabled on Windows Docker builds because the required CUDA Windows libraries are missing. OpenTelemetry tracing is not supported on Windows.
- RHEL/CentOS: Libraries install to lib64 instead of lib. TensorRT backend on RHEL SBSA is not yet supported (TPRD-712).
- ARM (aarch64/iGPU): Supported via the TRITON_IGPU_BUILD flag. The device memory tracker is disabled for iGPU builds.
- CPU-only mode: Uses ubuntu:22.04 as base image. No GPU metrics, CUDA shared memory, or GPU-accelerated backends available.