Environment:Triton inference server Server GPU CUDA Runtime

From Leeroopedia
Knowledge Sources
Domains Infrastructure, GPU_Computing
Last Updated 2026-02-13 17:00 GMT

Overview

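NVIDIA GPU environment with the CUDA toolkit for running Triton Inference Server with GPU-accelerated inference. The minimum compute capability is 6.0 when building through build.py (the --min-compute-capability default) and 7.5 when configuring CMake directly (the TRITON_MIN_COMPUTE_CAPABILITY default).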

Description

This environment provides GPU-accelerated inference for the Triton Inference Server. The server is built on top of the NVIDIA NGC base image and includes the full CUDA toolkit, cuDNN, and TensorRT runtime libraries. GPU support is controlled at build time via the TRITON_ENABLE_GPU CMake flag (ON by default). The minimum CUDA compute capability determines which GPU architectures are supported at runtime.

When GPU support is disabled, the server runs in CPU-only mode with a reduced feature set: GPU metrics and CUDA shared memory are unavailable. Address Sanitizer (ASAN) is the reverse case: an ASAN build requires GPU support to be disabled.

Usage

Use this environment for any inference workload requiring GPU acceleration. This includes all TensorRT, CUDA-based, and GPU-optimized model backends. It is the default runtime for the official Triton container images (nvcr.io/nvidia/tritonserver). CPU-only deployments use a separate container variant (<version>-cpu-only-py3).

System Requirements

Category | Requirement | Notes
OS | Ubuntu 22.04 LTS (default container base) | RHEL builds use lib64 library paths
Hardware | NVIDIA GPU with compute capability >= 6.0 | Default build minimum; CMake default is 7.5. A100 (8.0), H100 (9.0) recommended for production
CUDA | CUDA 12.8+ (current container) | Bundled in NGC base image
cuDNN | cuDNN 9.7.1+ | Bundled in NGC base image
TensorRT | TensorRT 10.8.0+ | Required for TensorRT backend
Disk | 10GB+ free space | Container image plus model storage
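
The hardware requirement can be checked on a host before deployment. The following is a minimal sketch, not part of Triton: the function names are hypothetical, and it assumes an NVIDIA driver recent enough for nvidia-smi to expose the compute_cap query field. Capabilities are compared as integer pairs rather than strings, so that "10.0" ranks above "9.0".

```python
import subprocess


def parse_capability(text):
    """Turn a string such as "7.5" into a comparable tuple (7, 5)."""
    major, _, minor = text.strip().partition(".")
    return int(major), int(minor or 0)


def meets_minimum(capability, minimum="6.0"):
    """Numeric comparison; the default mirrors the build.py minimum."""
    return parse_capability(capability) >= parse_capability(minimum)


def parse_smi_output(text):
    """Parse 'name, compute_cap' CSV lines into (name, capability) pairs."""
    pairs = []
    for line in text.splitlines():
        if line.strip():
            # Split on the last comma in case the GPU name contains one.
            name, _, cap = line.rpartition(",")
            pairs.append((name.strip(), cap.strip()))
    return pairs


def query_gpus():
    """Ask nvidia-smi for every visible GPU (assumes compute_cap is supported)."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,compute_cap", "--format=csv,noheader"],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_smi_output(result.stdout)
```

Calling meets_minimum(cap, "7.5") for each pair returned by query_gpus() checks a host against the CMake default instead of the build.py default.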

Dependencies

System Packages

  • CUDA Toolkit (bundled in NGC image)
  • cuDNN libraries (bundled in NGC image)
  • NVIDIA driver compatible with CUDA version
  • libcudnn.so.9 (required even for PyTorch CPU-only builds within GPU container)

Container Images

  • GPU build: nvcr.io/nvidia/tritonserver:<version>-py3
  • GPU minimal: nvcr.io/nvidia/tritonserver:<version>-py3-min
  • CPU-only: nvcr.io/nvidia/tritonserver:<version>-cpu-only-py3
  • CPU minimal: ubuntu:22.04
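
The tag patterns above can be assembled programmatically, which helps keep deployment scripts consistent. This is a small sketch only: the variant labels ("gpu", "gpu-min", "cpu-only") are hypothetical names chosen for illustration, while the registry path and suffixes are copied from the list.

```python
REGISTRY = "nvcr.io/nvidia/tritonserver"

# Hypothetical variant labels mapped to the tag suffixes listed above.
SUFFIXES = {
    "gpu": "py3",
    "gpu-min": "py3-min",
    "cpu-only": "cpu-only-py3",
}


def image_ref(version, variant="gpu"):
    """Build a full image reference such as <registry>:<version>-py3."""
    if variant not in SUFFIXES:
        raise ValueError(f"unknown variant {variant!r}; expected one of {sorted(SUFFIXES)}")
    return f"{REGISTRY}:{version}-{SUFFIXES[variant]}"
```

The CPU minimal entry is left out because it is a plain ubuntu:22.04 base image rather than a tritonserver tag.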

Credentials

No credentials are required for the base GPU runtime. Cloud storage backends require additional credentials:

  • AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY: For S3 model repository storage
  • GOOGLE_APPLICATION_CREDENTIALS: For GCS model repository storage
  • AZURE_STORAGE_ACCOUNT / AZURE_STORAGE_KEY: For Azure Blob Storage
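
One way to hand these variables to the container is docker run's -e NAME form, which copies the value from the host environment without writing the secret on the command line. The helper below is a hedged sketch: the backend labels and the function name are hypothetical, and only the variable names come from the list above.

```python
import os

# Hypothetical backend labels mapped to the variables listed above.
CREDENTIAL_VARS = {
    "s3": ["AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY"],
    "gcs": ["GOOGLE_APPLICATION_CREDENTIALS"],
    "azure": ["AZURE_STORAGE_ACCOUNT", "AZURE_STORAGE_KEY"],
}


def credential_flags(backend, environ=None):
    """Return docker '-e NAME' flags, failing early if a variable is unset."""
    environ = os.environ if environ is None else environ
    names = CREDENTIAL_VARS[backend]
    missing = [name for name in names if not environ.get(name)]
    if missing:
        raise ValueError(f"missing credentials for {backend}: {', '.join(missing)}")
    flags = []
    for name in names:
        # '-e NAME' without '=value' passes the host's value through.
        flags += ["-e", name]
    return flags
```

GOOGLE_APPLICATION_CREDENTIALS holds a path to a key file, so that file must also be mounted into the container at the same path for the variable to be useful.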

Quick Install

# Pull the official GPU-enabled Triton container
docker pull nvcr.io/nvidia/tritonserver:26.01-py3

# Run Triton with a local model repository
docker run --gpus all --rm -p 8000:8000 -p 8001:8001 -p 8002:8002 \
  -v /path/to/model/repository:/models \
  nvcr.io/nvidia/tritonserver:26.01-py3 tritonserver --model-repository=/models
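
Once the container is running, readiness can be polled over HTTP. By default Triton serves HTTP on port 8000, gRPC on 8001, and Prometheus metrics on 8002, and the HTTP endpoint exposes /v2/health/ready. The sketch below uses only the Python standard library; the function names, retry count, and delay are illustrative assumptions.

```python
import time
import urllib.request


def ready_url(host="localhost", port=8000):
    """URL of the readiness endpoint on Triton's HTTP port."""
    return f"http://{host}:{port}/v2/health/ready"


def is_ready(host="localhost", port=8000, timeout=2.0):
    """True if the server answers the readiness probe with HTTP 200."""
    try:
        with urllib.request.urlopen(ready_url(host, port), timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # Covers refused connections, timeouts, and HTTP error statuses.
        return False


def wait_until_ready(attempts=30, delay=2.0, host="localhost", port=8000):
    """Poll until the server is ready or the attempts run out."""
    for _ in range(attempts):
        if is_ready(host, port):
            return True
        time.sleep(delay)
    return False
```

Loading large models can take a while, so the attempt count and delay should be sized to the model repository rather than taken as fixed values.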

Code Evidence

GPU enablement flag from CMakeLists.txt:42:

option(TRITON_ENABLE_GPU "Enable GPU support in server" ON)

Minimum compute capability from CMakeLists.txt:45-46:

set(TRITON_MIN_COMPUTE_CAPABILITY "7.5" CACHE STRING
    "The minimum CUDA compute capability supported by Triton" )

Build default minimum compute capability from build.py:2645-2651:

# Default --min-compute-capability: "6.0"

GPU metrics dependency chain from CMakeLists.txt:102-104:

if (TRITON_ENABLE_METRICS_GPU AND NOT TRITON_ENABLE_GPU)
  message(FATAL_ERROR "TRITON_ENABLE_METRICS_GPU=ON requires TRITON_ENABLE_GPU=ON")
endif()

ASAN incompatibility from CMakeLists.txt:106-108:

if(TRITON_ENABLE_ASAN AND TRITON_ENABLE_GPU)
  message(FATAL_ERROR "TRITON_ENABLE_ASAN=ON requires TRITON_ENABLE_GPU=OFF")
endif()
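
The two CMake checks above can be mirrored in a pre-flight script so that an invalid flag combination is caught before a long configure step starts. This is an illustrative sketch, not code from the Triton repository: the function name is hypothetical, and the defaults for the metrics and ASAN parameters are assumptions (only TRITON_ENABLE_GPU=ON is confirmed as a default above).

```python
def check_build_flags(enable_gpu=True, enable_metrics_gpu=False, enable_asan=False):
    """Return the error messages CMake would raise for these flag values."""
    errors = []
    # GPU metrics depend on GPU support being compiled in.
    if enable_metrics_gpu and not enable_gpu:
        errors.append("TRITON_ENABLE_METRICS_GPU=ON requires TRITON_ENABLE_GPU=ON")
    # Address Sanitizer builds are rejected when GPU support is enabled.
    if enable_asan and enable_gpu:
        errors.append("TRITON_ENABLE_ASAN=ON requires TRITON_ENABLE_GPU=OFF")
    return errors
```

An empty list means the combination passes both checks; for example, an ASAN build is only accepted with enable_gpu=False.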

CUDA conditional compilation from src/shared_memory_manager.h:41-44:

#ifdef TRITON_ENABLE_GPU
#include <cuda.h>
#include <cuda_runtime_api.h>
#endif

Common Errors

Error Message | Cause | Solution
CUDA driver version is insufficient | Driver does not support the CUDA version in the container | Update the NVIDIA driver to a version compatible with the container's CUDA version
no CUDA-capable device is detected | No GPU visible to the container | Pass --gpus all to docker run, and check that CUDA_VISIBLE_DEVICES, if set, lists a valid device
TRITON_ENABLE_METRICS_GPU=ON requires TRITON_ENABLE_GPU=ON | Build requested GPU metrics without GPU support | Set TRITON_ENABLE_GPU=ON or disable GPU metrics
TRITON_ENABLE_ASAN=ON requires TRITON_ENABLE_GPU=OFF | Address Sanitizer is incompatible with GPU builds | Set TRITON_ENABLE_GPU=OFF when building with ASAN
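
The table above can double as a lookup for triaging build or container logs. The following sketch is a hypothetical helper that searches log text for distinctive fragments of each message. It assumes the messages appear in the log in the form shown here, which may vary between versions.

```python
# Distinctive message fragments paired with the matching solution.
KNOWN_ERRORS = [
    ("CUDA driver version is insufficient",
     "Update the NVIDIA driver to match the container's CUDA version."),
    ("no CUDA-capable device is detected",
     "Pass --gpus all to docker run and check CUDA_VISIBLE_DEVICES."),
    ("TRITON_ENABLE_METRICS_GPU",
     "Set TRITON_ENABLE_GPU=ON or disable GPU metrics."),
    ("TRITON_ENABLE_ASAN",
     "Set TRITON_ENABLE_GPU=OFF when building with ASAN."),
]


def diagnose(log_text):
    """Return the solutions whose message fragment occurs in the log text."""
    lowered = log_text.lower()
    return [hint for fragment, hint in KNOWN_ERRORS if fragment.lower() in lowered]
```

Matching is case-insensitive and substring-based, so it returns every hint that applies rather than stopping at the first match.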

Compatibility Notes

  • Jetson (JetPack 5.0): GPU and NVDLA execution supported, but CUDA IPC (shared memory) is not supported. GPU metrics, GCS, S3, and Azure storage are also unavailable on Jetson. Python backend does not support GPU Tensors or Async BLS on Jetson.
  • Windows: Supported via Windows containers (Dockerfile.win10.min). Device memory tracker is disabled on Windows Docker builds due to missing CUDA Windows libraries. OpenTelemetry tracing is not supported on Windows.
  • RHEL/CentOS: Libraries install to lib64 instead of lib. TensorRT backend on RHEL SBSA is not yet supported (TPRD-712).
  • ARM (aarch64/iGPU): Supported via TRITON_IGPU_BUILD flag. Device memory tracker disabled for iGPU builds.
  • CPU-only mode: Uses ubuntu:22.04 as base image. No GPU metrics, CUDA shared memory, or GPU-accelerated backends available.
