Environment: Microsoft ONNX Runtime CUDA GPU Environment
| Field | Value |
|---|---|
| sources | setup.py, onnxruntime/__init__.py, docs/ORTModule_Training_Guidelines.md |
| domains | gpu, cuda, training, inference |
| last_updated | 2026-02-10 |
Overview
NVIDIA CUDA-based GPU environment for accelerated ONNX Runtime inference and training, requiring CUDA 11.x/12.x, cuDNN 8.x/9.x, and compatible NVIDIA drivers.
Description
The CUDA GPU Environment extends the base Python Inference Environment with NVIDIA GPU acceleration through the CUDA Execution Provider. It requires an NVIDIA GPU with appropriate drivers, a CUDA toolkit installation (11.x or 12.x), and cuDNN libraries (8.x or 9.x). The environment loads a set of shared libraries at runtime including cublas, cublasLt, cudart, cufft, curand, nvrtc, and nvJitLink. For training workloads, Flash Attention is available on GPUs with compute capability 8.0 or higher (Ampere architecture and newer). The GPU package adds nvidia-cudnn-cu{major}~=9.0 as a dependency to ensure the correct cuDNN version is available. DLL preloading for CUDA 12.x and newer is handled automatically by the __init__.py module. Building from source requires setting CUDA_HOME, CUDNN_HOME, and CUDACXX environment variables.
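The provider-selection behavior described above can be sketched as a small helper: given the providers ONNX Runtime reports as available, prefer the CUDA Execution Provider and fall back to CPU. The helper name choose_providers is illustrative, not part of the ONNX Runtime API:

```python
def choose_providers(available):
    """Return an ordered provider list: CUDA first if present, CPU as fallback."""
    preferred = ["CUDAExecutionProvider", "CPUExecutionProvider"]
    return [p for p in preferred if p in available]

# Typical use with onnxruntime (not imported here):
#   import onnxruntime as ort
#   providers = choose_providers(ort.get_available_providers())
#   session = ort.InferenceSession("model.onnx", providers=providers)
```

Listing the CPU provider last keeps inference working on machines where the CUDA libraries are missing, at the cost of silently running on CPU.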
Usage
Use this environment whenever you need to:
- Run ONNX model inference on NVIDIA GPUs for higher throughput and lower latency.
- Train models using the ORTModule wrapper with CUDA acceleration.
- Leverage Flash Attention or other GPU-optimized operators.
- Perform mixed-precision (FP16/BF16) training or inference on compatible hardware.
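As a rough guide to the mixed-precision point above, the dtypes a GPU can use depend on its compute capability. A hedged sketch, using the common thresholds (FP16 arithmetic from SM 5.3, BF16 from SM 8.0 / Ampere); the exact cutoffs are a simplification for illustration:

```python
def supported_dtypes(sm_major, sm_minor):
    """Rough mapping from CUDA compute capability to usable tensor dtypes."""
    cc = (sm_major, sm_minor)
    dtypes = ["fp32"]          # FP32 works on every supported GPU
    if cc >= (5, 3):
        dtypes.append("fp16")  # half-precision arithmetic
    if cc >= (8, 0):
        dtypes.append("bf16")  # bfloat16 needs Ampere or newer
    return dtypes
```

For example, a T4 (SM 7.5) gets FP32 and FP16, while an A100 (SM 8.0) additionally gets BF16.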
System Requirements
| Requirement | Minimum | Recommended |
|---|---|---|
| NVIDIA GPU | Compute Capability 3.7 | Compute Capability 8.0+ (Ampere/Hopper) |
| CUDA Toolkit | 11.x | 12.x |
| cuDNN | 8.x | 9.x |
| NVIDIA Driver | 470+ (CUDA 11) | 535+ (CUDA 12) |
| Python | 3.10 | 3.12 |
| Operating System | Linux x86_64, Windows x86_64 | Linux x86_64 |
| RAM | 8 GB | 32 GB+ |
| GPU VRAM | 4 GB | 16 GB+ |
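One quick preflight check for the driver requirement above is to query nvidia-smi, which ships with the driver. A minimal sketch that returns None when no driver (or at least no nvidia-smi) is present:

```python
import shutil
import subprocess

def driver_version():
    """Return the NVIDIA driver version reported by nvidia-smi, or None."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return None  # nvidia-smi not on PATH: driver likely not installed
    result = subprocess.run(
        [exe, "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True, text=True,
    )
    lines = result.stdout.strip().splitlines()
    return lines[0] if result.returncode == 0 and lines else None
```

The returned string (e.g. "535.104.05") can then be compared against the 470+/535+ floors in the table.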
Dependencies
System Packages
| Package | Version | Purpose |
|---|---|---|
| CUDA Toolkit | 11.x or 12.x | GPU compute runtime and compiler |
| cuDNN | 8.x or 9.x | Deep neural network acceleration library |
| NVIDIA Driver | 470+ | GPU kernel driver |
The following shared libraries must be available at runtime (setup.py L221-246):
| Library | Purpose |
|---|---|
| libcublas.so / cublas64_*.dll | Basic Linear Algebra Subroutines on GPU |
| libcublasLt.so / cublasLt64_*.dll | Lightweight matrix multiply routines |
| libcudart.so / cudart64_*.dll | CUDA runtime API |
| libcufft.so / cufft64_*.dll | Fast Fourier Transform on GPU |
| libcurand.so / curand64_*.dll | Random number generation on GPU |
| libnvrtc.so / nvrtc64_*.dll | Runtime compilation of CUDA kernels |
| libnvJitLink.so / nvJitLink_*.dll | JIT linking of CUDA device code |
| libcudnn.so.8 or libcudnn.so.9 | cuDNN neural network primitives |
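A quick way to check whether the Linux libraries in this table are resolvable is ctypes.util.find_library, which consults the same loader search paths. A sketch; cuDNN is omitted because find_library takes unversioned stems, not names like libcudnn.so.9:

```python
from ctypes.util import find_library

# Stems of the CUDA shared libraries listed above (lib<stem>.so on Linux).
CUDA_LIB_STEMS = ["cublas", "cublasLt", "cudart", "cufft", "curand", "nvrtc", "nvJitLink"]

def missing_cuda_libs():
    """Return the stems the dynamic loader cannot currently resolve."""
    return [stem for stem in CUDA_LIB_STEMS if find_library(stem) is None]
```

An empty result means the loader can see every library; a non-empty result usually points at a missing toolkit install or an incomplete LD_LIBRARY_PATH.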
Python Packages
| Package | Version Constraint | Purpose |
|---|---|---|
| onnxruntime-gpu | 1.25.0 | GPU-accelerated ONNX Runtime |
| nvidia-cudnn-cu{major} | ~=9.0 | cuDNN Python package (setup.py L856) |
| numpy | >= 1.21.6 | Tensor I/O |
| flatbuffers | (latest) | Model deserialization |
| protobuf | (latest) | ONNX model loading |
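A floor like numpy >= 1.21.6 must be compared numerically, not as strings ("1.9" sorts after "1.21" lexically). A minimal sketch that handles plain X.Y.Z versions only, not pre-release tags:

```python
def meets_floor(installed, floor):
    """True if a plain dotted version string meets a minimum, compared numerically."""
    as_tuple = lambda v: tuple(int(part) for part in v.split("."))
    return as_tuple(installed) >= as_tuple(floor)
```

In real projects, packaging.version.Version is the robust choice; this sketch only shows why plain string comparison would be wrong.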
Credentials
| Variable | Purpose | Required |
|---|---|---|
| CUDA_HOME | Path to CUDA toolkit installation (e.g., /usr/local/cuda-12.2) | Yes (build from source) |
| CUDNN_HOME | Path to cuDNN installation directory | Yes (build from source) |
| CUDACXX | Path to the CUDA C++ compiler (nvcc) | Yes (build from source) |
| ORT_CUDA_UNAVAILABLE | When set, disables CUDA provider registration (setup.py L180) | No |
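Before a source build, it is worth verifying that CUDA_HOME actually contains the nvcc compiler that CUDACXX should point to. A minimal sketch; the helper name is illustrative:

```python
import os
from pathlib import Path

def find_nvcc(cuda_home=None):
    """Return the nvcc path under CUDA_HOME (or an explicit path), else None."""
    home = cuda_home or os.environ.get("CUDA_HOME")
    if home is None:
        return None  # CUDA_HOME not set and no path supplied
    nvcc = Path(home) / "bin" / "nvcc"
    return nvcc if nvcc.is_file() else None
```

If this returns None, the build.sh invocation below will fail early at CMake configuration time.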
Quick Install
Pre-built wheel (recommended):
```shell
pip install onnxruntime-gpu==1.25.0
```
Build from source with CUDA and training support (ORTModule_Training_Guidelines.md L14):
```shell
export CUDA_HOME=/usr/local/cuda-11.8
export CUDNN_HOME=/usr/local/cuda-11.8
export CUDACXX=$CUDA_HOME/bin/nvcc
./build.sh --config RelWithDebInfo \
    --use_cuda \
    --enable_training \
    --build_wheel \
    --skip_tests \
    --cuda_version=11.8 \
    --parallel 8 \
    --use_mpi
```
Code Evidence
```python
# setup.py:221-246
cuda_dependencies = [
    "libcublas.so",
    "libcublasLt.so",
    "libcudart.so",
    "libcufft.so",
    "libcurand.so",
    "libnvrtc.so",
    "libnvJitLink.so",
    "libcudnn.so.8",
    "libcudnn.so.9",
]
```
This list enumerates the shared libraries that the CUDA execution provider attempts to load at runtime, covering both cuDNN 8 and cuDNN 9 for backward compatibility.
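The System Packages table earlier pairs each Linux .so with a Windows DLL name; that mapping can be expressed as a small helper. The DLL suffixes vary by CUDA version, so the names produced below are assumptions for CUDA 12:

```python
import sys

def platform_lib_name(stem, cuda_major=12):
    """Map a CUDA library stem to a platform-specific file name (illustrative)."""
    if sys.platform == "win32":
        return f"{stem}64_{cuda_major}.dll"  # e.g. cudart64_12.dll
    return f"lib{stem}.so"                   # e.g. libcudart.so
```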
DLL preloading for CUDA 12+ (__init__.py:332-337)

```python
# onnxruntime/__init__.py:332-337
# For CUDA 12.x or newer, preload DLLs to ensure they are found
# before the onnxruntime native library is loaded.
```
The initialization module handles preloading of CUDA DLLs on Windows to avoid load-order issues with CUDA 12.x and newer toolkit versions.
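The preloading idea can be illustrated with ctypes: loading each DLL by absolute path before the native onnxruntime library is imported pins the copies the Windows loader will reuse. This is a sketch, not the actual __init__.py code, and the DLL names are assumptions for CUDA 12 / cuDNN 9:

```python
import ctypes
import os
import sys

def preload_cuda_dlls(cuda_bin_dir,
                      names=("cudart64_12.dll", "cublas64_12.dll", "cudnn64_9.dll")):
    """Load CUDA DLLs by absolute path (Windows only); return the names loaded."""
    if sys.platform != "win32":
        return []  # Linux resolves shared libraries via LD_LIBRARY_PATH instead
    loaded = []
    for name in names:
        path = os.path.join(cuda_bin_dir, name)
        if os.path.exists(path):
            ctypes.WinDLL(path)  # the loader caches the handle for later lookups
            loaded.append(name)
    return loaded
```

Once a DLL is loaded this way, subsequent loads by bare name resolve to the already-loaded copy, sidestepping PATH ordering issues.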
Flash Attention compute capability requirement (ORTModule_Training_Guidelines.md:472)

```
# docs/ORTModule_Training_Guidelines.md:472
# Flash Attention requires CUDA device capability 8.0+
# (NVIDIA Ampere architecture: A100, A10, A30, etc.)
```
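The capability gate quoted above reduces to a tuple comparison; a trivial sketch:

```python
def flash_attention_supported(sm_major, sm_minor):
    """Flash Attention needs compute capability 8.0+ (Ampere and newer)."""
    return (sm_major, sm_minor) >= (8, 0)
```

With PyTorch installed, torch.cuda.get_device_capability() returns exactly such a (major, minor) tuple to feed this check.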
Build environment variables (ORTModule_Training_Guidelines.md:10-12)

```shell
# docs/ORTModule_Training_Guidelines.md:10-12
export CUDA_HOME=/usr/local/cuda-11.8
export CUDNN_HOME=/usr/local/cuda-11.8
export CUDACXX=$CUDA_HOME/bin/nvcc
```
Common Errors
| Error | Cause | Solution |
|---|---|---|
| RuntimeError: CUDA error: no kernel image is available for execution on the device | GPU compute capability not supported by the installed CUDA toolkit version | Upgrade the CUDA toolkit or install a wheel built for your GPU architecture |
| OSError: libcudnn.so.9: cannot open shared object file | cuDNN 9 not installed or not on LD_LIBRARY_PATH | Install cuDNN 9 (pip install nvidia-cudnn-cu12~=9.0) or set LD_LIBRARY_PATH |
| OSError: libcublas.so: cannot open shared object file | CUDA toolkit not installed or library path not set | Install the CUDA toolkit and add /usr/local/cuda/lib64 to LD_LIBRARY_PATH |
| RuntimeError: CUDA out of memory | GPU VRAM exhausted by model or batch size | Reduce batch size, enable memory optimization, or use a GPU with more VRAM |
| Flash Attention not available | GPU compute capability below 8.0 | Use an Ampere or newer GPU (A100, A10, RTX 3090, etc.) |
| CUDACXX not set | Building from source without specifying the nvcc path | Set export CUDACXX=/usr/local/cuda/bin/nvcc |
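For the out-of-memory row above, the CUDA Execution Provider accepts per-provider options such as device_id and gpu_mem_limit (in bytes). A hedged sketch that builds such an options dict, to be passed alongside the provider name when creating a session:

```python
def cuda_provider_options(device_id=0, mem_limit_gb=None):
    """Build a CUDAExecutionProvider options dict (gpu_mem_limit is in bytes)."""
    options = {"device_id": device_id}
    if mem_limit_gb is not None:
        options["gpu_mem_limit"] = int(mem_limit_gb * (1 << 30))
    return options

# Typical use (onnxruntime not imported here):
#   session = ort.InferenceSession(
#       "model.onnx",
#       providers=[("CUDAExecutionProvider", cuda_provider_options(0, 8))],
#   )
```

Capping gpu_mem_limit keeps the arena allocator from claiming the whole card, which helps when several processes share one GPU.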
Compatibility Notes
- CUDA 11.x vs 12.x: The onnxruntime-gpu package ships separate wheels for CUDA 11 and CUDA 12. Ensure you install the wheel matching your CUDA toolkit version.
- cuDNN versions: Both cuDNN 8.x and 9.x are supported. The runtime attempts to load libcudnn.so.9 first, falling back to libcudnn.so.8.
- Driver compatibility: CUDA 12.x requires NVIDIA driver 525+ at minimum. For best results, use driver 535 or newer.
- Windows: DLL preloading is handled automatically for CUDA 12.x. Ensure the CUDA bin directory is on the system PATH.
- Linux: Set LD_LIBRARY_PATH to include both the CUDA and cuDNN library directories if they are not in standard locations.
- Flash Attention: Only available on Ampere (SM 80) and newer architectures. Volta and Turing GPUs cannot use Flash Attention.
- Multi-GPU: For multi-GPU training, see the Distributed Training Environment page.