Environment: Microsoft ONNX Runtime CUDA GPU Environment
| Field | Value |
|---|---|
| sources | setup.py, onnxruntime/__init__.py, docs/ORTModule_Training_Guidelines.md |
| domains | gpu, cuda, training, inference |
| last_updated | 2026-02-10 |
Overview
NVIDIA CUDA-based GPU environment for accelerated ONNX Runtime inference and training, requiring CUDA 11.x/12.x, cuDNN 8.x/9.x, and compatible NVIDIA drivers.
Description
The CUDA GPU Environment extends the base Python Inference Environment with NVIDIA GPU acceleration through the CUDA Execution Provider. It requires an NVIDIA GPU with appropriate drivers, a CUDA toolkit installation (11.x or 12.x), and cuDNN libraries (8.x or 9.x). The environment loads a set of shared libraries at runtime including cublas, cublasLt, cudart, cufft, curand, nvrtc, and nvJitLink. For training workloads, Flash Attention is available on GPUs with compute capability 8.0 or higher (Ampere architecture and newer). The GPU package adds nvidia-cudnn-cu{major}~=9.0 as a dependency to ensure the correct cuDNN version is available. DLL preloading for CUDA 12.x and newer is handled automatically by the __init__.py module. Building from source requires setting CUDA_HOME, CUDNN_HOME, and CUDACXX environment variables.
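The provider-selection behavior described above can be sketched as a small helper: given the providers ONNX Runtime reports as available, prefer the CUDA Execution Provider and fall back to CPU. The helper name choose_providers is illustrative, not part of the ONNX Runtime API:

```python
def choose_providers(available):
    """Return an ordered provider list: CUDA first if present, CPU as fallback."""
    preferred = ["CUDAExecutionProvider", "CPUExecutionProvider"]
    return [p for p in preferred if p in available]

# Typical use with onnxruntime (not imported here):
#   import onnxruntime as ort
#   providers = choose_providers(ort.get_available_providers())
#   session = ort.InferenceSession("model.onnx", providers=providers)
```

Listing the CPU provider last keeps inference working on machines where the CUDA libraries are missing, at the cost of silently running on CPU.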
Usage
Use this environment whenever you need to:
- Run ONNX model inference on NVIDIA GPUs for higher throughput and lower latency.
- Train models using the ORTModule wrapper with CUDA acceleration.
- Leverage Flash Attention or other GPU-optimized operators.
- Perform mixed-precision (FP16/BF16) training or inference on compatible hardware.
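As a rough guide to the mixed-precision point above, the dtypes a GPU can use depend on its compute capability. A hedged sketch, using the common thresholds (FP16 arithmetic from SM 5.3, BF16 from SM 8.0 / Ampere); the exact cutoffs are a simplification for illustration:

```python
def supported_dtypes(sm_major, sm_minor):
    """Rough mapping from CUDA compute capability to usable tensor dtypes."""
    cc = (sm_major, sm_minor)
    dtypes = ["fp32"]          # FP32 works on every supported GPU
    if cc >= (5, 3):
        dtypes.append("fp16")  # half-precision arithmetic
    if cc >= (8, 0):
        dtypes.append("bf16")  # bfloat16 needs Ampere or newer
    return dtypes
```

For example, a T4 (SM 7.5) gets FP32 and FP16, while an A100 (SM 8.0) additionally gets BF16.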
System Requirements
| Requirement | Minimum | Recommended |
|---|---|---|
| NVIDIA GPU | Compute Capability 3.7 | Compute Capability 8.0+ (Ampere/Hopper) |
| CUDA Toolkit | 11.x | 12.x |
| cuDNN | 8.x | 9.x |
| NVIDIA Driver | 470+ (CUDA 11) | 535+ (CUDA 12) |
| Python | 3.10 | 3.12 |
| Operating System | Linux x86_64, Windows x86_64 | Linux x86_64 |
| RAM | 8 GB | 32 GB+ |
| GPU VRAM | 4 GB | 16 GB+ |
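One quick preflight check for the driver requirement above is to query nvidia-smi, which ships with the driver. A minimal sketch that returns None when no driver (or at least no nvidia-smi) is present:

```python
import shutil
import subprocess

def driver_version():
    """Return the NVIDIA driver version reported by nvidia-smi, or None."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return None  # nvidia-smi not on PATH: driver likely not installed
    result = subprocess.run(
        [exe, "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True, text=True,
    )
    lines = result.stdout.strip().splitlines()
    return lines[0] if result.returncode == 0 and lines else None
```

The returned string (e.g. "535.104.05") can then be compared against the 470+/535+ floors in the table.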
Dependencies
System Packages
| Package | Version | Purpose |
|---|---|---|
| CUDA Toolkit | 11.x or 12.x | GPU compute runtime and compiler |
| cuDNN | 8.x or 9.x | Deep neural network acceleration library |
| NVIDIA Driver | 470+ | GPU kernel driver |
The following shared libraries must be available at runtime (setup.py L221-246):
| Library | Purpose |
|---|---|
| libcublas.so / cublas64_*.dll | Basic Linear Algebra Subroutines on GPU |
| libcublasLt.so / cublasLt64_*.dll | Lightweight matrix multiply routines |
| libcudart.so / cudart64_*.dll | CUDA runtime API |
| libcufft.so / cufft64_*.dll | Fast Fourier Transform on GPU |
| libcurand.so / curand64_*.dll | Random number generation on GPU |
| libnvrtc.so / nvrtc64_*.dll | Runtime compilation of CUDA kernels |
| libnvJitLink.so / nvJitLink_*.dll | JIT linking of CUDA device code |
| libcudnn.so.8 or libcudnn.so.9 | cuDNN neural network primitives |
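A quick way to check whether the Linux libraries in this table are resolvable is ctypes.util.find_library, which consults the same loader search paths. A sketch; cuDNN is omitted because find_library takes unversioned stems, not names like libcudnn.so.9:

```python
from ctypes.util import find_library

# Stems of the CUDA shared libraries listed above (lib<stem>.so on Linux).
CUDA_LIB_STEMS = ["cublas", "cublasLt", "cudart", "cufft", "curand", "nvrtc", "nvJitLink"]

def missing_cuda_libs():
    """Return the stems the dynamic loader cannot currently resolve."""
    return [stem for stem in CUDA_LIB_STEMS if find_library(stem) is None]
```

An empty result means the loader can see every library; a non-empty result usually points at a missing toolkit install or an incomplete LD_LIBRARY_PATH.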
Python Packages
| Package | Version Constraint | Purpose |
|---|---|---|
| onnxruntime-gpu | 1.25.0 | GPU-accelerated ONNX Runtime |
| nvidia-cudnn-cu{major} | ~=9.0 | cuDNN Python package (setup.py L856) |
| numpy | >= 1.21.6 | Tensor I/O |
| flatbuffers | (latest) | Model deserialization |
| protobuf | (latest) | ONNX model loading |
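A floor like numpy >= 1.21.6 must be compared numerically, not as strings ("1.9" sorts after "1.21" lexically). A minimal sketch that handles plain X.Y.Z versions only, not pre-release tags:

```python
def meets_floor(installed, floor):
    """True if a plain dotted version string meets a minimum, compared numerically."""
    as_tuple = lambda v: tuple(int(part) for part in v.split("."))
    return as_tuple(installed) >= as_tuple(floor)
```

In real projects, packaging.version.Version is the robust choice; this sketch only shows why plain string comparison would be wrong.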
Credentials
| Variable | Purpose | Required |
|---|---|---|
| CUDA_HOME | Path to CUDA toolkit installation (e.g., /usr/local/cuda-12.2) | Yes (build from source) |
| CUDNN_HOME | Path to cuDNN installation directory | Yes (build from source) |
| CUDACXX | Path to the CUDA C++ compiler (nvcc) | Yes (build from source) |
| ORT_CUDA_UNAVAILABLE | When set, disables CUDA provider registration (setup.py L180) | No |
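Before a source build, it is worth verifying that CUDA_HOME actually contains the nvcc compiler that CUDACXX should point to. A minimal sketch; the helper name is illustrative:

```python
import os
from pathlib import Path

def find_nvcc(cuda_home=None):
    """Return the nvcc path under CUDA_HOME (or an explicit path), else None."""
    home = cuda_home or os.environ.get("CUDA_HOME")
    if home is None:
        return None  # CUDA_HOME not set and no path supplied
    nvcc = Path(home) / "bin" / "nvcc"
    return nvcc if nvcc.is_file() else None
```

If this returns None, the build.sh invocation below will fail early at CMake configuration time.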
Quick Install
Pre-built wheel (recommended):
```shell
pip install onnxruntime-gpu==1.25.0
```
Build from source with CUDA and training support (ORTModule_Training_Guidelines.md L14):
```shell
export CUDA_HOME=/usr/local/cuda-11.8
export CUDNN_HOME=/usr/local/cuda-11.8
export CUDACXX=$CUDA_HOME/bin/nvcc
./build.sh --config RelWithDebInfo \
    --use_cuda \
    --enable_training \
    --build_wheel \
    --skip_tests \
    --cuda_version=11.8 \
    --parallel 8 \
    --use_mpi
```
Code Evidence
```python
# setup.py:221-246
cuda_dependencies = [
    "libcublas.so",
    "libcublasLt.so",
    "libcudart.so",
    "libcufft.so",
    "libcurand.so",
    "libnvrtc.so",
    "libnvJitLink.so",
    "libcudnn.so.8",
    "libcudnn.so.9",
]
```
This list enumerates the shared libraries that the CUDA execution provider attempts to load at runtime, covering both cuDNN 8 and cuDNN 9 for backward compatibility.
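The System Packages table earlier pairs each Linux .so with a Windows DLL name; that mapping can be expressed as a small helper. The DLL suffixes vary by CUDA version, so the names produced below are assumptions for CUDA 12:

```python
import sys

def platform_lib_name(stem, cuda_major=12):
    """Map a CUDA library stem to a platform-specific file name (illustrative)."""
    if sys.platform == "win32":
        return f"{stem}64_{cuda_major}.dll"  # e.g. cudart64_12.dll
    return f"lib{stem}.so"                   # e.g. libcudart.so
```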
DLL preloading for CUDA 12+ (__init__.py:332-337)

```python
# onnxruntime/__init__.py:332-337
# For CUDA 12.x or newer, preload DLLs to ensure they are found
# before the onnxruntime native library is loaded.
```
The initialization module handles preloading of CUDA DLLs on Windows to avoid load-order issues with CUDA 12.x and newer toolkit versions.
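The preloading idea can be illustrated with ctypes: loading each DLL by absolute path before the native onnxruntime library is imported pins the copies the Windows loader will reuse. This is a sketch, not the actual __init__.py code, and the DLL names are assumptions for CUDA 12 / cuDNN 9:

```python
import ctypes
import os
import sys

def preload_cuda_dlls(cuda_bin_dir,
                      names=("cudart64_12.dll", "cublas64_12.dll", "cudnn64_9.dll")):
    """Load CUDA DLLs by absolute path (Windows only); return the names loaded."""
    if sys.platform != "win32":
        return []  # Linux resolves shared libraries via LD_LIBRARY_PATH instead
    loaded = []
    for name in names:
        path = os.path.join(cuda_bin_dir, name)
        if os.path.exists(path):
            ctypes.WinDLL(path)  # the loader caches the handle for later lookups
            loaded.append(name)
    return loaded
```

Once a DLL is loaded this way, subsequent loads by bare name resolve to the already-loaded copy, sidestepping PATH ordering issues.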
Flash Attention compute capability requirement (ORTModule_Training_Guidelines.md:472)

```
# docs/ORTModule_Training_Guidelines.md:472
# Flash Attention requires CUDA device capability 8.0+
# (NVIDIA Ampere architecture: A100, A10, A30, etc.)
```
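The capability gate quoted above reduces to a tuple comparison; a trivial sketch:

```python
def flash_attention_supported(sm_major, sm_minor):
    """Flash Attention needs compute capability 8.0+ (Ampere and newer)."""
    return (sm_major, sm_minor) >= (8, 0)
```

With PyTorch installed, torch.cuda.get_device_capability() returns exactly such a (major, minor) tuple to feed this check.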
Build environment variables (ORTModule_Training_Guidelines.md:10-12)

```shell
# docs/ORTModule_Training_Guidelines.md:10-12
export CUDA_HOME=/usr/local/cuda-11.8
export CUDNN_HOME=/usr/local/cuda-11.8
export CUDACXX=$CUDA_HOME/bin/nvcc
```
Common Errors
| Error | Cause | Solution |
|---|---|---|
| RuntimeError: CUDA error: no kernel image is available for execution on the device | GPU compute capability not supported by the installed CUDA toolkit version | Upgrade the CUDA toolkit or install a wheel built for your GPU architecture |
| OSError: libcudnn.so.9: cannot open shared object file | cuDNN 9 not installed or not on LD_LIBRARY_PATH | Install cuDNN 9 (pip install nvidia-cudnn-cu12~=9.0) or set LD_LIBRARY_PATH |
| OSError: libcublas.so: cannot open shared object file | CUDA toolkit not installed or library path not set | Install the CUDA toolkit and add /usr/local/cuda/lib64 to LD_LIBRARY_PATH |
| RuntimeError: CUDA out of memory | GPU VRAM exhausted by model or batch size | Reduce batch size, enable memory optimization, or use a GPU with more VRAM |
| Flash Attention not available | GPU compute capability below 8.0 | Use an Ampere or newer GPU (A100, A10, RTX 3090, etc.) |
| CUDACXX not set | Building from source without specifying the nvcc path | Set export CUDACXX=/usr/local/cuda/bin/nvcc |
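For the out-of-memory row above, the CUDA Execution Provider accepts per-provider options such as device_id and gpu_mem_limit (in bytes). A hedged sketch that builds such an options dict, to be passed alongside the provider name when creating a session:

```python
def cuda_provider_options(device_id=0, mem_limit_gb=None):
    """Build a CUDAExecutionProvider options dict (gpu_mem_limit is in bytes)."""
    options = {"device_id": device_id}
    if mem_limit_gb is not None:
        options["gpu_mem_limit"] = int(mem_limit_gb * (1 << 30))
    return options

# Typical use (onnxruntime not imported here):
#   session = ort.InferenceSession(
#       "model.onnx",
#       providers=[("CUDAExecutionProvider", cuda_provider_options(0, 8))],
#   )
```

Capping gpu_mem_limit keeps the arena allocator from claiming the whole card, which helps when several processes share one GPU.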
Compatibility Notes
- CUDA 11.x vs 12.x: The onnxruntime-gpu package ships separate wheels for CUDA 11 and CUDA 12. Ensure you install the wheel matching your CUDA toolkit version.
- cuDNN versions: Both cuDNN 8.x and 9.x are supported. The runtime attempts to load libcudnn.so.9 first, falling back to libcudnn.so.8.
- Driver compatibility: CUDA 12.x requires NVIDIA driver 525+ at minimum. For best results, use driver 535 or newer.
- Windows: DLL preloading is handled automatically for CUDA 12.x. Ensure the CUDA bin directory is on the system PATH.
- Linux: Set LD_LIBRARY_PATH to include both the CUDA and cuDNN library directories if they are not in standard locations.
- Flash Attention: Only available on Ampere (SM 80) and newer architectures. Volta and Turing GPUs cannot use Flash Attention.
- Multi-GPU: For multi-GPU training, see the Distributed Training Environment page.