
Environment:Microsoft Onnxruntime CUDA GPU Environment

From Leeroopedia


sources: setup.py, onnxruntime/__init__.py, docs/ORTModule_Training_Guidelines.md
domains: gpu, cuda, training, inference
last_updated: 2026-02-10

Overview

NVIDIA CUDA-based GPU environment for accelerated ONNX Runtime inference and training, requiring CUDA 11.x/12.x, cuDNN 8.x/9.x, and compatible NVIDIA drivers.

Description

The CUDA GPU Environment extends the base Python Inference Environment with NVIDIA GPU acceleration through the CUDA Execution Provider. It requires an NVIDIA GPU with appropriate drivers, a CUDA toolkit installation (11.x or 12.x), and cuDNN libraries (8.x or 9.x).

At runtime the environment loads a set of shared libraries, including cublas, cublasLt, cudart, cufft, curand, nvrtc, and nvJitLink. For training workloads, Flash Attention is available on GPUs with compute capability 8.0 or higher (Ampere architecture and newer). The GPU package adds nvidia-cudnn-cu{major}~=9.0 as a dependency to ensure the correct cuDNN version is available, and DLL preloading for CUDA 12.x and newer is handled automatically by the __init__.py module.

Building from source requires setting the CUDA_HOME, CUDNN_HOME, and CUDACXX environment variables.
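The provider list passed to an InferenceSession controls which backend runs the model. A minimal sketch of preferring the CUDA Execution Provider with a CPU fallback; the session-creation helper assumes the onnxruntime-gpu wheel is installed, and "model.onnx" is a placeholder path, not a file from this page:

```python
def preferred_providers(available):
    """Order the available providers so CUDA is tried first, CPU last."""
    order = ["CUDAExecutionProvider", "CPUExecutionProvider"]
    return [p for p in order if p in available]


def make_session(model_path="model.onnx"):
    """Create a session preferring CUDA. Assumes onnxruntime-gpu is
    installed; model_path is a hypothetical placeholder."""
    import onnxruntime as ort

    providers = preferred_providers(ort.get_available_providers())
    return ort.InferenceSession(model_path, providers=providers)
```

If the CUDA provider fails to initialize, ONNX Runtime falls through to the next provider in the list, so keeping CPUExecutionProvider last gives a working (if slower) session rather than a hard failure.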

Usage

Use this environment whenever you need to:

  • Run ONNX model inference on NVIDIA GPUs for higher throughput and lower latency.
  • Train models using the ORTModule wrapper with CUDA acceleration.
  • Leverage Flash Attention or other GPU-optimized operators.
  • Perform mixed-precision (FP16/BF16) training or inference on compatible hardware.
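As a rough guide to the mixed-precision bullet above, a hedged helper mapping compute capability to usable reduced-precision dtypes. The capability tuple would come from a tool such as torch.cuda.get_device_capability(); the thresholds are general NVIDIA architecture facts (FP16 tensor cores from Volta, BF16 from Ampere), not values taken from this page:

```python
def reduced_precision_dtypes(capability):
    """Return reduced-precision dtypes usable at a (major, minor) CUDA
    compute capability. FP16 tensor cores arrived with Volta (7.0) and
    BF16 with Ampere (8.0)."""
    dtypes = []
    if capability >= (7, 0):
        dtypes.append("float16")
    if capability >= (8, 0):
        dtypes.append("bfloat16")
    return dtypes
```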

System Requirements

Requirement | Minimum | Recommended
NVIDIA GPU | Compute Capability 3.7 | Compute Capability 8.0+ (Ampere/Hopper)
CUDA Toolkit | 11.x | 12.x
cuDNN | 8.x | 9.x
NVIDIA Driver | 470+ (CUDA 11) | 535+ (CUDA 12)
Python | 3.10 | 3.12
Operating System | Linux x86_64, Windows x86_64 | Linux x86_64
RAM | 8 GB | 32 GB+
GPU VRAM | 4 GB | 16 GB+
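The driver rows above can be turned into a quick preflight check. A sketch using the minimums stated on this page (470+ for CUDA 11, 525+ for CUDA 12, per the compatibility notes); the installed driver version would come from parsing nvidia-smi output, which is not shown here:

```python
# Minimum NVIDIA driver major versions per CUDA major, from this page.
MIN_DRIVER = {11: 470, 12: 525}


def driver_ok(cuda_major, driver_version):
    """Check an installed driver against the minimum for a CUDA major.
    driver_version is assumed to be the driver's major version number."""
    required = MIN_DRIVER.get(cuda_major)
    if required is None:
        raise ValueError(f"no minimum recorded for CUDA {cuda_major}")
    return driver_version >= required
```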

Dependencies

System Packages

Package | Version | Purpose
CUDA Toolkit | 11.x or 12.x | GPU compute runtime and compiler
cuDNN | 8.x or 9.x | Deep neural network acceleration library
NVIDIA Driver | 470+ | GPU kernel driver

CUDA Shared Libraries

The following shared libraries must be available at runtime (setup.py L221-246):

Library | Purpose
libcublas.so / cublas64_*.dll | Basic Linear Algebra Subroutines on GPU
libcublasLt.so / cublasLt64_*.dll | Lightweight matrix multiply routines
libcudart.so / cudart64_*.dll | CUDA runtime API
libcufft.so / cufft64_*.dll | Fast Fourier Transform on GPU
libcurand.so / curand64_*.dll | Random number generation on GPU
libnvrtc.so / nvrtc64_*.dll | Runtime compilation of CUDA kernels
libnvJitLink.so / nvJitLink_*.dll | JIT linking of CUDA device code
libcudnn.so.8 or libcudnn.so.9 | cuDNN neural network primitives
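A quick way to check whether the dynamic loader can find these libraries is to probe them with ctypes. This is a diagnostic sketch, not onnxruntime's own loading logic; the Windows DLL version suffix varies by toolkit release, so cuda_major is an assumption here:

```python
import ctypes
import sys


def runtime_library_name(base, cuda_major=12, platform=sys.platform):
    """Map a CUDA library base name to its platform file name, mirroring
    the naming in the table above (Linux .so vs versioned Windows DLLs)."""
    if platform == "win32":
        return f"{base}64_{cuda_major}.dll"
    return f"lib{base}.so"


def can_load(name):
    """Probe whether the dynamic loader can locate a shared library."""
    try:
        ctypes.CDLL(name)
        return True
    except OSError:
        return False


for base in ["cublas", "cublasLt", "cudart", "cufft", "curand", "nvrtc"]:
    lib = runtime_library_name(base)
    print(f"{lib}: {'found' if can_load(lib) else 'MISSING'}")
```

A MISSING line usually means the library directory is absent from LD_LIBRARY_PATH (Linux) or PATH (Windows).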

Python Packages

Package | Version Constraint | Purpose
onnxruntime-gpu | 1.25.0 | GPU-accelerated ONNX Runtime
nvidia-cudnn-cu{major} | ~=9.0 | cuDNN Python package (setup.py L856)
numpy | >= 1.21.6 | Tensor I/O
flatbuffers | (latest) | Model deserialization
protobuf | (latest) | ONNX model loading

Credentials

Variable | Purpose | Required
CUDA_HOME | Path to CUDA toolkit installation (e.g., /usr/local/cuda-12.2) | Yes (build from source)
CUDNN_HOME | Path to cuDNN installation directory | Yes (build from source)
CUDACXX | Path to the CUDA C++ compiler (nvcc) | Yes (build from source)
ORT_CUDA_UNAVAILABLE | When set, disables CUDA provider registration (setup.py L180) | No
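Since all three build variables must be set before invoking the build script, a small preflight helper can catch a missing one early. A sketch based directly on the table above:

```python
import os

# Variables required for a source build, per the table above.
REQUIRED_BUILD_VARS = ["CUDA_HOME", "CUDNN_HOME", "CUDACXX"]


def missing_build_vars(env=None):
    """Return the source-build variables that are unset or empty.
    Defaults to inspecting the current process environment."""
    env = os.environ if env is None else env
    return [v for v in REQUIRED_BUILD_VARS if not env.get(v)]
```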

Quick Install

Pre-built wheel (recommended):

pip install onnxruntime-gpu==1.25.0

Build from source with CUDA and training support (ORTModule_Training_Guidelines.md L14):

export CUDA_HOME=/usr/local/cuda-11.8
export CUDNN_HOME=/usr/local/cuda-11.8
export CUDACXX=$CUDA_HOME/bin/nvcc

./build.sh --config RelWithDebInfo \
  --use_cuda \
  --enable_training \
  --build_wheel \
  --skip_tests \
  --cuda_version=11.8 \
  --parallel 8 \
  --use_mpi

Code Evidence

CUDA shared library dependencies (setup.py:221-246)

# setup.py:221-246
cuda_dependencies = [
    "libcublas.so",
    "libcublasLt.so",
    "libcudart.so",
    "libcufft.so",
    "libcurand.so",
    "libnvrtc.so",
    "libnvJitLink.so",
    "libcudnn.so.8",
    "libcudnn.so.9",
]

This list enumerates the shared libraries that the CUDA execution provider attempts to load at runtime, covering both cuDNN 8 and cuDNN 9 for backward compatibility.

DLL preloading for CUDA 12+ (__init__.py:332-337)

# onnxruntime/__init__.py:332-337
# For CUDA 12.x or newer, preload DLLs to ensure they are found
# before the onnxruntime native library is loaded.

The initialization module handles preloading of CUDA DLLs on Windows to avoid load-order issues with CUDA 12.x and newer toolkit versions.
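The general shape of such preloading can be sketched as below. This is not onnxruntime's actual __init__.py logic, which differs in detail; the use of the CUDA_PATH variable (set by the Windows CUDA installer) is an assumption here:

```python
import os
import sys


def cuda_dll_dirs(env):
    """Candidate directories for CUDA DLLs on Windows, derived from the
    CUDA_PATH variable (an assumption; onnxruntime's real preload logic
    in __init__.py is more involved)."""
    dirs = []
    cuda_path = env.get("CUDA_PATH")
    if cuda_path:
        dirs.append(os.path.join(cuda_path, "bin"))
    return dirs


if sys.platform == "win32":
    for d in cuda_dll_dirs(os.environ):
        if os.path.isdir(d):
            # Make the DLLs visible before the native library is loaded.
            os.add_dll_directory(d)
```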

Flash Attention compute capability requirement (ORTModule_Training_Guidelines.md:472)

# docs/ORTModule_Training_Guidelines.md:472
# Flash Attention requires CUDA device capability 8.0+
# (NVIDIA Ampere architecture: A100, A10, A30, etc.)
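The capability gate quoted above reduces to a one-line check. The (major, minor) tuple would come from a tool such as torch.cuda.get_device_capability(), which is an assumption, not part of this environment's dependencies:

```python
def flash_attention_supported(capability):
    """Flash Attention needs compute capability 8.0+ (Ampere or newer),
    per the guideline quoted above. capability is a (major, minor) tuple."""
    return capability >= (8, 0)
```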

Build environment variables (ORTModule_Training_Guidelines.md:10-12)

# docs/ORTModule_Training_Guidelines.md:10-12
export CUDA_HOME=/usr/local/cuda-11.8
export CUDNN_HOME=/usr/local/cuda-11.8
export CUDACXX=$CUDA_HOME/bin/nvcc

Common Errors

Error | Cause | Solution
RuntimeError: CUDA error: no kernel image is available for execution on the device | Installed build does not include kernels for this GPU's compute capability | Upgrade the CUDA toolkit or install a wheel built for your GPU architecture
OSError: libcudnn.so.9: cannot open shared object file | cuDNN 9 not installed or not on LD_LIBRARY_PATH | Install cuDNN 9 (pip install nvidia-cudnn-cu12~=9.0) or set LD_LIBRARY_PATH
OSError: libcublas.so: cannot open shared object file | CUDA toolkit not installed or library path not set | Install the CUDA toolkit and add /usr/local/cuda/lib64 to LD_LIBRARY_PATH
RuntimeError: CUDA out of memory | GPU VRAM exhausted by the model or batch size | Reduce the batch size, enable memory optimization, or use a GPU with more VRAM
Flash Attention not available | GPU compute capability below 8.0 | Use an Ampere or newer GPU (A100, A10, RTX 3090, etc.)
CUDACXX not set | Building from source without specifying the nvcc path | Set export CUDACXX=/usr/local/cuda/bin/nvcc
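These error-to-fix pairings can be encoded for automated triage, e.g. in a CI wrapper around session creation. A sketch that matches message fragments from the table above (the fragment strings and hint wording are taken from this page; the triage function itself is hypothetical):

```python
# Map error-message fragments from the table above to remediation hints.
REMEDIES = [
    ("no kernel image is available",
     "upgrade the CUDA toolkit or install a wheel built for this GPU architecture"),
    ("libcudnn.so.9",
     "install cuDNN 9 (pip install nvidia-cudnn-cu12~=9.0) or extend LD_LIBRARY_PATH"),
    ("libcublas.so",
     "install the CUDA toolkit and add /usr/local/cuda/lib64 to LD_LIBRARY_PATH"),
    ("out of memory",
     "reduce the batch size, enable memory optimization, or use a GPU with more VRAM"),
]


def suggest_fix(error_message):
    """Return the first matching remediation hint, or None if no
    fragment from the table matches."""
    for fragment, hint in REMEDIES:
        if fragment in error_message:
            return hint
    return None
```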

Compatibility Notes

  • CUDA 11.x vs 12.x: The onnxruntime-gpu package ships separate wheels for CUDA 11 and CUDA 12. Ensure you install the wheel matching your CUDA toolkit version.
  • cuDNN versions: Both cuDNN 8.x and 9.x are supported. The runtime attempts to load libcudnn.so.9 first, falling back to libcudnn.so.8.
  • Driver compatibility: CUDA 12.x requires NVIDIA driver 525+ at minimum. For best results, use driver 535 or newer.
  • Windows: DLL preloading is handled automatically for CUDA 12.x. Ensure CUDA bin directory is on the system PATH.
  • Linux: Set LD_LIBRARY_PATH to include both CUDA and cuDNN library directories if they are not in standard locations.
  • Flash Attention: Only available on Ampere (SM 80) and newer architectures. Volta and Turing GPUs cannot use Flash Attention.
  • Multi-GPU: For multi-GPU training, see the Distributed Training Environment page.
