Environment:NVIDIA TransformerEngine CUDA Toolkit Requirements
| Knowledge Sources | |
|---|---|
| Domains | Infrastructure, Deep_Learning, GPU_Computing |
| Last Updated | 2026-02-07 21:00 GMT |
Overview
CUDA Toolkit 12.1+ environment with cuDNN frontend API, cuBLAS, and optional NVSHMEM for building and running NVIDIA TransformerEngine.
Description
This environment defines the CUDA-level dependencies required for TransformerEngine. The library is built as a C++17 project using CMake, with the same C++17 standard applied to both host and CUDA device code. It requires CUDA Toolkit 12.1 at minimum, with newer versions (12.8+, 13.0+) unlocking support for additional GPU architectures (Blackwell SM 100/120). The cuDNN frontend API is required as a git submodule, and CUTLASS headers are used for grouped GEMM operations. For distributed communication overlap, optional MPI and NVSHMEM support can be enabled.
Usage
Use this environment for building TransformerEngine from source or running any workload that uses the TE C++ backend (fused attention, GEMM operations, normalization kernels). This is a mandatory prerequisite for all TE implementations.
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| OS | Linux (Ubuntu 20.04+) | Windows not officially supported |
| Hardware | NVIDIA GPU (SM 7.0+) | Volta, Turing, Ampere, Ada, Hopper, Blackwell |
| Disk | 10GB+ free | For build artifacts and compiled kernels |
| CMake | 3.21+ | Required for build system |
| C++ Standard | C++17 | Required for both host and device code |
Dependencies
System Packages
- `cuda-toolkit` >= 12.1 (FATAL ERROR if < 12.1)
- `cudnn` (frontend API via git submodule at `3rdparty/cudnn-frontend`)
- `cutlass` (headers via git submodule at `3rdparty/cutlass`)
- `nccl` (for distributed communication)
- `cmake` >= 3.21
- `ninja` (optional, for faster builds)
- `ccache` (optional, enabled via `NVTE_USE_CCACHE=1`)
- `mpi` (optional, for userbuffers MPI bootstrap via `NVTE_UB_WITH_MPI=1`)
- `nvshmem` (optional, via `NVTE_ENABLE_NVSHMEM=1`)
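Before configuring a build, it can save time to confirm the toolchain above is actually on the `PATH`. The following is a minimal sketch (the `check` helper is hypothetical, not part of TransformerEngine) that reports each tool without aborting on the optional ones:

```shell
# Report presence of the build tools listed above.
# "required" / "optional" labels follow the dependency list on this page.
check() {
  # $1 = tool name, $2 = required|optional
  if command -v "$1" >/dev/null 2>&1; then
    echo "$1: found ($2)"
  else
    echo "$1: MISSING ($2)"
  fi
}

check nvcc   required   # CUDA Toolkit >= 12.1
check cmake  required   # >= 3.21
check ninja  optional   # faster builds
check ccache optional   # enabled via NVTE_USE_CCACHE=1
```

A missing required tool should be installed before attempting `pip install .`; the optional ones only affect build speed.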
CUDA Architecture Support
| CUDA Toolkit Version | Supported GPU Architectures |
|---|---|
| < 12.1 | Not supported (build fails) |
| 12.1 - 12.7 | SM 70, 80, 89, 90 (Volta through Hopper) |
| 12.8 and later 12.x | SM 70, 80, 89, 90, 100, 120 (adds Blackwell) |
| 13.0+ | SM 75, 80, 89, 90, 100, 120 (drops Volta SM 70, adds Turing SM 75) |
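The table above can be expressed as a small version-to-architecture lookup. This sketch mirrors the CMake default-selection logic shown under Code Evidence; the `default_archs` function is illustrative only, not a TransformerEngine API:

```shell
# Given a CUDA Toolkit version string (e.g. "12.8"), print the default
# SM architecture list that TransformerEngine's CMake would select.
default_archs() {
  major=${1%%.*}
  minor=${1#*.}
  if [ "$major" -ge 13 ]; then
    echo "75 80 89 90 100 120"   # CUDA 13.0+: drops Volta SM 70, adds Turing SM 75
  elif [ "$major" -eq 12 ] && [ "$minor" -ge 8 ]; then
    echo "70 80 89 90 100 120"   # CUDA 12.8+: adds Blackwell SM 100/120
  elif [ "$major" -eq 12 ] && [ "$minor" -ge 1 ]; then
    echo "70 80 89 90"           # CUDA 12.1 - 12.7: Volta through Hopper
  else
    echo "unsupported"           # below 12.1 the build fails outright
  fi
}

default_archs 12.8   # -> 70 80 89 90 100 120
```

Note that this is only the *default* list; setting `NVTE_CUDA_ARCHS` overrides it entirely.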
Feature Availability by CUDA Version
| Feature | Minimum CUDA Version |
|---|---|
| FP4 data types | CUDA 12.8+ |
| SM 100a/103a specific arch codes | CUDA 12.9+ |
| SM 120f (Blackwell fat binary) | CUDA 12.9+ |
| SM 110 architecture support | CUDA 13.0+ |
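The feature gates above compare against `CUDA_VERSION`, which `cuda.h` encodes as `major * 1000 + minor * 10` (so 12.8 becomes 12080, matching the FP4 gate quoted under Code Evidence). A sketch of that encoding and the FP4 check, with hypothetical helper names:

```shell
# Encode a "major.minor" CUDA version the way cuda.h's CUDA_VERSION does:
# major * 1000 + minor * 10 (e.g. 12.8 -> 12080).
encode() {
  echo $(( ${1%%.*} * 1000 + ${1#*.} * 10 ))
}

# FP4 data types require CUDA_VERSION >= 12080 (i.e. CUDA 12.8+).
fp4_support() {
  if [ "$(encode "$1")" -ge 12080 ]; then
    echo "FP4 supported"
  else
    echo "FP4 unsupported"
  fi
}

fp4_support 12.8   # -> FP4 supported
fp4_support 12.1   # -> FP4 unsupported
```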
Credentials
No credentials required for the CUDA toolkit environment itself.
Quick Install
# Verify CUDA toolkit version
nvcc --version # Must show 12.1 or higher
# Clone with submodules (required for cuDNN frontend and CUTLASS)
git clone --recursive https://github.com/NVIDIA/TransformerEngine.git
cd TransformerEngine
# If already cloned without submodules:
git submodule update --init --recursive
# Build with specific architectures (reduces build time)
NVTE_CUDA_ARCHS="80;90" pip install .
# For debug builds
NVTE_BUILD_DEBUG=1 pip install .
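A common build failure is forgetting the submodules, so it is worth checking that they are populated before building. The following sketch (the `check_submodule` helper is hypothetical; the paths are the ones the CMake configuration references) prints a fix-it hint when a submodule directory is empty:

```shell
# Verify that the git submodules required by the build are populated.
# Run from the TransformerEngine source root.
check_submodule() {
  if [ -d "$1" ] && [ -n "$(ls -A "$1" 2>/dev/null)" ]; then
    echo "$1: ok"
  else
    echo "$1: empty -- run: git submodule update --init --recursive"
  fi
}

check_submodule 3rdparty/cudnn-frontend
check_submodule 3rdparty/cutlass
```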
Code Evidence
CUDA 12.1 minimum version check from `transformer_engine/common/CMakeLists.txt:24-26`:
find_package(CUDAToolkit REQUIRED)
if (CUDAToolkit_VERSION VERSION_LESS 12.1)
  message(FATAL_ERROR "CUDA 12.1+ is required, but found CUDA ${CUDAToolkit_VERSION}")
endif()
GPU architecture selection based on CUDA version from `transformer_engine/common/CMakeLists.txt:29-37`:
if(NOT DEFINED CMAKE_CUDA_ARCHITECTURES)
  if (CUDAToolkit_VERSION VERSION_GREATER_EQUAL 13.0)
    set(CMAKE_CUDA_ARCHITECTURES 75 80 89 90 100 120)
  elseif (CUDAToolkit_VERSION VERSION_GREATER_EQUAL 12.8)
    set(CMAKE_CUDA_ARCHITECTURES 70 80 89 90 100 120)
  else ()
    set(CMAKE_CUDA_ARCHITECTURES 70 80 89 90)
  endif()
endif()
cuDNN frontend API requirement from `transformer_engine/common/CMakeLists.txt:82-91`:
set(CUDNN_FRONTEND_INCLUDE_DIR
    "${CMAKE_CURRENT_SOURCE_DIR}/../../3rdparty/cudnn-frontend/include")
if(NOT EXISTS "${CUDNN_FRONTEND_INCLUDE_DIR}")
  message(FATAL_ERROR
          "Could not find cuDNN frontend API at ${CUDNN_FRONTEND_INCLUDE_DIR}. "
          "Try running 'git submodule update --init --recursive' "
          "within the Transformer Engine source.")
endif()
FP4 type support gated on CUDA version from `transformer_engine/common/common.h:11`:
#define FP4_TYPE_SUPPORTED (CUDA_VERSION >= 12080)
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `FATAL_ERROR: CUDA 12.1+ is required, but found CUDA X.Y` | CUDA Toolkit version too old | Upgrade CUDA Toolkit to 12.1 or newer |
| `Could not find cuDNN frontend API` | Git submodules not initialized | Run `git submodule update --init --recursive` |
| `Unsupported cuda version X.Y` | CUDA major version not 12 or 13 | Install CUDA 12.x or 13.x |
| Build OOM during compilation | Too many parallel compile jobs | Set `MAX_JOBS=1 NVTE_BUILD_THREADS_PER_JOB=1` |
Compatibility Notes
- CUDA 12.0: The PyTorch extension's build scripts accept CUDA 12.0+, but the core library's CMake check requires 12.1+, so 12.1 is the effective minimum for a full build.
- CUDA 12.8: Enables FP4 data types and Blackwell (SM 100/120) architecture support.
- CUDA 12.9: Enables FP8 block-scaled GEMM (requires SM 9.0+) and refined Blackwell arch codes.
- CUDA 13.0: Drops Volta (SM 70) from default architecture list, adds SM 75 (Turing) and SM 110.
- Custom architectures: Set `NVTE_CUDA_ARCHS="80;90"` to compile only for needed GPUs, significantly reducing build time.
Related Pages
- Implementation:NVIDIA_TransformerEngine_TE_Linear
- Implementation:NVIDIA_TransformerEngine_TE_LayerNorm
- Implementation:NVIDIA_TransformerEngine_TE_LayerNormLinear
- Implementation:NVIDIA_TransformerEngine_TE_LayerNormMLP
- Implementation:NVIDIA_TransformerEngine_TE_DotProductAttention
- Implementation:NVIDIA_TransformerEngine_TE_TransformerLayer
- Implementation:NVIDIA_TransformerEngine_Initialize_UB
- Implementation:NVIDIA_TransformerEngine_TE_Autocast
- Implementation:NVIDIA_TransformerEngine_DelayedScaling_Recipe
- Implementation:NVIDIA_TransformerEngine_Float8CurrentScaling_Recipe