Environment:NVIDIA TransformerEngine CUDA Toolkit Requirements
| Knowledge Sources | |
|---|---|
| Domains | Infrastructure, Deep_Learning, GPU_Computing |
| Last Updated | 2026-02-07 21:00 GMT |
Overview
CUDA Toolkit 12.1+ environment with cuDNN frontend API, cuBLAS, and optional NVSHMEM for building and running NVIDIA TransformerEngine.
Description
This environment defines the CUDA-level dependencies required for TransformerEngine. The library is built as a C++17 project using CMake, with the same C++17 standard applied to both host and CUDA device code. It requires CUDA Toolkit 12.1 at minimum, with newer versions (12.8+, 13.0+) unlocking support for additional GPU architectures (Blackwell SM 100/120). The cuDNN frontend API is required as a git submodule, and CUTLASS headers are used for grouped GEMM operations. For distributed communication overlap, optional MPI and NVSHMEM support can be enabled.
Usage
Use this environment for building TransformerEngine from source or running any workload that uses the TE C++ backend (fused attention, GEMM operations, normalization kernels). This is a mandatory prerequisite for all TE implementations.
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| OS | Linux (Ubuntu 20.04+) | Windows not officially supported |
| Hardware | NVIDIA GPU (SM 7.0+) | Volta, Turing, Ampere, Ada, Hopper, Blackwell |
| Disk | 10GB+ free | For build artifacts and compiled kernels |
| CMake | 3.21+ | Required for build system |
| C++ Standard | C++17 | Required for both host and device code |
Dependencies
System Packages
- `cuda-toolkit` >= 12.1 (FATAL ERROR if < 12.1)
- `cudnn` (frontend API via git submodule at `3rdparty/cudnn-frontend`)
- `cutlass` (headers via git submodule at `3rdparty/cutlass`)
- `nccl` (for distributed communication)
- `cmake` >= 3.21
- `ninja` (optional, for faster builds)
- `ccache` (optional, enabled via `NVTE_USE_CCACHE=1`)
- `mpi` (optional, for userbuffers MPI bootstrap via `NVTE_UB_WITH_MPI=1`)
- `nvshmem` (optional, via `NVTE_ENABLE_NVSHMEM=1`)
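Before configuring a build, it can save time to confirm the toolchain above is actually on the `PATH`. The following is a minimal sketch (the `check` helper is hypothetical, not part of TransformerEngine) that reports each tool without aborting on the optional ones:

```shell
# Report presence of the build tools listed above.
# "required" / "optional" labels follow the dependency list on this page.
check() {
  # $1 = tool name, $2 = required|optional
  if command -v "$1" >/dev/null 2>&1; then
    echo "$1: found ($2)"
  else
    echo "$1: MISSING ($2)"
  fi
}

check nvcc   required   # CUDA Toolkit >= 12.1
check cmake  required   # >= 3.21
check ninja  optional   # faster builds
check ccache optional   # enabled via NVTE_USE_CCACHE=1
```

A missing required tool should be installed before attempting `pip install .`; the optional ones only affect build speed.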
CUDA Architecture Support
| CUDA Toolkit Version | Supported GPU Architectures |
|---|---|
| < 12.1 | Not supported (build fails) |
| 12.1 - 12.7 | SM 70, 80, 89, 90 (Volta through Hopper) |
| 12.8 and later 12.x | SM 70, 80, 89, 90, 100, 120 (adds Blackwell) |
| 13.0+ | SM 75, 80, 89, 90, 100, 120 (drops Volta SM 70, adds Turing SM 75) |
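The table above can be expressed as a small version-to-architecture lookup. This sketch mirrors the CMake default-selection logic shown under Code Evidence; the `default_archs` function is illustrative only, not a TransformerEngine API:

```shell
# Given a CUDA Toolkit version string (e.g. "12.8"), print the default
# SM architecture list that TransformerEngine's CMake would select.
default_archs() {
  major=${1%%.*}
  minor=${1#*.}
  if [ "$major" -ge 13 ]; then
    echo "75 80 89 90 100 120"   # CUDA 13.0+: drops Volta SM 70, adds Turing SM 75
  elif [ "$major" -eq 12 ] && [ "$minor" -ge 8 ]; then
    echo "70 80 89 90 100 120"   # CUDA 12.8+: adds Blackwell SM 100/120
  elif [ "$major" -eq 12 ] && [ "$minor" -ge 1 ]; then
    echo "70 80 89 90"           # CUDA 12.1 - 12.7: Volta through Hopper
  else
    echo "unsupported"           # below 12.1 the build fails outright
  fi
}

default_archs 12.8   # -> 70 80 89 90 100 120
```

Note that this is only the *default* list; setting `NVTE_CUDA_ARCHS` overrides it entirely.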
Feature Availability by CUDA Version
| Feature | Minimum CUDA Version |
|---|---|
| FP4 data types | CUDA 12.8+ |
| SM 100a/103a specific arch codes | CUDA 12.9+ |
| SM 120f (Blackwell fat binary) | CUDA 12.9+ |
| SM 110 architecture support | CUDA 13.0+ |
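The feature gates above compare against `CUDA_VERSION`, which `cuda.h` encodes as `major * 1000 + minor * 10` (so 12.8 becomes 12080, matching the FP4 gate quoted under Code Evidence). A sketch of that encoding and the FP4 check, with hypothetical helper names:

```shell
# Encode a "major.minor" CUDA version the way cuda.h's CUDA_VERSION does:
# major * 1000 + minor * 10 (e.g. 12.8 -> 12080).
encode() {
  echo $(( ${1%%.*} * 1000 + ${1#*.} * 10 ))
}

# FP4 data types require CUDA_VERSION >= 12080 (i.e. CUDA 12.8+).
fp4_support() {
  if [ "$(encode "$1")" -ge 12080 ]; then
    echo "FP4 supported"
  else
    echo "FP4 unsupported"
  fi
}

fp4_support 12.8   # -> FP4 supported
fp4_support 12.1   # -> FP4 unsupported
```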
Credentials
No credentials required for the CUDA toolkit environment itself.
Quick Install
# Verify CUDA toolkit version
nvcc --version # Must show 12.1 or higher
# Clone with submodules (required for cuDNN frontend and CUTLASS)
git clone --recursive https://github.com/NVIDIA/TransformerEngine.git
cd TransformerEngine
# If already cloned without submodules:
git submodule update --init --recursive
# Build with specific architectures (reduces build time)
NVTE_CUDA_ARCHS="80;90" pip install .
# For debug builds
NVTE_BUILD_DEBUG=1 pip install .
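A common build failure is forgetting the submodules, so it is worth checking that they are populated before building. The following sketch (the `check_submodule` helper is hypothetical; the paths are the ones the CMake configuration references) prints a fix-it hint when a submodule directory is empty:

```shell
# Verify that the git submodules required by the build are populated.
# Run from the TransformerEngine source root.
check_submodule() {
  if [ -d "$1" ] && [ -n "$(ls -A "$1" 2>/dev/null)" ]; then
    echo "$1: ok"
  else
    echo "$1: empty -- run: git submodule update --init --recursive"
  fi
}

check_submodule 3rdparty/cudnn-frontend
check_submodule 3rdparty/cutlass
```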
Code Evidence
CUDA 12.1 minimum version check from `transformer_engine/common/CMakeLists.txt:24-26`:
find_package(CUDAToolkit REQUIRED)
if (CUDAToolkit_VERSION VERSION_LESS 12.1)
  message(FATAL_ERROR "CUDA 12.1+ is required, but found CUDA ${CUDAToolkit_VERSION}")
endif()
GPU architecture selection based on CUDA version from `transformer_engine/common/CMakeLists.txt:29-37`:
if(NOT DEFINED CMAKE_CUDA_ARCHITECTURES)
  if (CUDAToolkit_VERSION VERSION_GREATER_EQUAL 13.0)
    set(CMAKE_CUDA_ARCHITECTURES 75 80 89 90 100 120)
  elseif (CUDAToolkit_VERSION VERSION_GREATER_EQUAL 12.8)
    set(CMAKE_CUDA_ARCHITECTURES 70 80 89 90 100 120)
  else ()
    set(CMAKE_CUDA_ARCHITECTURES 70 80 89 90)
  endif()
endif()
cuDNN frontend API requirement from `transformer_engine/common/CMakeLists.txt:82-91`:
set(CUDNN_FRONTEND_INCLUDE_DIR
    "${CMAKE_CURRENT_SOURCE_DIR}/../../3rdparty/cudnn-frontend/include")
if(NOT EXISTS "${CUDNN_FRONTEND_INCLUDE_DIR}")
  message(FATAL_ERROR
          "Could not find cuDNN frontend API at ${CUDNN_FRONTEND_INCLUDE_DIR}. "
          "Try running 'git submodule update --init --recursive' "
          "within the Transformer Engine source.")
endif()
FP4 type support gated on CUDA version from `transformer_engine/common/common.h:11`:
#define FP4_TYPE_SUPPORTED (CUDA_VERSION >= 12080)
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `FATAL_ERROR: CUDA 12.1+ is required, but found CUDA X.Y` | CUDA Toolkit version too old | Upgrade CUDA Toolkit to 12.1 or newer |
| `Could not find cuDNN frontend API` | Git submodules not initialized | Run `git submodule update --init --recursive` |
| `Unsupported cuda version X.Y` | CUDA major version not 12 or 13 | Install CUDA 12.x or 13.x |
| Build OOM during compilation | Too many parallel compile jobs | Set `MAX_JOBS=1 NVTE_BUILD_THREADS_PER_JOB=1` |
Compatibility Notes
- CUDA 12.0: The PyTorch extension's build scripts accept CUDA 12.0+, but the core library's CMake check requires 12.1+, so 12.1 is the effective minimum for a full build.
- CUDA 12.8: Enables FP4 data types and Blackwell (SM 100/120) architecture support.
- CUDA 12.9: Enables FP8 block-scaled GEMM (requires SM 9.0+) and refined Blackwell arch codes.
- CUDA 13.0: Drops Volta (SM 70) from default architecture list, adds SM 75 (Turing) and SM 110.
- Custom architectures: Set `NVTE_CUDA_ARCHS="80;90"` to compile only for needed GPUs, significantly reducing build time.
Related Pages
- Implementation:NVIDIA_TransformerEngine_TE_Linear
- Implementation:NVIDIA_TransformerEngine_TE_LayerNorm
- Implementation:NVIDIA_TransformerEngine_TE_LayerNormLinear
- Implementation:NVIDIA_TransformerEngine_TE_LayerNormMLP
- Implementation:NVIDIA_TransformerEngine_TE_DotProductAttention
- Implementation:NVIDIA_TransformerEngine_TE_TransformerLayer
- Implementation:NVIDIA_TransformerEngine_Initialize_UB
- Implementation:NVIDIA_TransformerEngine_TE_Autocast
- Implementation:NVIDIA_TransformerEngine_DelayedScaling_Recipe
- Implementation:NVIDIA_TransformerEngine_Float8CurrentScaling_Recipe