Environment:NVIDIA TransformerEngine GPU Compute Capability
| Knowledge Sources | |
|---|---|
| Domains | Infrastructure, GPU_Computing, Deep_Learning |
| Last Updated | 2026-02-07 21:00 GMT |
Overview
GPU hardware requirements matrix defining which NVIDIA GPU architectures (SM 7.0 through SM 12.0+) enable which TransformerEngine features (BF16, FP8, MXFP8, NVFP4).
Description
TransformerEngine uses GPU compute capability to gate feature availability at runtime. The minimum supported architecture is SM 7.0 (Volta) for basic operations, SM 8.0 (Ampere) for BF16, SM 8.9 (Ada Lovelace) for FP8 with additional CUDA/cuBLAS requirements, SM 9.0 (Hopper) for full FP8 support and FlashAttention-3, and SM 10.0 (Blackwell) for MXFP8 and NVFP4 quantization. Each feature check is cached via `functools.lru_cache` for performance.
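The caching pattern mentioned above can be sketched in isolation. `check_feature` here is a stand-in for illustration, not a TransformerEngine function:

```python
import functools

calls = []

@functools.lru_cache(maxsize=None)
def check_feature(capability):
    # Stand-in for an expensive device query: the body runs once per
    # distinct capability tuple; repeat calls are served from the cache.
    calls.append(capability)
    return capability >= (9, 0)

check_feature((9, 0))
check_feature((9, 0))  # cache hit: `calls` still holds a single entry
```

Because the result is memoized per argument tuple, the device query cost is paid at most once per process for each capability value.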
Usage
Use this environment reference to determine which TransformerEngine features are available on your GPU hardware. This is critical for selecting the correct FP8 recipe, attention backend, and quantization strategy.
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| GPU (Minimum) | NVIDIA SM 7.0 (Volta V100) | Basic TE operations, no FP8 |
| GPU (BF16) | NVIDIA SM 8.0+ (Ampere A100) | BF16 training support |
| GPU (FP8) | NVIDIA SM 8.9+ (Ada L40/L4) | Requires CUDA 12.1+ and cuBLASLt 12.1.3+ |
| GPU (Full FP8) | NVIDIA SM 9.0+ (Hopper H100) | Full native FP8 support |
| GPU (MXFP8/NVFP4) | NVIDIA SM 10.0+ (Blackwell B200) | Advanced quantization formats |
| VRAM | Varies by model | Minimum 16GB recommended for training |
Dependencies
Feature Availability Matrix
| Feature | Min SM | Additional Requirements | GPU Examples |
|---|---|---|---|
| Basic Operations | 7.0 | None | V100 |
| BF16 Compute | 8.0 | None | A100, A30 |
| FlashAttention-2 | 8.0 | `flash-attn` >= 2.1.1 | A100, L40, H100 |
| FP8 (Ada) | 8.9 | CUDA >= 12.1, cuBLASLt >= 12.1.3 | L40, L4, RTX 4090 |
| FP8 (Native) | 9.0 | None | H100, H200 |
| FlashAttention-3 | 9.0 | `flash_attn_3` package | H100, H200 |
| FP8 Block Scaling | 9.0 | CUDA >= 12.9 | H100, H200 |
| MXFP8 | 10.0 | Not yet supported on SM 12.0+ | B200, GB200 |
| NVFP4 | 10.0 | None | B200, GB200 |
| Non-TN FP8 GEMM | 10.0-11.x or 13.0+ | Architecture-specific | B200+ |
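The matrix above can be restated as a small lookup keyed on the `(major, minor)` capability tuple. This is a sketch with illustrative feature names, not a TransformerEngine API; the Ada FP8 row additionally requires the CUDA/cuBLASLt versions listed in the table:

```python
def available_features(cap):
    # Feature matrix keyed on compute capability (illustrative names).
    feats = {"basic"}
    if cap >= (8, 0):
        feats |= {"bf16", "flash_attn_2"}
    if cap >= (8, 9):
        feats.add("fp8")  # on Ada, also needs CUDA >= 12.1 and cuBLASLt >= 12.1.3
    if cap >= (9, 0):
        feats |= {"flash_attn_3", "fp8_block_scaling"}  # block scaling also needs CUDA >= 12.9
    if (10, 0) <= cap < (12, 0):
        feats.add("mxfp8")  # gated off on SM 12.0+ for now
    if cap >= (10, 0):
        feats.add("nvfp4")
    if (10, 0) <= cap < (12, 0) or cap >= (13, 0):
        feats.add("non_tn_fp8_gemm")
    return feats
```

For example, `available_features((8, 0))` yields BF16 and FlashAttention-2 but no FP8, matching the A100 row.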
Credentials
No credentials required.
Quick Install
```shell
# Check your GPU compute capability
python -c "import torch; print(torch.cuda.get_device_capability())"

# Check if FP8 is available
python -c "from transformer_engine.pytorch import is_fp8_available; print(is_fp8_available())"

# Check if MXFP8 is available
python -c "from transformer_engine.pytorch import is_mxfp8_available; print(is_mxfp8_available())"
```
Code Evidence
FP8 support check from `transformer_engine/pytorch/quantization.py:48-58`:
```python
@functools.lru_cache(maxsize=None)
def check_fp8_support() -> Tuple[bool, str]:
    """Return if fp8 support is available"""
    if get_device_compute_capability() >= (9, 0):  # hopper and above
        return True, ""
    if get_device_compute_capability() < (8, 9):  # pre-ada
        return False, "Device compute capability 8.9 or higher required for FP8 execution."
    if tex.get_cublasLt_version() < 120103:
        return False, "CublasLt version 12.1.3.x or higher required for FP8 execution on Ada."
    if float(torch.version.cuda) < 12.1:
        return False, "Cuda version 12.1 or higher required for FP8 execution on Ada."
    return True, ""
```
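The same gate can be restated as a pure function of explicit arguments. This is a sketch with a hypothetical signature; the real function queries the device and libraries itself:

```python
def fp8_supported(capability, cublaslt_version, cuda_version):
    # Same decision tree as check_fp8_support, with the device and
    # library queries lifted into parameters for illustration.
    if capability >= (9, 0):       # Hopper and newer: unconditional
        return True, ""
    if capability < (8, 9):        # pre-Ada: never
        return False, "Device compute capability 8.9 or higher required for FP8 execution."
    if cublaslt_version < 120103:  # Ada: cuBLASLt >= 12.1.3
        return False, "CublasLt version 12.1.3.x or higher required for FP8 execution on Ada."
    if cuda_version < (12, 1):     # Ada: CUDA >= 12.1
        return False, "Cuda version 12.1 or higher required for FP8 execution on Ada."
    return True, ""
```

Note the ordering: Hopper and newer short-circuit before any version checks, so the CUDA/cuBLASLt requirements apply only to Ada (SM 8.9).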
MXFP8 support check from `transformer_engine/pytorch/quantization.py:62-68`:
```python
@functools.lru_cache(maxsize=None)
def check_mxfp8_support() -> Tuple[bool, str]:
    if get_device_compute_capability() >= (12, 0):
        return False, "MXFP8 (for all gemm layouts) is not supported on 12.0+ architectures yet."
    if get_device_compute_capability() >= (10, 0):  # blackwell and above
        return True, ""
    return False, "Device compute capability 10.0 or higher required for MXFP8 execution."
```
NVFP4 support check from `transformer_engine/pytorch/quantization.py:72-76`:
```python
@functools.lru_cache(maxsize=None)
def check_nvfp4_support() -> Tuple[bool, str]:
    if get_device_compute_capability() >= (10, 0):  # blackwell and above
        return True, ""
    return False, "Device compute capability 10.0 or higher required for NVFP4 execution."
```
BF16 compatibility check from `transformer_engine/pytorch/utils.py:460-464`:
```python
def is_bf16_compatible() -> bool:
    """Replaces torch.cuda.is_bf16_compatible() with an explicit
    check on device compute capability to enforce sm_80 or higher.
    """
    return torch.cuda.get_device_capability()[0] >= 8
```
Non-TN FP8 GEMM layout support from `transformer_engine/pytorch/utils.py:491-496`:
```python
@functools.lru_cache(maxsize=None)
def is_non_tn_fp8_gemm_supported() -> bool:
    device_capability = torch.cuda.get_device_capability()
    return (10, 0) <= device_capability < (12, 0) or device_capability >= (13, 0)
```
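The chained comparison works because Python orders `(major, minor)` tuples lexicographically, so a single expression can express the supported ranges. A few spot checks, with the capability passed as an argument for illustration:

```python
def supported(cap):
    # Same expression as is_non_tn_fp8_gemm_supported, parameterized
    # on the capability tuple: true for SM 10.x-11.x and SM 13.0+.
    return (10, 0) <= cap < (12, 0) or cap >= (13, 0)

# Lexicographic tuple ordering: (11, 5) < (12, 0) because 11 < 12.
assert supported((10, 0))      # Blackwell B200
assert supported((11, 5))      # any SM 11.x
assert not supported((12, 0))  # excluded range
assert supported((13, 0))
assert not supported((9, 0))   # Hopper
```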
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `Device compute capability 8.9 or higher required for FP8 execution.` | GPU too old for FP8 | Upgrade to Ada, Hopper, or Blackwell GPU |
| `CublasLt version 12.1.3.x or higher required for FP8 execution on Ada.` | cuBLASLt too old on Ada GPU | Upgrade the CUDA Toolkit so the bundled cuBLASLt is 12.1.3 or newer |
| `Cuda version 12.1 or higher required for FP8 execution on Ada.` | CUDA runtime too old on Ada | Upgrade CUDA to 12.1+ |
| `Device compute capability 10.0 or higher required for MXFP8 execution.` | GPU not Blackwell | Only available on SM 10.0+ (B200) |
| `Device compute capability 10.0 or higher required for NVFP4 execution.` | GPU not Blackwell | Only available on SM 10.0+ (B200) |
| `MXFP8 (for all gemm layouts) is not supported on 12.0+ architectures yet.` | SM 12.0+ temporary limitation | Use Float8CurrentScaling instead |
| `BF16 support requires a GPU with compute capability 8.0 or higher.` | GPU is Volta or Turing | Upgrade to Ampere or newer |
Compatibility Notes
- Volta (SM 7.0): Supported for basic operations only. No BF16, no FP8.
- Turing (SM 7.5): Similar to Volta. Limited feature set.
- Ampere (SM 8.0): BF16 support, FlashAttention-2. No FP8.
- Ada Lovelace (SM 8.9): FP8 available with extra CUDA/cuBLAS requirements. FusedAttention with KV caching is disabled due to a cuDNN bug.
- Hopper (SM 9.0): Full FP8 native support, FlashAttention-3, FP8 block scaling (with CUDA 12.9+). Best balance of features.
- Blackwell (SM 10.0): MXFP8 and NVFP4 quantization. Flash Attention requires v2.7.3+.
- SM 12.0+: MXFP8 temporarily not supported for all GEMM layouts. Auto-falls back to Float8CurrentScaling.
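The fallback described in the last note can be sketched as a simple recipe selector. The names and logic here are illustrative; the actual fallback happens inside TransformerEngine:

```python
def select_recipe(cap):
    # Prefer MXFP8 where the MXFP8 check passes (SM 10.x-11.x);
    # on SM 12.0+, where MXFP8 is gated off, fall back to
    # current-scaling FP8, as do Hopper and Ada.
    if (10, 0) <= cap < (12, 0):
        return "MXFP8"
    if cap >= (8, 9):
        return "Float8CurrentScaling"  # on Ada, subject to the CUDA/cuBLASLt checks
    return None  # pre-Ada: no FP8 recipe available
```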
Related Pages
- Implementation:NVIDIA_TransformerEngine_TE_Autocast
- Implementation:NVIDIA_TransformerEngine_DelayedScaling_Recipe
- Implementation:NVIDIA_TransformerEngine_Float8CurrentScaling_Recipe
- Implementation:NVIDIA_TransformerEngine_TE_DotProductAttention
- Implementation:NVIDIA_TransformerEngine_TE_TransformerLayer