
Environment:NVIDIA TransformerEngine GPU Compute Capability

From Leeroopedia


Knowledge Sources
Domains: Infrastructure, GPU_Computing, Deep_Learning
Last Updated: 2026-02-07 21:00 GMT

Overview

GPU hardware requirements matrix defining which NVIDIA GPU architectures (SM 7.0 through SM 12.0+) enable which TransformerEngine features (BF16, FP8, MXFP8, NVFP4).

Description

TransformerEngine uses GPU compute capability to gate feature availability at runtime. The minimum supported architecture is SM 7.0 (Volta) for basic operations, SM 8.0 (Ampere) for BF16, SM 8.9 (Ada Lovelace) for FP8 with additional CUDA/cuBLAS requirements, SM 9.0 (Hopper) for full FP8 support and FlashAttention-3, and SM 10.0 (Blackwell) for MXFP8 and NVFP4 quantization. Each feature check is cached via `functools.lru_cache` for performance.

Usage

Use this environment reference to determine which TransformerEngine features are available on your GPU hardware. This is critical for selecting the correct FP8 recipe, attention backend, and quantization strategy.
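As a rough guide, the capability-to-feature mapping on this page can be encoded as a small lookup. A minimal sketch (the `features_for_capability` helper and the feature names are illustrative, not TransformerEngine API; note that on Ada, FP8 additionally requires CUDA 12.1+ and cuBLASLt 12.1.3+):

```python
def features_for_capability(cc):
    """Return illustrative feature names enabled at compute capability tuple `cc`."""
    features = []
    if cc >= (7, 0):
        features.append("basic")               # Volta and above
    if cc >= (8, 0):
        features += ["bf16", "flash-attention-2"]
    if cc >= (8, 9):
        features.append("fp8")                 # Ada also needs CUDA 12.1+, cuBLASLt 12.1.3+
    if cc >= (9, 0):
        features += ["fp8-native", "flash-attention-3"]
    if (10, 0) <= cc < (12, 0):
        features.append("mxfp8")               # not yet supported on SM 12.0+
    if cc >= (10, 0):
        features.append("nvfp4")
    return features
```

For example, an A100 (SM 8.0) gets BF16 and FlashAttention-2 but no FP8, while an H100 (SM 9.0) gets native FP8 without the extra CUDA/cuBLASLt constraints.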

System Requirements

| Category | Requirement | Notes |
|---|---|---|
| GPU (Minimum) | NVIDIA SM 7.0 (Volta V100) | Basic TE operations, no FP8 |
| GPU (BF16) | NVIDIA SM 8.0+ (Ampere A100) | BF16 training support |
| GPU (FP8) | NVIDIA SM 8.9+ (Ada L40/L4) | Requires CUDA 12.1+ and cuBLASLt 12.1.3+ |
| GPU (Full FP8) | NVIDIA SM 9.0+ (Hopper H100) | Full native FP8 support |
| GPU (MXFP8/NVFP4) | NVIDIA SM 10.0+ (Blackwell B200) | Advanced quantization formats |
| VRAM | Varies by model | Minimum 16GB recommended for training |

Dependencies

Feature Availability Matrix

| Feature | Min SM | Additional Requirements | GPU Examples |
|---|---|---|---|
| Basic Operations | 7.0 | None | V100 |
| BF16 Compute | 8.0 | None | A100, A30 |
| FlashAttention-2 | 8.0 | `flash-attn` >= 2.1.1 | A100, L40, H100 |
| FP8 (Ada) | 8.9 | CUDA >= 12.1, cuBLASLt >= 12.1.3 | L40, L4, RTX 4090 |
| FP8 (Native) | 9.0 | None | H100, H200 |
| FlashAttention-3 | 9.0 | `flash_attn_3` package | H100, H200 |
| FP8 Block Scaling | 9.0 | CUDA >= 12.9 | H100, H200 |
| MXFP8 | 10.0 | Not on SM 12.0+ yet | B200, GB200 |
| NVFP4 | 10.0 | None | B200, GB200 |
| Non-TN FP8 GEMM | 10.0-11.x or 13.0+ | Architecture-specific | B200+ |

Credentials

No credentials required.

Quick Install

```shell
# Check your GPU compute capability
python -c "import torch; print(torch.cuda.get_device_capability())"

# Check FP8 support (returns a (supported, reason) tuple)
python -c "from transformer_engine.pytorch.quantization import check_fp8_support; print(check_fp8_support())"

# Check MXFP8 support
python -c "from transformer_engine.pytorch.quantization import check_mxfp8_support; print(check_mxfp8_support())"
```

Code Evidence

FP8 support check from `transformer_engine/pytorch/quantization.py:48-58`:

```python
@functools.lru_cache(maxsize=None)
def check_fp8_support() -> Tuple[bool, str]:
    """Return if fp8 support is available"""
    if get_device_compute_capability() >= (9, 0):  # hopper and above
        return True, ""
    if get_device_compute_capability() < (8, 9):  # pre-ada
        return False, "Device compute capability 8.9 or higher required for FP8 execution."
    if tex.get_cublasLt_version() < 120103:
        return False, "CublasLt version 12.1.3.x or higher required for FP8 execution on Ada."
    if float(torch.version.cuda) < 12.1:
        return False, "Cuda version 12.1 or higher required for FP8 execution on Ada."
    return True, ""
```
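The integer threshold `120103` appears to pack the cuBLASLt version as major·10000 + minor·100 + patch, so version 12.1.3 encodes to 120103. A quick sketch of that packing (the helper name is illustrative, not TE API):

```python
def encode_cublaslt_version(major, minor, patch):
    """Pack a cuBLASLt version triple into the integer form compared
    against 120103 in check_fp8_support(): major*10000 + minor*100 + patch."""
    return major * 10000 + minor * 100 + patch
```

Under this encoding, any cuBLASLt older than 12.1.3 (e.g. 12.1.2 → 120102) fails the Ada FP8 gate.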

MXFP8 support check from `transformer_engine/pytorch/quantization.py:62-68`:

```python
@functools.lru_cache(maxsize=None)
def check_mxfp8_support() -> Tuple[bool, str]:
    if get_device_compute_capability() >= (12, 0):
        return False, "MXFP8 (for all gemm layouts) is not supported on 12.0+ architectures yet."
    if get_device_compute_capability() >= (10, 0):  # blackwell and above
        return True, ""
    return False, "Device compute capability 10.0 or higher required for MXFP8 execution."
```

NVFP4 support check from `transformer_engine/pytorch/quantization.py:72-76`:

```python
@functools.lru_cache(maxsize=None)
def check_nvfp4_support() -> Tuple[bool, str]:
    if get_device_compute_capability() >= (10, 0):  # blackwell and above
        return True, ""
    return False, "Device compute capability 10.0 or higher required for NVFP4 execution."
```

BF16 compatibility check from `transformer_engine/pytorch/utils.py:460-464`:

```python
def is_bf16_compatible() -> bool:
    """Replaces torch.cuda.is_bf16_compatible() with an explicit
    check on device compute capability to enforce sm_80 or higher.
    """
    return torch.cuda.get_device_capability()[0] >= 8
```

Non-TN FP8 GEMM layout support from `transformer_engine/pytorch/utils.py:491-496`:

```python
@functools.lru_cache(maxsize=None)
def is_non_tn_fp8_gemm_supported() -> bool:
    device_capability = torch.cuda.get_device_capability()
    return (10, 0) <= device_capability < (12, 0) or device_capability >= (13, 0)
```

Common Errors

| Error Message | Cause | Solution |
|---|---|---|
| `Device compute capability 8.9 or higher required for FP8 execution.` | GPU too old for FP8 | Upgrade to Ada, Hopper, or Blackwell GPU |
| `CublasLt version 12.1.3.x or higher required for FP8 execution on Ada.` | cuBLAS too old on Ada GPU | Upgrade CUDA Toolkit to 12.1+ |
| `Cuda version 12.1 or higher required for FP8 execution on Ada.` | CUDA runtime too old on Ada | Upgrade CUDA to 12.1+ |
| `Device compute capability 10.0 or higher required for MXFP8 execution.` | GPU not Blackwell | Only available on SM 10.0+ (B200) |
| `Device compute capability 10.0 or higher required for NVFP4 execution.` | GPU not Blackwell | Only available on SM 10.0+ (B200) |
| `MXFP8 (for all gemm layouts) is not supported on 12.0+ architectures yet.` | SM 12.0+ temporary limitation | Use Float8CurrentScaling instead |
| `BF16 support requires a GPU with compute capability 8.0 or higher.` | GPU is Volta or Turing | Upgrade to Ampere or newer |

Compatibility Notes

  • Volta (SM 7.0): Supported for basic operations only. No BF16, no FP8.
  • Turing (SM 7.5): Similar to Volta. Limited feature set.
  • Ampere (SM 8.0): BF16 support, FlashAttention-2. No FP8.
  • Ada Lovelace (SM 8.9): FP8 available with extra CUDA/cuBLAS requirements. FusedAttention with KV caching is disabled due to a cuDNN bug.
  • Hopper (SM 9.0): Full FP8 native support, FlashAttention-3, FP8 block scaling (with CUDA 12.9+). Best balance of features.
  • Blackwell (SM 10.0): MXFP8 and NVFP4 quantization. Flash Attention requires v2.7.3+.
  • SM 12.0+: MXFP8 temporarily not supported for all GEMM layouts. Auto-falls back to Float8CurrentScaling.
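The fallback behavior described in these notes can be summarized as a recipe-selection sketch (the `pick_fp8_recipe` helper and the returned names are illustrative shorthand, not TransformerEngine API; on Ada, FP8 additionally depends on the CUDA/cuBLASLt requirements above):

```python
def pick_fp8_recipe(cc):
    """Illustrative FP8 recipe selection following the compatibility notes."""
    if (10, 0) <= cc < (12, 0):
        return "MXFP8"                  # Blackwell SM 10.x-11.x: MXFP8 available
    if cc >= (8, 9):
        return "Float8CurrentScaling"   # Ada/Hopper FP8, and the SM 12.0+ fallback
    return None                         # pre-Ada GPUs: no FP8 path
```

For instance, an SM 12.0 device falls through the first branch and lands on Float8CurrentScaling, matching the automatic fallback noted above.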
