Environment:NVIDIA TransformerEngine GPU Compute Capability
| Knowledge Sources | |
|---|---|
| Domains | Infrastructure, GPU_Computing, Deep_Learning |
| Last Updated | 2026-02-07 21:00 GMT |
Overview
GPU hardware requirements matrix defining which NVIDIA GPU architectures (SM 7.0 through SM 12.0+) enable which TransformerEngine features (BF16, FP8, MXFP8, NVFP4).
Description
TransformerEngine uses GPU compute capability to gate feature availability at runtime. The minimum supported architecture is SM 7.0 (Volta) for basic operations, SM 8.0 (Ampere) for BF16, SM 8.9 (Ada Lovelace) for FP8 with additional CUDA/cuBLAS requirements, SM 9.0 (Hopper) for full FP8 support and FlashAttention-3, and SM 10.0 (Blackwell) for MXFP8 and NVFP4 quantization. Each feature check is cached via `functools.lru_cache` for performance.
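The caching pattern mentioned above can be sketched in isolation. `check_feature` here is a stand-in for illustration, not a TransformerEngine function:

```python
import functools

calls = []

@functools.lru_cache(maxsize=None)
def check_feature(capability):
    # Stand-in for an expensive device query: the body runs once per
    # distinct capability tuple; repeat calls are served from the cache.
    calls.append(capability)
    return capability >= (9, 0)

check_feature((9, 0))
check_feature((9, 0))  # cache hit: `calls` still holds a single entry
```

Because the result is memoized per argument tuple, the device query cost is paid at most once per process for each capability value.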
Usage
Use this environment reference to determine which TransformerEngine features are available on your GPU hardware. This is critical for selecting the correct FP8 recipe, attention backend, and quantization strategy.
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| GPU (Minimum) | NVIDIA SM 7.0 (Volta V100) | Basic TE operations, no FP8 |
| GPU (BF16) | NVIDIA SM 8.0+ (Ampere A100) | BF16 training support |
| GPU (FP8) | NVIDIA SM 8.9+ (Ada L40/L4) | Requires CUDA 12.1+ and cuBLASLt 12.1.3+ |
| GPU (Full FP8) | NVIDIA SM 9.0+ (Hopper H100) | Full native FP8 support |
| GPU (MXFP8/NVFP4) | NVIDIA SM 10.0+ (Blackwell B200) | Advanced quantization formats |
| VRAM | Varies by model | Minimum 16GB recommended for training |
Dependencies
Feature Availability Matrix
| Feature | Min SM | Additional Requirements | GPU Examples |
|---|---|---|---|
| Basic Operations | 7.0 | None | V100 |
| BF16 Compute | 8.0 | None | A100, A30 |
| FlashAttention-2 | 8.0 | `flash-attn` >= 2.1.1 | A100, L40, H100 |
| FP8 (Ada) | 8.9 | CUDA >= 12.1, cuBLASLt >= 12.1.3 | L40, L4, RTX 4090 |
| FP8 (Native) | 9.0 | None | H100, H200 |
| FlashAttention-3 | 9.0 | `flash_attn_3` package | H100, H200 |
| FP8 Block Scaling | 9.0 | CUDA >= 12.9 | H100, H200 |
| MXFP8 | 10.0 | Not yet supported on SM 12.0+ | B200, GB200 |
| NVFP4 | 10.0 | None | B200, GB200 |
| Non-TN FP8 GEMM | 10.0-11.x or 13.0+ | Architecture-specific | B200+ |
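The matrix above can be restated as a small lookup keyed on the `(major, minor)` capability tuple. This is a sketch with illustrative feature names, not a TransformerEngine API; the Ada FP8 row additionally requires the CUDA/cuBLASLt versions listed in the table:

```python
def available_features(cap):
    # Feature matrix keyed on compute capability (illustrative names).
    feats = {"basic"}
    if cap >= (8, 0):
        feats |= {"bf16", "flash_attn_2"}
    if cap >= (8, 9):
        feats.add("fp8")  # on Ada, also needs CUDA >= 12.1 and cuBLASLt >= 12.1.3
    if cap >= (9, 0):
        feats |= {"flash_attn_3", "fp8_block_scaling"}  # block scaling also needs CUDA >= 12.9
    if (10, 0) <= cap < (12, 0):
        feats.add("mxfp8")  # gated off on SM 12.0+ for now
    if cap >= (10, 0):
        feats.add("nvfp4")
    if (10, 0) <= cap < (12, 0) or cap >= (13, 0):
        feats.add("non_tn_fp8_gemm")
    return feats
```

For example, `available_features((8, 0))` yields BF16 and FlashAttention-2 but no FP8, matching the A100 row.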
Credentials
No credentials required.
Quick Install
```shell
# Check your GPU compute capability
python -c "import torch; print(torch.cuda.get_device_capability())"

# Check if FP8 is available
python -c "from transformer_engine.pytorch import is_fp8_available; print(is_fp8_available())"

# Check if MXFP8 is available
python -c "from transformer_engine.pytorch import is_mxfp8_available; print(is_mxfp8_available())"
```
Code Evidence
FP8 support check from `transformer_engine/pytorch/quantization.py:48-58`:
```python
@functools.lru_cache(maxsize=None)
def check_fp8_support() -> Tuple[bool, str]:
    """Return if fp8 support is available"""
    if get_device_compute_capability() >= (9, 0):  # hopper and above
        return True, ""
    if get_device_compute_capability() < (8, 9):  # pre-ada
        return False, "Device compute capability 8.9 or higher required for FP8 execution."
    if tex.get_cublasLt_version() < 120103:
        return False, "CublasLt version 12.1.3.x or higher required for FP8 execution on Ada."
    if float(torch.version.cuda) < 12.1:
        return False, "Cuda version 12.1 or higher required for FP8 execution on Ada."
    return True, ""
```
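The same gate can be restated as a pure function of explicit arguments. This is a sketch with a hypothetical signature; the real function queries the device and libraries itself:

```python
def fp8_supported(capability, cublaslt_version, cuda_version):
    # Same decision tree as check_fp8_support, with the device and
    # library queries lifted into parameters for illustration.
    if capability >= (9, 0):       # Hopper and newer: unconditional
        return True, ""
    if capability < (8, 9):        # pre-Ada: never
        return False, "Device compute capability 8.9 or higher required for FP8 execution."
    if cublaslt_version < 120103:  # Ada: cuBLASLt >= 12.1.3
        return False, "CublasLt version 12.1.3.x or higher required for FP8 execution on Ada."
    if cuda_version < (12, 1):     # Ada: CUDA >= 12.1
        return False, "Cuda version 12.1 or higher required for FP8 execution on Ada."
    return True, ""
```

Note the ordering: Hopper and newer short-circuit before any version checks, so the CUDA/cuBLASLt requirements apply only to Ada (SM 8.9).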
MXFP8 support check from `transformer_engine/pytorch/quantization.py:62-68`:
```python
@functools.lru_cache(maxsize=None)
def check_mxfp8_support() -> Tuple[bool, str]:
    if get_device_compute_capability() >= (12, 0):
        return False, "MXFP8 (for all gemm layouts) is not supported on 12.0+ architectures yet."
    if get_device_compute_capability() >= (10, 0):  # blackwell and above
        return True, ""
    return False, "Device compute capability 10.0 or higher required for MXFP8 execution."
```
NVFP4 support check from `transformer_engine/pytorch/quantization.py:72-76`:
```python
@functools.lru_cache(maxsize=None)
def check_nvfp4_support() -> Tuple[bool, str]:
    if get_device_compute_capability() >= (10, 0):  # blackwell and above
        return True, ""
    return False, "Device compute capability 10.0 or higher required for NVFP4 execution."
```
BF16 compatibility check from `transformer_engine/pytorch/utils.py:460-464`:
```python
def is_bf16_compatible() -> bool:
    """Replaces torch.cuda.is_bf16_compatible() with an explicit
    check on device compute capability to enforce sm_80 or higher.
    """
    return torch.cuda.get_device_capability()[0] >= 8
```
Non-TN FP8 GEMM layout support from `transformer_engine/pytorch/utils.py:491-496`:
```python
@functools.lru_cache(maxsize=None)
def is_non_tn_fp8_gemm_supported() -> bool:
    device_capability = torch.cuda.get_device_capability()
    return (10, 0) <= device_capability < (12, 0) or device_capability >= (13, 0)
```
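The chained comparison works because Python orders `(major, minor)` tuples lexicographically, so a single expression can express the supported ranges. A few spot checks, with the capability passed as an argument for illustration:

```python
def supported(cap):
    # Same expression as is_non_tn_fp8_gemm_supported, parameterized
    # on the capability tuple: true for SM 10.x-11.x and SM 13.0+.
    return (10, 0) <= cap < (12, 0) or cap >= (13, 0)

# Lexicographic tuple ordering: (11, 5) < (12, 0) because 11 < 12.
assert supported((10, 0))      # Blackwell B200
assert supported((11, 5))      # any SM 11.x
assert not supported((12, 0))  # excluded range
assert supported((13, 0))
assert not supported((9, 0))   # Hopper
```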
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `Device compute capability 8.9 or higher required for FP8 execution.` | GPU too old for FP8 | Upgrade to Ada, Hopper, or Blackwell GPU |
| `CublasLt version 12.1.3.x or higher required for FP8 execution on Ada.` | cuBLASLt too old on Ada GPU | Upgrade the CUDA Toolkit so the bundled cuBLASLt is 12.1.3 or newer |
| `Cuda version 12.1 or higher required for FP8 execution on Ada.` | CUDA runtime too old on Ada | Upgrade CUDA to 12.1+ |
| `Device compute capability 10.0 or higher required for MXFP8 execution.` | GPU not Blackwell | Only available on SM 10.0+ (B200) |
| `Device compute capability 10.0 or higher required for NVFP4 execution.` | GPU not Blackwell | Only available on SM 10.0+ (B200) |
| `MXFP8 (for all gemm layouts) is not supported on 12.0+ architectures yet.` | SM 12.0+ temporary limitation | Use Float8CurrentScaling instead |
| `BF16 support requires a GPU with compute capability 8.0 or higher.` | GPU is Volta or Turing | Upgrade to Ampere or newer |
Compatibility Notes
- Volta (SM 7.0): Supported for basic operations only. No BF16, no FP8.
- Turing (SM 7.5): Similar to Volta. Limited feature set.
- Ampere (SM 8.0): BF16 support, FlashAttention-2. No FP8.
- Ada Lovelace (SM 8.9): FP8 available with extra CUDA/cuBLAS requirements. FusedAttention with KV caching is disabled due to a cuDNN bug.
- Hopper (SM 9.0): Full FP8 native support, FlashAttention-3, FP8 block scaling (with CUDA 12.9+). Best balance of features.
- Blackwell (SM 10.0): MXFP8 and NVFP4 quantization. Flash Attention requires v2.7.3+.
- SM 12.0+: MXFP8 temporarily not supported for all GEMM layouts. Auto-falls back to Float8CurrentScaling.
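The fallback described in the last note can be sketched as a simple recipe selector. The names and logic here are illustrative; the actual fallback happens inside TransformerEngine:

```python
def select_recipe(cap):
    # Prefer MXFP8 where the MXFP8 check passes (SM 10.x-11.x);
    # on SM 12.0+, where MXFP8 is gated off, fall back to
    # current-scaling FP8, as do Hopper and Ada.
    if (10, 0) <= cap < (12, 0):
        return "MXFP8"
    if cap >= (8, 9):
        return "Float8CurrentScaling"  # on Ada, subject to the CUDA/cuBLASLt checks
    return None  # pre-Ada: no FP8 recipe available
```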
Related Pages
- Implementation:NVIDIA_TransformerEngine_TE_Autocast
- Implementation:NVIDIA_TransformerEngine_DelayedScaling_Recipe
- Implementation:NVIDIA_TransformerEngine_Float8CurrentScaling_Recipe
- Implementation:NVIDIA_TransformerEngine_TE_DotProductAttention
- Implementation:NVIDIA_TransformerEngine_TE_TransformerLayer