# Heuristic: NVIDIA TransformerEngine FP8 Recipe Auto-Selection
| Knowledge Sources | |
|---|---|
| Domains | Optimization, LLMs, FP8_Training |
| Last Updated | 2026-02-07 21:00 GMT |
## Overview
Automatic FP8 recipe selection that chooses MXFP8BlockScaling, Float8CurrentScaling, or DelayedScaling based on GPU compute capability for optimal precision and performance.
## Description
TransformerEngine provides a `get_default_fp8_recipe()` function that automatically selects the best FP8 quantization recipe for the current GPU. This removes guesswork and ensures users get the highest-performing recipe their hardware supports. Selection follows a priority order: `MXFP8BlockScaling` (Blackwell, SM 10.x) > `Float8CurrentScaling` (SM 12.0+ fallback) > `DelayedScaling` (Hopper/Ada fallback). Users can override the auto-selection by passing an explicit recipe to `te.autocast()`.
## Usage
Use this heuristic when configuring FP8 training and you are unsure which recipe to use. The default recipe is automatically applied when calling `te.autocast(enabled=True)` without specifying a recipe. Override only when you have specific requirements (e.g., determinism, specific amax history behavior).
## The Insight (Rule of Thumb)
- Action: Call `te.autocast(enabled=True)` without specifying a recipe, or call `get_default_fp8_recipe()` to see which recipe would be selected.
- Value: The auto-selection hierarchy is:
  - SM 10.0-11.x (Blackwell): `MXFP8BlockScaling()` — microscaling FP8 with block-level quantization
  - SM 12.0+ (future architectures): `Float8CurrentScaling()` — temporary fallback until MXFP8 supports all GEMM layouts
  - SM 9.0 (Hopper) / SM 8.9 (Ada): `DelayedScaling()` — classic delayed scaling with an amax history
- Trade-off: MXFP8 offers finer-grained (per-block) scaling but is only available on Blackwell. DelayedScaling is the most widely supported, but its per-tensor scales, derived from an amax history, can be less accurate.
- Override: Users can always pass an explicit recipe, e.g. `te.autocast(enabled=True, recipe=DelayedScaling(amax_history_len=32))`.
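The granularity trade-off can be illustrated with a small, self-contained simulation. This is only an analogy: it uses crude symmetric fixed-step quantization in pure Python, not TransformerEngine's actual FP8 kernels, and the 32-element block size merely mirrors MXFP8's scaling blocks. With one outlier in the data, a single per-tensor scale is inflated for every element, while per-block scaling contains the damage:

```python
import random

BLOCK = 32  # mirrors MXFP8's 32-element scaling blocks

def quant_dequant(xs, levels=127):
    """Fake symmetric quantization: scale by max-abs, round to `levels` steps."""
    scale = max(abs(x) for x in xs) / levels
    if scale == 0.0:
        return list(xs)
    return [round(x / scale) * scale for x in xs]

def per_tensor_error(xs):
    """Mean absolute error when one scale covers the whole tensor."""
    q = quant_dequant(xs)
    return sum(abs(a - b) for a, b in zip(xs, q)) / len(xs)

def per_block_error(xs):
    """Mean absolute error when each 32-element block gets its own scale."""
    q = []
    for i in range(0, len(xs), BLOCK):
        q.extend(quant_dequant(xs[i:i + BLOCK]))
    return sum(abs(a - b) for a, b in zip(xs, q)) / len(xs)

random.seed(0)
# Mostly small activations plus one outlier: the outlier inflates the single
# per-tensor scale, so every small value is rounded on a coarser grid.
data = [random.uniform(-1, 1) for _ in range(256)]
data[7] = 100.0  # outlier lands in the first block

print(per_tensor_error(data) > per_block_error(data))  # True for this data
```

The same effect, at much finer granularity and in hardware, is why MXFP8 improves accuracy over per-tensor FP8 scaling.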
## Reasoning
Each new generation of NVIDIA GPUs introduces improved quantization hardware. Blackwell GPUs have native hardware support for microscaling (MXFP8), which provides per-32-element scaling granularity instead of per-tensor scaling. This dramatically improves numerical accuracy while maintaining FP8 throughput. The auto-selection ensures users always get the best recipe for their hardware without needing to track which GPU supports which format.
The SM 12.0+ exception is temporary: MXFP8 does not yet support all GEMM layouts on those architectures, so TransformerEngine falls back to Float8CurrentScaling, which still performs well using per-tensor scales computed from the current step (no amax history needed).
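The dispatch order can be sketched hardware-independently. In this sketch the recipe names are plain strings and the inputs are explicit parameters; the real implementation instead queries the device via `check_mxfp8_support()` and `get_device_compute_capability()` and returns recipe objects:

```python
def select_fp8_recipe(compute_capability, mxfp8_supported):
    """Mirror of the dispatch order in get_default_fp8_recipe(), using
    plain strings instead of TransformerEngine recipe classes."""
    if mxfp8_supported:
        return "MXFP8BlockScaling"    # Blackwell (SM 10.x): native microscaling
    if compute_capability >= (12, 0):
        # Temporary fallback until MXFP8 covers all GEMM layouts on SM 12.0+.
        return "Float8CurrentScaling"
    return "DelayedScaling"           # Hopper (SM 9.0) / Ada (SM 8.9)

print(select_fp8_recipe((10, 0), mxfp8_supported=True))   # MXFP8BlockScaling
print(select_fp8_recipe((12, 0), mxfp8_supported=False))  # Float8CurrentScaling
print(select_fp8_recipe((9, 0), mxfp8_supported=False))   # DelayedScaling
```

Note that the MXFP8 capability check comes first, so compute capability is only consulted when MXFP8 is unavailable.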
## Code Evidence
Auto-selection logic from `transformer_engine/pytorch/quantization.py:103-110`:
```python
def get_default_fp8_recipe() -> Recipe:
    """FP8 recipe with default args."""
    if check_mxfp8_support()[0]:
        return MXFP8BlockScaling()
    if get_device_compute_capability() >= (12, 0):
        # This is a temporary restriction until MXFP8 is supported for all gemm layouts.
        return Float8CurrentScaling()
    return DelayedScaling()
```
Alignment size varies by recipe from `transformer_engine/pytorch/quantization.py:118-124`:
```python
def get_align_size_for_quantization(recipe: Recipe) -> int:
    """Get the alignment size for quantization."""
    if recipe.mxfp8():
        return 32
    if recipe.nvfp4():
        return 128
    return 16
```
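The alignment size is used to pad tensor dimensions to a multiple the quantization kernels can handle. A minimal sketch of how such a value might be applied; the string keys and the `round_up` helper are illustrative, not TransformerEngine APIs:

```python
def align_size(recipe_name):
    """Alignment (in elements) per recipe family, mirroring the values in
    get_align_size_for_quantization(); the names are illustrative strings."""
    if recipe_name == "mxfp8":
        return 32
    if recipe_name == "nvfp4":
        return 128
    return 16

def round_up(n, align):
    """Round n up to the next multiple of align (e.g. to pad a tensor dim)."""
    return ((n + align - 1) // align) * align

print(round_up(1000, align_size("mxfp8")))    # 1024
print(round_up(1000, align_size("nvfp4")))    # 1024
print(round_up(1000, align_size("delayed")))  # 1008
```

Larger block-scaled formats need coarser alignment because every scaling block must be complete.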