# Heuristic: NVIDIA TransformerEngine FP8 Recipe Auto-Selection
| Knowledge Sources | |
|---|---|
| Domains | Optimization, LLMs, FP8_Training |
| Last Updated | 2026-02-07 21:00 GMT |
## Overview
Automatic FP8 recipe selection that chooses MXFP8BlockScaling, Float8CurrentScaling, or DelayedScaling based on GPU compute capability for optimal precision and performance.
## Description
TransformerEngine provides a `get_default_fp8_recipe()` function that automatically selects the best FP8 quantization recipe for the current GPU. This removes guesswork and ensures users get the highest-performing recipe their hardware supports. Selection follows a priority order: `MXFP8BlockScaling` (Blackwell, SM 10.x) > `Float8CurrentScaling` (SM 12.0+ fallback) > `DelayedScaling` (Hopper/Ada fallback). Users can override the auto-selection by passing an explicit recipe to `te.autocast()`.
## Usage
Use this heuristic when configuring FP8 training and you are unsure which recipe to use. The default recipe is automatically applied when calling `te.autocast(enabled=True)` without specifying a recipe. Override only when you have specific requirements (e.g., determinism, specific amax history behavior).
## The Insight (Rule of Thumb)
- Action: Call `te.autocast(enabled=True)` without specifying a recipe, or call `get_default_fp8_recipe()` to see which recipe would be selected.
- Value: The auto-selection hierarchy is:
  - SM 10.0-11.x (Blackwell): `MXFP8BlockScaling()` — microscaling FP8 with block-level quantization
  - SM 12.0+ (future architectures): `Float8CurrentScaling()` — temporary fallback until MXFP8 supports all GEMM layouts
  - SM 9.0 (Hopper) / SM 8.9 (Ada): `DelayedScaling()` — classic delayed scaling with an amax history
- Trade-off: MXFP8 offers finer-grained (per-block) scaling but is only available on Blackwell. DelayedScaling is the most widely supported, but its per-tensor scales, derived from an amax history, can be less accurate.
- Override: Users can always pass an explicit recipe, e.g. `te.autocast(enabled=True, recipe=DelayedScaling(amax_history_len=32))`.
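The granularity trade-off can be illustrated with a small, self-contained simulation. This is only an analogy: it uses crude symmetric fixed-step quantization in pure Python, not TransformerEngine's actual FP8 kernels, and the 32-element block size merely mirrors MXFP8's scaling blocks. With one outlier in the data, a single per-tensor scale is inflated for every element, while per-block scaling contains the damage:

```python
import random

BLOCK = 32  # mirrors MXFP8's 32-element scaling blocks

def quant_dequant(xs, levels=127):
    """Fake symmetric quantization: scale by max-abs, round to `levels` steps."""
    scale = max(abs(x) for x in xs) / levels
    if scale == 0.0:
        return list(xs)
    return [round(x / scale) * scale for x in xs]

def per_tensor_error(xs):
    """Mean absolute error when one scale covers the whole tensor."""
    q = quant_dequant(xs)
    return sum(abs(a - b) for a, b in zip(xs, q)) / len(xs)

def per_block_error(xs):
    """Mean absolute error when each 32-element block gets its own scale."""
    q = []
    for i in range(0, len(xs), BLOCK):
        q.extend(quant_dequant(xs[i:i + BLOCK]))
    return sum(abs(a - b) for a, b in zip(xs, q)) / len(xs)

random.seed(0)
# Mostly small activations plus one outlier: the outlier inflates the single
# per-tensor scale, so every small value is rounded on a coarser grid.
data = [random.uniform(-1, 1) for _ in range(256)]
data[7] = 100.0  # outlier lands in the first block

print(per_tensor_error(data) > per_block_error(data))  # True for this data
```

The same effect, at much finer granularity and in hardware, is why MXFP8 improves accuracy over per-tensor FP8 scaling.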
## Reasoning
Each new generation of NVIDIA GPUs introduces improved quantization hardware. Blackwell GPUs have native hardware support for microscaling (MXFP8), which provides per-32-element scaling granularity instead of per-tensor scaling. This dramatically improves numerical accuracy while maintaining FP8 throughput. The auto-selection ensures users always get the best recipe for their hardware without needing to track which GPU supports which format.
The SM 12.0+ exception is temporary: MXFP8 does not yet support all GEMM layouts on those architectures, so TransformerEngine falls back to Float8CurrentScaling, which still performs well using per-tensor scales computed from the current step (no amax history needed).
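The dispatch order can be sketched hardware-independently. In this sketch the recipe names are plain strings and the inputs are explicit parameters; the real implementation instead queries the device via `check_mxfp8_support()` and `get_device_compute_capability()` and returns recipe objects:

```python
def select_fp8_recipe(compute_capability, mxfp8_supported):
    """Mirror of the dispatch order in get_default_fp8_recipe(), using
    plain strings instead of TransformerEngine recipe classes."""
    if mxfp8_supported:
        return "MXFP8BlockScaling"    # Blackwell (SM 10.x): native microscaling
    if compute_capability >= (12, 0):
        # Temporary fallback until MXFP8 covers all GEMM layouts on SM 12.0+.
        return "Float8CurrentScaling"
    return "DelayedScaling"           # Hopper (SM 9.0) / Ada (SM 8.9)

print(select_fp8_recipe((10, 0), mxfp8_supported=True))   # MXFP8BlockScaling
print(select_fp8_recipe((12, 0), mxfp8_supported=False))  # Float8CurrentScaling
print(select_fp8_recipe((9, 0), mxfp8_supported=False))   # DelayedScaling
```

Note that the MXFP8 capability check comes first, so compute capability is only consulted when MXFP8 is unavailable.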
## Code Evidence
Auto-selection logic from `transformer_engine/pytorch/quantization.py:103-110`:
```python
def get_default_fp8_recipe() -> Recipe:
    """FP8 recipe with default args."""
    if check_mxfp8_support()[0]:
        return MXFP8BlockScaling()
    if get_device_compute_capability() >= (12, 0):
        # This is a temporary restriction until MXFP8 is supported for all gemm layouts.
        return Float8CurrentScaling()
    return DelayedScaling()
```
Alignment size varies by recipe from `transformer_engine/pytorch/quantization.py:118-124`:
```python
def get_align_size_for_quantization(recipe: Recipe) -> int:
    """Get the alignment size for quantization."""
    if recipe.mxfp8():
        return 32
    if recipe.nvfp4():
        return 128
    return 16
```
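The alignment size is used to pad tensor dimensions to a multiple the quantization kernels can handle. A minimal sketch of how such a value might be applied; the string keys and the `round_up` helper are illustrative, not TransformerEngine APIs:

```python
def align_size(recipe_name):
    """Alignment (in elements) per recipe family, mirroring the values in
    get_align_size_for_quantization(); the names are illustrative strings."""
    if recipe_name == "mxfp8":
        return 32
    if recipe_name == "nvfp4":
        return 128
    return 16

def round_up(n, align):
    """Round n up to the next multiple of align (e.g. to pad a tensor dim)."""
    return ((n + align - 1) // align) * align

print(round_up(1000, align_size("mxfp8")))    # 1024
print(round_up(1000, align_size("nvfp4")))    # 1024
print(round_up(1000, align_size("delayed")))  # 1008
```

Larger block-scaled formats need coarser alignment because every scaling block must be complete.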