# Implementation:NVIDIA TransformerEngine DelayedScaling Recipe
| Field | Value |
|---|---|
| Page Type | Implementation |
| Repository | NVIDIA TransformerEngine |
| Source File | `transformer_engine/common/recipe/__init__.py` (L121-220) |
| Import | `from transformer_engine.common.recipe import DelayedScaling` |
| Implements | Principle:NVIDIA_TransformerEngine_FP8_Delayed_Scaling |
| Requires Environment | Environment:NVIDIA_TransformerEngine_CUDA_Toolkit_Requirements |
## Overview
Concrete FP8 recipe configuration for delayed scaling provided by TransformerEngine.
## Description
`DelayedScaling` is a frozen Python dataclass that configures the delayed scaling FP8 quantization strategy. It specifies the FP8 format (HYBRID by default), the amax history length, the amax computation algorithm, and a safety margin. This recipe is passed to `te.autocast(recipe=...)` to enable FP8 training with delayed scaling.
As a frozen dataclass, DelayedScaling instances are immutable after construction. All configuration must be provided at instantiation time. The class inherits from Recipe, the base class for all TransformerEngine FP8 recipes.
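The frozen-dataclass semantics described above can be demonstrated with a small, self-contained sketch. A local stand-in class is used here (hypothetical, mirroring two of `DelayedScaling`'s fields) so the snippet runs without TransformerEngine installed; the real class behaves the same way because it is declared with `@dataclass(frozen=True)`:

```python
from dataclasses import dataclass, FrozenInstanceError

# Hypothetical stand-in mirroring DelayedScaling's frozen-dataclass behavior.
@dataclass(frozen=True)
class RecipeSketch:
    margin: int = 0
    amax_history_len: int = 1024

r = RecipeSketch(amax_history_len=512)

# All configuration must be supplied at construction time; later
# assignment to any field raises FrozenInstanceError.
try:
    r.margin = 2
except FrozenInstanceError:
    print("recipe is immutable after construction")
```

To change a setting, construct a new instance (e.g. via `dataclasses.replace`) rather than mutating an existing one.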
## Usage
Use DelayedScaling as the default recipe for FP8 training. It is the recommended starting point for most workloads:
- Instantiate with desired parameters (or use defaults for a standard configuration).
- Pass the instance to `te.autocast(recipe=recipe)`.
- The recipe is read by TE modules during their forward pass to determine scaling behavior.
## Code Reference
### Source Location
| Attribute | Detail |
|---|---|
| File | `transformer_engine/common/recipe/__init__.py` |
| Class | `DelayedScaling` |
| Lines | L121-220 |
| Base Class | `Recipe` |
### Signature
```python
@dataclass(frozen=True)
class DelayedScaling(Recipe):
    margin: int = 0
    fp8_format: Format = Format.HYBRID
    amax_history_len: int = 1024
    amax_compute_algo: str = "max"
    reduce_amax: bool = True
    fp8_dpa: bool = False
    fp8_mha: bool = False
```
### Key Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `margin` | `int` | `0` | Safety margin for scaling factor computation. The scaling factor is divided by 2^margin, reducing the effective FP8 range to provide headroom against overflow. |
| `fp8_format` | `Format` | `Format.HYBRID` | The FP8 format to use. HYBRID uses E4M3 for forward and E5M2 for backward. E4M3 and E5M2 use the same format for both passes. |
| `amax_history_len` | `int` | `1024` | Length of the amax history buffer. Larger values produce more stable but less responsive scaling factors. |
| `amax_compute_algo` | `str` | `"max"` | Algorithm to compute the effective amax from history. Options: `"max"` (maximum over history) or `"most_recent"` (last recorded value). |
| `reduce_amax` | `bool` | `True` | Whether to synchronize (all-reduce) amax values across distributed ranks. Set to `True` for data-parallel or tensor-parallel training. |
| `fp8_dpa` | `bool` | `False` | Enable FP8 execution for dot-product attention (the QK^T computation). |
| `fp8_mha` | `bool` | `False` | Enable FP8 execution for the full multi-head attention block, including the attention output projection. |
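To make the interplay of `margin`, `amax_history_len`, and `amax_compute_algo` concrete, here is a minimal sketch of how a delayed-scaling factor could be derived from these fields. It follows the formula implied by the parameter descriptions (scale = fp8_max / amax, then divided by 2^margin); the constant 448.0 is the largest E4M3 magnitude. This is an illustrative model, not TE's internal implementation:

```python
E4M3_MAX = 448.0  # largest representable E4M3 magnitude

def compute_scale(amax_history, amax_compute_algo="max", margin=0,
                  fp8_max=E4M3_MAX):
    """Sketch of delayed-scaling factor computation from recipe fields."""
    if amax_compute_algo == "max":
        amax = max(amax_history)      # maximum over the rolling history
    else:                             # "most_recent"
        amax = amax_history[-1]       # last recorded value
    # Dividing by 2**margin shrinks the scale, leaving headroom
    # below the FP8 maximum to guard against overflow.
    return fp8_max / (amax * 2.0 ** margin)

history = [1.5, 3.0, 2.2]             # rolling amax buffer (length <= amax_history_len)
print(compute_scale(history))                    # "max": uses amax = 3.0
print(compute_scale(history, "most_recent", 1))  # uses amax = 2.2, halved by margin=1
```

The `"max"` algorithm reacts to the largest value anywhere in the history window, so a single spike keeps the scale conservative until it ages out of the buffer; `"most_recent"` tracks the latest statistics more aggressively.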
## I/O Contract
### Input
| Input | Type | Description |
|---|---|---|
| `margin` | `int` | Scaling factor safety margin. |
| `fp8_format` | `Format` | FP8 format selection (HYBRID, E4M3, or E5M2). |
| `amax_history_len` | `int` | Size of the rolling amax history buffer. |
| `amax_compute_algo` | `str` | Algorithm for deriving the effective amax from history. |
| `reduce_amax` | `bool` | Whether to all-reduce amax across distributed ranks. |
| `fp8_dpa` | `bool` | Enable FP8 dot-product attention. |
| `fp8_mha` | `bool` | Enable FP8 multi-head attention. |
### Output
| Output | Type | Description |
|---|---|---|
| Recipe object | `DelayedScaling` | An immutable recipe instance passed to `te.autocast(recipe=...)`. Consumed by TE modules to configure FP8 scaling behavior during the forward and backward passes. |
## Usage Examples
### Default Configuration
```python
from transformer_engine.common.recipe import DelayedScaling

# All defaults: HYBRID format, history length 1024, "max" algorithm
recipe = DelayedScaling()
```
### Custom History and Algorithm
```python
from transformer_engine.common.recipe import DelayedScaling, Format

recipe = DelayedScaling(
    fp8_format=Format.HYBRID,
    amax_history_len=1024,
    amax_compute_algo="max",
)
```
### With Safety Margin
```python
from transformer_engine.common.recipe import DelayedScaling, Format

# Use margin=2 for extra headroom against overflow
recipe = DelayedScaling(
    fp8_format=Format.HYBRID,
    margin=2,
    amax_history_len=512,
    amax_compute_algo="most_recent",
)
```
### With FP8 Attention Enabled
```python
from transformer_engine.common.recipe import DelayedScaling

recipe = DelayedScaling(
    fp8_dpa=True,
    fp8_mha=True,
)
```
### Full Training Loop Example
```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import DelayedScaling, Format

recipe = DelayedScaling(
    fp8_format=Format.HYBRID,
    amax_history_len=1024,
    amax_compute_algo="max",
)

model = te.TransformerLayer(
    hidden_size=1024,
    ffn_hidden_size=4096,
    num_attention_heads=16,
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# dataloader and loss_fn are assumed to be defined elsewhere
for batch in dataloader:
    optimizer.zero_grad()
    with te.autocast(enabled=True, recipe=recipe):
        output = model(batch["input"])
        loss = loss_fn(output, batch["target"])
    loss.backward()
    optimizer.step()
```
## Related Pages
- Principle:NVIDIA_TransformerEngine_FP8_Delayed_Scaling -- The principle describing delayed scaling for FP8.
- Implementation:NVIDIA_TransformerEngine_TE_Autocast -- The context manager that consumes this recipe.
- Implementation:NVIDIA_TransformerEngine_Float8CurrentScaling_Recipe -- The alternative current scaling recipe.
- Environment:NVIDIA_TransformerEngine_CUDA_Toolkit_Requirements
- Environment:NVIDIA_TransformerEngine_GPU_Compute_Capability
- Heuristic:NVIDIA_TransformerEngine_FP8_Recipe_Auto_Selection