
Implementation:NVIDIA TransformerEngine DelayedScaling Recipe

From Leeroopedia


Page Type: Implementation
Repository: NVIDIA TransformerEngine
Source File: transformer_engine/common/recipe/__init__.py (L121-220)
Import: from transformer_engine.common.recipe import DelayedScaling
Implements: Principle:NVIDIA_TransformerEngine_FP8_Delayed_Scaling
Requires Environment: Environment:NVIDIA_TransformerEngine_CUDA_Toolkit_Requirements

Overview

Concrete FP8 recipe configuration for delayed scaling provided by TransformerEngine.

Description

DelayedScaling is a frozen Python dataclass that configures the delayed scaling FP8 quantization strategy. It specifies the FP8 format (HYBRID by default), the amax history length, the amax computation algorithm, and the scaling margin. This recipe is passed to te.autocast(recipe=...) to enable FP8 training with delayed scaling.

As a frozen dataclass, DelayedScaling instances are immutable after construction. All configuration must be provided at instantiation time. The class inherits from Recipe, the base class for all TransformerEngine FP8 recipes.
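The immutability described above follows directly from @dataclass(frozen=True): assigning to a field after construction raises dataclasses.FrozenInstanceError. A minimal stand-in sketch (a plain frozen dataclass with the same defaults, not the real TransformerEngine class) illustrates the behavior:

```python
from dataclasses import dataclass, FrozenInstanceError

# Stand-in for DelayedScaling: a frozen dataclass with matching defaults.
# Illustrative only -- not the real TransformerEngine class.
@dataclass(frozen=True)
class DelayedScalingSketch:
    margin: int = 0
    amax_history_len: int = 1024
    amax_compute_algo: str = "max"

# All configuration must be supplied at instantiation time.
recipe = DelayedScalingSketch(amax_history_len=512)
print(recipe.amax_history_len)  # 512

try:
    recipe.margin = 2  # mutation of a frozen instance is rejected
except FrozenInstanceError:
    print("immutable: fields cannot be changed after construction")
```

To use a different configuration, construct a new instance rather than modifying an existing one.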

Usage

Use DelayedScaling as the default recipe for FP8 training. It is the recommended starting point for most workloads:

  • Instantiate with desired parameters (or use defaults for a standard configuration).
  • Pass the instance to te.autocast(recipe=recipe).
  • The recipe is read by TE modules during their forward pass to determine scaling behavior.

Code Reference

Source Location

File: transformer_engine/common/recipe/__init__.py
Class: DelayedScaling
Lines: L121-220
Base Class: Recipe

Signature

@dataclass(frozen=True)
class DelayedScaling(Recipe):
    margin: int = 0
    fp8_format: Format = Format.HYBRID
    amax_history_len: int = 1024
    amax_compute_algo: str = "max"
    reduce_amax: bool = True
    fp8_dpa: bool = False
    fp8_mha: bool = False

Key Parameters

margin (int, default 0): Safety margin for scaling factor computation. The scaling factor is divided by 2^margin, reducing the effective FP8 range to provide headroom against overflow.
fp8_format (Format, default Format.HYBRID): The FP8 format to use. HYBRID uses E4M3 for the forward pass and E5M2 for the backward pass; the E4M3 and E5M2 options each use a single format for both passes.
amax_history_len (int, default 1024): Length of the amax history buffer. Larger values produce more stable but less responsive scaling factors.
amax_compute_algo (str, default "max"): Algorithm for computing the effective amax from the history. Options: "max" (maximum over the history) or "most_recent" (last recorded value).
reduce_amax (bool, default True): Whether to synchronize (all-reduce) amax values across distributed ranks. Set to True for data-parallel or tensor-parallel training.
fp8_dpa (bool, default False): Enable FP8 execution for dot-product attention (the QK^T computation).
fp8_mha (bool, default False): Enable FP8 execution for the full multi-head attention block, including the attention output projection.
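The interplay between the amax history, the compute algorithm, and the margin can be sketched in plain Python. This is a simplified model of delayed scaling (scale = fp8_max / amax / 2^margin, using 448 as the E4M3 maximum); the function and variable names are illustrative, not TransformerEngine internals, and the real CUDA implementation may differ in detail:

```python
from collections import deque

E4M3_MAX = 448.0  # largest representable magnitude in FP8 E4M3

def effective_amax(history, algo="max"):
    """Reduce the rolling amax history to a single value."""
    if algo == "max":
        return max(history)       # most conservative: peak over the window
    if algo == "most_recent":
        return history[-1]        # most responsive: last recorded amax
    raise ValueError(f"unknown algo: {algo}")

def scaling_factor(amax, margin=0, fp8_max=E4M3_MAX):
    """Simplified delayed-scaling factor: map amax onto the FP8 range,
    then divide by 2**margin for extra overflow headroom."""
    return (fp8_max / amax) / (2 ** margin)

# Rolling history buffer (length 4 for illustration; the default is 1024)
history = deque([2.0, 8.0, 4.0, 3.0], maxlen=4)

print(effective_amax(history, "max"))          # 8.0
print(effective_amax(history, "most_recent"))  # 3.0
print(scaling_factor(8.0, margin=0))           # 56.0
print(scaling_factor(8.0, margin=2))           # 14.0
```

Note how margin=2 shrinks the scaling factor by a factor of four, trading dynamic range for headroom, and how "max" reacts to the largest value anywhere in the window while "most_recent" tracks only the latest step.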

I/O Contract

Input

margin (int): Scaling factor safety margin.
fp8_format (Format): FP8 format selection (HYBRID, E4M3, or E5M2).
amax_history_len (int): Size of the rolling amax history buffer.
amax_compute_algo (str): Algorithm for deriving the effective amax from the history.
reduce_amax (bool): Whether to all-reduce amax across distributed ranks.
fp8_dpa (bool): Enable FP8 dot-product attention.
fp8_mha (bool): Enable FP8 multi-head attention.

Output

Recipe object (DelayedScaling): An immutable recipe instance passed to te.autocast(recipe=...). Consumed by TE modules to configure FP8 scaling behavior during the forward and backward passes.

Usage Examples

Default Configuration

from transformer_engine.common.recipe import DelayedScaling

# All defaults: HYBRID format, history length 1024, "max" algorithm
recipe = DelayedScaling()

Custom History and Algorithm

from transformer_engine.common.recipe import DelayedScaling, Format

recipe = DelayedScaling(
    fp8_format=Format.HYBRID,
    amax_history_len=1024,
    amax_compute_algo="max",
)

With Safety Margin

from transformer_engine.common.recipe import DelayedScaling, Format

# Use margin=2 for extra headroom against overflow
recipe = DelayedScaling(
    fp8_format=Format.HYBRID,
    margin=2,
    amax_history_len=512,
    amax_compute_algo="most_recent",
)

With FP8 Attention Enabled

from transformer_engine.common.recipe import DelayedScaling

recipe = DelayedScaling(
    fp8_dpa=True,
    fp8_mha=True,
)

Full Training Loop Example

import torch
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import DelayedScaling, Format

recipe = DelayedScaling(
    fp8_format=Format.HYBRID,
    amax_history_len=1024,
    amax_compute_algo="max",
)

model = te.TransformerLayer(
    hidden_size=1024,
    ffn_hidden_size=4096,
    num_attention_heads=16,
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

for batch in dataloader:
    optimizer.zero_grad()
    # Forward pass runs in FP8 under the autocast region
    with te.autocast(enabled=True, recipe=recipe):
        output = model(batch["input"])
    # Loss is computed outside the FP8 region, in higher precision
    loss = loss_fn(output, batch["target"])
    loss.backward()
    optimizer.step()
