
Principle:NVIDIA TransformerEngine FP8 Delayed Scaling

From Leeroopedia


Field           Value
Page Type       Principle
Repository      NVIDIA TransformerEngine
Domains         Deep_Learning, Quantization
Sources         TransformerEngine, FP8 Formats for Deep Learning
Implemented By  Implementation:NVIDIA_TransformerEngine_DelayedScaling_Recipe

Overview

Computing FP8 scaling factors from historical absolute maximum (amax) values for stable quantization.

Description

Delayed scaling uses a history of amax values from previous training iterations to compute scaling factors. This avoids the overhead of per-tensor amax computation in the current iteration, at the cost of slightly stale scaling factors.

The core mechanism works as follows:

  1. During each forward pass, the amax (absolute maximum value) of each tensor being quantized to FP8 is recorded.
  2. These amax values are stored in a rolling history buffer of configurable length.
  3. The scaling factor for the next iteration is computed from this history buffer, using either the maximum over the entire history or the most recent value.
  4. The computed scaling factor is applied to quantize the tensor in the next forward pass.

Because the scaling factor is derived from past iterations rather than the current tensor, there is an inherent one-step lag. In practice, this lag is negligible for most training workloads because tensor value distributions change gradually across iterations.
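The four steps above can be sketched in plain Python (a minimal illustration with hypothetical helper names, not the TransformerEngine implementation):

```python
from collections import deque

FP8_E4M3_MAX = 448.0  # maximum representable value in the E4M3 format

def compute_scale(amax_history, margin=0):
    """Derive the scaling factor from the amax history ("max" algorithm)."""
    amax = max(amax_history)
    return FP8_E4M3_MAX / (amax * 2 ** margin)

history = deque(maxlen=16)   # rolling amax buffer (amax_history_len=16 here)
history.append(1.0)          # seed so the first iteration has a scale

for tensor in [[0.5, -2.0], [3.0, 1.0], [-4.0, 2.5]]:
    scale = compute_scale(history)               # from *past* iterations only
    quantized = [max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, x * scale))
                 for x in tensor]                # scale, then saturate to FP8 range
    history.append(max(abs(x) for x in tensor))  # record this step's amax
```

Note that `scale` is computed before the current tensor's amax is appended, which is exactly the one-step lag described above.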

Usage

Use as the default FP8 recipe for most training workloads. Delayed scaling is suitable when:

  • Training throughput is a priority and the slight staleness of scaling factors is acceptable.
  • The model exhibits stable training dynamics without sudden changes in activation magnitudes.
  • The overhead of per-tensor amax computation (as in current scaling) is a concern.

Prefer delayed scaling over current scaling when:

  • Running on Hopper (H100) GPUs where current scaling introduces measurable overhead.
  • The training loss curve is stable and does not exhibit spikes from quantization artifacts.
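In TransformerEngine's PyTorch API, delayed scaling is selected by constructing a DelayedScaling recipe and entering fp8_autocast. A configuration sketch (layer sizes and tensor shapes are placeholders; a CUDA device with FP8-capable hardware is assumed):

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import DelayedScaling, Format

# "max" over a 1024-entry history matches the documented defaults.
fp8_recipe = DelayedScaling(
    fp8_format=Format.HYBRID,   # E4M3 forward, E5M2 backward
    amax_history_len=1024,
    amax_compute_algo="max",
)

layer = te.Linear(768, 768)     # an FP8-capable TE module (sizes are placeholders)
inp = torch.randn(32, 768, device="cuda")

with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    out = layer(inp)            # GEMMs run in FP8 using delayed scaling
```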

Theoretical Basis

Scaling Factor Computation

The scaling factor is computed as:

scale = FP8_MAX / (amax * 2^margin)

where:

  • FP8_MAX is the maximum representable value in the target FP8 format (448 for E4M3, 57344 for E5M2).
  • amax is derived from the history buffer using the configured algorithm.
  • margin is an optional integer safety margin (default 0) that reduces the effective range to prevent overflow.
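A worked instance of the formula, using E4M3 and a hypothetical amax of 100:

```python
FP8_E4M3_MAX = 448.0  # maximum representable value in E4M3

def fp8_scale(amax, fp8_max=FP8_E4M3_MAX, margin=0):
    # scale = FP8_MAX / (amax * 2^margin)
    return fp8_max / (amax * 2 ** margin)

scale_no_margin = fp8_scale(100.0)            # 448 / 100 = 4.48
scale_margin_1 = fp8_scale(100.0, margin=1)   # 448 / 200 = 2.24, extra headroom
```

Each increment of margin halves the scale, trading representable range for protection against amax values that exceed the history.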

Amax History Buffer

The history buffer is a fixed-length circular buffer that stores the amax values from the most recent N iterations (where N = amax_history_len, default 1024). Two algorithms are available for deriving the effective amax from this buffer:

  • "max": amax = max(history[0:N]). Takes the maximum amax across the entire history window. More conservative -- the larger effective amax yields a smaller scaling factor that is less likely to cause overflow, but may underutilize the FP8 range.
  • "most_recent": amax = history[-1]. Uses only the most recent amax value. More responsive -- adapts quickly to changes in tensor distributions, but more susceptible to transient spikes.
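A small illustration of how the two algorithms diverge on the same buffer (values are made up; a transient spike sits in the middle of the window):

```python
# Same amax history, two reduction algorithms.
history = [2.0, 3.0, 150.0, 2.5]   # oldest ... newest; spike at index 2

amax_max = max(history)            # "max": 150.0 -- the spike dominates the window
amax_recent = history[-1]          # "most_recent": 2.5 -- the spike is already forgotten

scale_max = 448.0 / amax_max       # ~2.99: conservative, underutilizes the range now
scale_recent = 448.0 / amax_recent # 179.2: responsive, but would overflow if the spike repeats
```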

Trade-offs

Aspect       Delayed Scaling                    Current Scaling
Amax source  Historical (previous iterations)   Current tensor (this iteration)
Overhead     No additional pass over data       Requires extra reduction per tensor
Staleness    One-step lag in scaling factors    No staleness
Stability    Smoothed by history window         Can be volatile with sudden distribution shifts
Default for  Hopper (H100) training             Blackwell (B200+) training

History Length Considerations

  • Short history (e.g., 1-10): More responsive to distribution changes but more volatile. At length 1 it approaches current scaling behavior, apart from the one-step lag.
  • Long history (e.g., 1024, the default): More stable scaling factors but slower to adapt. Suitable for steady-state training where activations change gradually.
  • Very long history (e.g., 10000+): Overly conservative; may fail to capture meaningful distribution shifts during training (e.g., learning rate warmup, curriculum changes).
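The effect of history length can be simulated directly (a toy sketch using the "max" algorithm; values are illustrative):

```python
from collections import deque

def run(history_len, amaxes):
    """Track the scale as new amax values arrive, "max" over the window."""
    hist = deque([100.0], maxlen=history_len)  # seeded during a large-activation phase
    scales = []
    for a in amaxes:
        scales.append(448.0 / max(hist))       # scale from the current window
        hist.append(a)
    return scales

# Distribution has shifted: the per-step amax is now 1.0 instead of 100.0.
amaxes = [1.0] * 5
scales_short = run(2, amaxes)      # evicts 100.0 after two steps, scale re-tightens
scales_long = run(1024, amaxes)    # 100.0 stays in the window, scale stays loose
```

The short window recovers a tight scale (448.0) within a few steps, while the long window keeps the stale, overly conservative scale (4.48) for the rest of the run.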
