
Principle:NVIDIA TransformerEngine FP8 Delayed Scaling

From Leeroopedia


Field           Value
Page Type       Principle
Repository      NVIDIA TransformerEngine
Domains         Deep_Learning, Quantization
Sources         TransformerEngine, FP8 Formats for Deep Learning
Implemented By  Implementation:NVIDIA_TransformerEngine_DelayedScaling_Recipe

Overview

Computing FP8 scaling factors from historical absolute maximum (amax) values for stable quantization.

Description

Delayed scaling uses a history of amax values from previous training iterations to compute scaling factors. This avoids the overhead of per-tensor amax computation in the current iteration, at the cost of slightly stale scaling factors.

The core mechanism works as follows:

  1. During each forward pass, the amax (absolute maximum value) of each tensor being quantized to FP8 is recorded.
  2. These amax values are stored in a rolling history buffer of configurable length.
  3. The scaling factor for the next iteration is computed from this history buffer, using either the maximum over the entire history or the most recent value.
  4. The computed scaling factor is applied to quantize the tensor in the next forward pass.

Because the scaling factor is derived from past iterations rather than the current tensor, there is an inherent one-step lag. In practice, this lag is negligible for most training workloads because tensor value distributions change gradually across iterations.
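The four steps above can be sketched in plain Python (a minimal illustration with hypothetical helper names, not the TransformerEngine implementation):

```python
from collections import deque

FP8_E4M3_MAX = 448.0  # maximum representable value in the E4M3 format

def compute_scale(amax_history, margin=0):
    """Derive the scaling factor from the amax history ("max" algorithm)."""
    amax = max(amax_history)
    return FP8_E4M3_MAX / (amax * 2 ** margin)

history = deque(maxlen=16)   # rolling amax buffer (amax_history_len=16 here)
history.append(1.0)          # seed so the first iteration has a scale

for tensor in [[0.5, -2.0], [3.0, 1.0], [-4.0, 2.5]]:
    scale = compute_scale(history)               # from *past* iterations only
    quantized = [max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, x * scale))
                 for x in tensor]                # scale, then saturate to FP8 range
    history.append(max(abs(x) for x in tensor))  # record this step's amax
```

Note that `scale` is computed before the current tensor's amax is appended, which is exactly the one-step lag described above.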

Usage

Use as the default FP8 recipe for most training workloads. Delayed scaling is suitable when:

  • Training throughput is a priority and the slight staleness of scaling factors is acceptable.
  • The model exhibits stable training dynamics without sudden changes in activation magnitudes.
  • The overhead of per-tensor amax computation (as in current scaling) is a concern.

Prefer delayed scaling over current scaling when:

  • Running on Hopper (H100) GPUs where current scaling introduces measurable overhead.
  • The training loss curve is stable and does not exhibit spikes from quantization artifacts.
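In TransformerEngine's PyTorch API, delayed scaling is selected by constructing a DelayedScaling recipe and entering fp8_autocast. A configuration sketch (layer sizes and tensor shapes are placeholders; a CUDA device with FP8-capable hardware is assumed):

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import DelayedScaling, Format

# "max" over a 1024-entry history matches the documented defaults.
fp8_recipe = DelayedScaling(
    fp8_format=Format.HYBRID,   # E4M3 forward, E5M2 backward
    amax_history_len=1024,
    amax_compute_algo="max",
)

layer = te.Linear(768, 768)     # an FP8-capable TE module (sizes are placeholders)
inp = torch.randn(32, 768, device="cuda")

with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    out = layer(inp)            # GEMMs run in FP8 using delayed scaling
```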

Theoretical Basis

Scaling Factor Computation

The scaling factor is computed as:

scale = FP8_MAX / (amax * 2^margin)

where:

  • FP8_MAX is the maximum representable value in the target FP8 format (448 for E4M3, 57344 for E5M2).
  • amax is derived from the history buffer using the configured algorithm.
  • margin is an optional integer safety margin (default 0) that reduces the effective range to prevent overflow.
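A worked instance of the formula, using E4M3 and a hypothetical amax of 100:

```python
FP8_E4M3_MAX = 448.0  # maximum representable value in E4M3

def fp8_scale(amax, fp8_max=FP8_E4M3_MAX, margin=0):
    # scale = FP8_MAX / (amax * 2^margin)
    return fp8_max / (amax * 2 ** margin)

scale_no_margin = fp8_scale(100.0)            # 448 / 100 = 4.48
scale_margin_1 = fp8_scale(100.0, margin=1)   # 448 / 200 = 2.24, extra headroom
```

Each increment of margin halves the scale, trading representable range for protection against amax values that exceed the history.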

Amax History Buffer

The history buffer is a fixed-length circular buffer that stores the amax values from the most recent N iterations (where N = amax_history_len, default 1024). Two algorithms are available for deriving the effective amax from this buffer:

  • "max": amax = max(history[0:N]). Takes the maximum amax across the entire history window. More conservative -- the larger effective amax yields a smaller scaling factor that is less likely to cause overflow, but may underutilize the FP8 range.
  • "most_recent": amax = history[-1]. Uses only the most recent amax value. More responsive -- adapts quickly to changes in tensor distributions, but more susceptible to transient spikes.
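A small illustration of how the two algorithms diverge on the same buffer (values are made up; a transient spike sits in the middle of the window):

```python
# Same amax history, two reduction algorithms.
history = [2.0, 3.0, 150.0, 2.5]   # oldest ... newest; spike at index 2

amax_max = max(history)            # "max": 150.0 -- the spike dominates the window
amax_recent = history[-1]          # "most_recent": 2.5 -- the spike is already forgotten

scale_max = 448.0 / amax_max       # ~2.99: conservative, underutilizes the range now
scale_recent = 448.0 / amax_recent # 179.2: responsive, but would overflow if the spike repeats
```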

Trade-offs

Aspect       Delayed Scaling                    Current Scaling
Amax source  Historical (previous iterations)   Current tensor (this iteration)
Overhead     No additional pass over data       Requires extra reduction per tensor
Staleness    One-step lag in scaling factors    No staleness
Stability    Smoothed by history window         Can be volatile with sudden distribution shifts
Default for  Hopper (H100) training             Blackwell (B200+) training

History Length Considerations

  • Short history (e.g., 1-10): More responsive to distribution changes but more volatile. At length 1 it approaches current scaling behavior, apart from the one-step lag.
  • Long history (e.g., 1024, the default): More stable scaling factors but slower to adapt. Suitable for steady-state training where activations change gradually.
  • Very long history (e.g., 10000+): Overly conservative; may fail to capture meaningful distribution shifts during training (e.g., learning rate warmup, curriculum changes).
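The effect of history length can be simulated directly (a toy sketch using the "max" algorithm; values are illustrative):

```python
from collections import deque

def run(history_len, amaxes):
    """Track the scale as new amax values arrive, "max" over the window."""
    hist = deque([100.0], maxlen=history_len)  # seeded during a large-activation phase
    scales = []
    for a in amaxes:
        scales.append(448.0 / max(hist))       # scale from the current window
        hist.append(a)
    return scales

# Distribution has shifted: the per-step amax is now 1.0 instead of 100.0.
amaxes = [1.0] * 5
scales_short = run(2, amaxes)      # evicts 100.0 after two steps, scale re-tightens
scales_long = run(1024, amaxes)    # 100.0 stays in the window, scale stays loose
```

The short window recovers a tight scale (448.0) within a few steps, while the long window keeps the stale, overly conservative scale (4.48) for the rest of the run.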
