Principle:NVIDIA TransformerEngine FP8 Delayed Scaling
| Field | Value |
|---|---|
| Page Type | Principle |
| Repository | NVIDIA TransformerEngine |
| Domains | Deep_Learning, Quantization |
| Sources | TransformerEngine, FP8 Formats for Deep Learning |
| Implemented By | Implementation:NVIDIA_TransformerEngine_DelayedScaling_Recipe |
Overview
Computing FP8 scaling factors from historical absolute maximum (amax) values for stable quantization.
Description
Delayed scaling uses a history of amax values from previous training iterations to compute scaling factors. This avoids the overhead of per-tensor amax computation in the current iteration, at the cost of slightly stale scaling factors.
The core mechanism works as follows:
- During each forward pass, the amax (absolute maximum value) of each tensor being quantized to FP8 is recorded.
- These amax values are stored in a rolling history buffer of configurable length.
- The scaling factor for the next iteration is computed from this history buffer, using either the maximum over the entire history or the most recent value.
- The computed scaling factor is applied to quantize the tensor in the next forward pass.
Because the scaling factor is derived from past iterations rather than the current tensor, there is an inherent one-step lag. In practice, this lag is negligible for most training workloads because tensor value distributions change gradually across iterations.
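The mechanism above can be sketched in plain Python. This is a simplified, framework-free model of the bookkeeping (the `DelayedScaler` class and its names are illustrative, not TransformerEngine API):

```python
from collections import deque

FP8_E4M3_MAX = 448.0  # max representable value in the E4M3 format

class DelayedScaler:
    """Toy model of delayed scaling: the scale applied at step t is
    derived from amax values recorded at steps before t."""

    def __init__(self, history_len=1024, margin=0):
        self.history = deque(maxlen=history_len)  # rolling amax buffer
        self.margin = margin
        self.scale = 1.0  # initial scale before any history exists

    def quantize(self, tensor):
        # Quantize using the scale computed from *past* iterations.
        q = [max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, x * self.scale))
             for x in tensor]
        # Record this iteration's amax for future scale updates.
        self.history.append(max(abs(x) for x in tensor))
        # Update the scale for the *next* iteration ("max" algorithm).
        amax = max(self.history)
        self.scale = FP8_E4M3_MAX / (amax * 2 ** self.margin)
        return q

scaler = DelayedScaler(history_len=16)
scaler.quantize([0.5, -2.0, 1.5])  # first step uses the initial scale
print(scaler.scale)                # now reflects amax = 2.0 -> 448 / 2 = 224.0
```

Note the one-step lag: the first call quantizes with the initial scale of 1.0, and the amax it observes only affects the scale of the following call.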
Usage
Use as the default FP8 recipe for most training workloads. Delayed scaling is suitable when:
- Training throughput is a priority and the slight staleness of scaling factors is acceptable.
- The model exhibits stable training dynamics without sudden changes in activation magnitudes.
- The overhead of per-tensor amax computation (as in current scaling) is a concern.
Prefer delayed scaling over current scaling when:
- Running on Hopper (H100) GPUs where current scaling introduces measurable overhead.
- The training loss curve is stable and does not exhibit spikes from quantization artifacts.
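In TransformerEngine's PyTorch API, this choice is expressed by constructing a `DelayedScaling` recipe and passing it to `fp8_autocast`. A minimal configuration sketch, assuming a TE-wrapped module `model` and input `inp` defined elsewhere:

```python
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import DelayedScaling, Format

# Delayed-scaling recipe: 1024-step amax history, "max" reduction, no margin.
recipe = DelayedScaling(
    margin=0,
    fp8_format=Format.HYBRID,   # E4M3 for forward, E5M2 for backward
    amax_history_len=1024,
    amax_compute_algo="max",
)

with te.fp8_autocast(enabled=True, fp8_recipe=recipe):
    out = model(inp)  # `model` and `inp` assumed defined elsewhere
```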
Theoretical Basis
Scaling Factor Computation
The scaling factor is computed as:
scale = FP8_MAX / (amax * 2^margin)
where:
- FP8_MAX is the maximum representable value in the target FP8 format (448 for E4M3, 57344 for E5M2).
- amax is derived from the history buffer using the configured algorithm.
- margin is an optional integer safety margin (default 0) that reduces the effective range to prevent overflow.
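As a worked instance of the formula (plain Python, with illustrative numbers):

```python
FP8_E4M3_MAX = 448.0     # max representable value in E4M3
FP8_E5M2_MAX = 57344.0   # max representable value in E5M2

def compute_scale(amax, fp8_max, margin=0):
    # scale = FP8_MAX / (amax * 2^margin); margin > 0 leaves extra headroom.
    return fp8_max / (amax * 2 ** margin)

print(compute_scale(2.0, FP8_E4M3_MAX))            # 224.0
print(compute_scale(2.0, FP8_E4M3_MAX, margin=1))  # 112.0 (half the range used)
```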
Amax History Buffer
The history buffer is a fixed-length circular buffer that stores the amax values from the most recent N iterations (where N = amax_history_len, default 1024). Two algorithms are available for deriving the effective amax from this buffer:
| Algorithm | Formula | Behavior |
|---|---|---|
"max" |
amax = max(history[0:N]) |
Takes the maximum amax across the entire history window. More conservative -- produces larger scaling factors that are less likely to cause overflow, but may underutilize the FP8 range. |
"most_recent" |
amax = history[-1] |
Uses only the most recent amax value. More responsive -- adapts quickly to changes in tensor distributions, but more susceptible to transient spikes. |
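The two reductions can be contrasted directly in plain Python; `history` here is a hypothetical amax buffer containing one transient spike:

```python
FP8_E4M3_MAX = 448.0

history = [2.0, 2.1, 8.0, 1.9, 2.0]  # transient spike at 8.0, now past

amax_max = max(history)       # "max": the spike dominates the whole window
amax_recent = history[-1]     # "most_recent": the spike is already forgotten

print(FP8_E4M3_MAX / amax_max)     # 56.0  -> conservative, smaller scale
print(FP8_E4M3_MAX / amax_recent)  # 224.0 -> responsive, larger scale
```

The "max" reduction keeps the spike's influence for as long as it stays in the window; "most_recent" discards it immediately but would also have reacted to it for one step.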
Trade-offs
| Aspect | Delayed Scaling | Current Scaling |
|---|---|---|
| Amax Source | Historical (previous iterations) | Current tensor (this iteration) |
| Overhead | No additional pass over data | Requires extra reduction per tensor |
| Staleness | One-step lag in scaling factors | No staleness |
| Stability | Smoothed by history window | Can be volatile with sudden distribution shifts |
| Default For | Hopper (H100) training | Blackwell (B200+) training |
History Length Considerations
- Short history (e.g., 1-10): More responsive to distribution changes but more volatile. At length 1 the scale tracks only the previous iteration's amax, which approximates current-scaling behavior but still lags by one step.
- Long history (e.g., 1024, the default): More stable scaling factors but slower to adapt. Suitable for steady-state training where activations change gradually.
- Very long history (e.g., 10000+): Overly conservative; may fail to capture meaningful distribution shifts during training (e.g., learning rate warmup, curriculum changes).
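The effect of history length can be seen by sliding windows of different sizes over the same amax trace. A simplified illustration in plain Python, using a synthetic trace whose magnitudes shrink after an early warmup phase:

```python
from collections import deque

# Synthetic amax trace: large warmup values, then a smaller steady state.
trace = [8.0] * 5 + [2.0] * 20

def final_amax(trace, history_len):
    buf = deque(maxlen=history_len)  # rolling amax buffer
    for a in trace:
        buf.append(a)
    return max(buf)  # "max" algorithm over the rolling window

print(final_amax(trace, 4))    # 2.0 -> short window has adapted to steady state
print(final_amax(trace, 100))  # 8.0 -> long window still pinned to warmup values
```

With the long window, the stale warmup amax keeps the scale unnecessarily small long after the distribution has shifted, which is the underutilization risk described above.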
Related Pages
- Implementation:NVIDIA_TransformerEngine_DelayedScaling_Recipe -- The concrete recipe class implementing delayed scaling configuration.
- Principle:NVIDIA_TransformerEngine_FP8_Quantization -- The parent principle describing FP8 quantization in TransformerEngine.
- Principle:NVIDIA_TransformerEngine_FP8_Current_Scaling -- The alternative scaling strategy using current-iteration amax values.