Principle:NVIDIA TransformerEngine FP8 Current Scaling
| Field | Value |
|---|---|
| Page Type | Principle |
| Repository | NVIDIA TransformerEngine |
| Domains | Deep_Learning, Quantization |
| Sources | TransformerEngine, FP8 Formats for Deep Learning |
| Implemented By | Implementation:NVIDIA_TransformerEngine_Float8CurrentScaling_Recipe |
Overview
Computing FP8 scaling factors from the current iteration's absolute maximum for more precise quantization.
Description
Current scaling computes the amax of each tensor in the current forward pass and uses it to determine the scaling factor immediately. This provides more accurate scaling at the cost of an additional reduction operation per tensor per iteration.
Unlike delayed scaling, which derives scaling factors from a history buffer of past amax values, current scaling operates on the live tensor data. The process for each quantized tensor is:
- Compute the absolute maximum (amax) of the tensor.
- Derive the scaling factor: scale = FP8_MAX / amax.
- Apply the scaling factor and quantize the tensor to FP8.
This eliminates the one-step staleness inherent in delayed scaling, ensuring that the scaling factor is always optimally matched to the current tensor distribution. The trade-off is the additional overhead of the amax reduction operation, which requires a full pass over the tensor data before quantization can proceed.
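The three steps above can be sketched in plain NumPy (a minimal illustration of the scheme, not the TransformerEngine kernel; the actual cast to an FP8 dtype is replaced here by clipping to the E4M3 range):

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # maximum representable magnitude in E4M3

def current_scale_quantize(x):
    """Sketch of per-tensor current scaling (not the TE implementation).

    Returns the scaled tensor (ready for an FP8 cast) and the scale,
    so original values can be recovered as x_scaled / scale.
    """
    amax = np.abs(x).max()       # step 1: amax of the live tensor
    scale = FP8_E4M3_MAX / amax  # step 2: scale from the current amax
    # step 3: apply the scale; clipping stands in for the FP8 cast
    x_scaled = np.clip(x * scale, -FP8_E4M3_MAX, FP8_E4M3_MAX)
    return x_scaled, scale

x = np.array([0.5, -3.5, 2.0])
x_scaled, scale = current_scale_quantize(x)
# amax = 3.5, so scale = 448 / 3.5 = 128 and the largest-magnitude
# element maps exactly to -448, the edge of the E4M3 range
```

Because the scale is derived from this tensor's own amax, the representable range is fully used with no headroom wasted on stale history.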
Usage
Use current scaling when:
- Training accuracy is more important than throughput: Current scaling produces tighter quantization with less wasted dynamic range.
- Delayed scaling produces unstable training: If the model exhibits loss spikes or divergence with delayed scaling, current scaling can provide more stable behavior.
- Running on Blackwell+ GPUs: On newer GPU architectures, the overhead of per-tensor amax computation is minimal due to hardware improvements, making current scaling the preferred default.
- Fine-tuning sensitive models: When fine-tuning models where small quantization errors can compound, current scaling provides better fidelity.
Avoid current scaling when:
- The per-tensor amax overhead measurably reduces throughput (primarily on Hopper GPUs).
- Delayed scaling already produces satisfactory training quality.
Theoretical Basis
Scaling Factor Computation
The scaling factor for current scaling is computed as:
scale = FP8_MAX / amax_current
where:
- FP8_MAX is the maximum representable value in the target FP8 format (448 for E4M3, 57344 for E5M2).
- amax_current is the absolute maximum value of the tensor being quantized, computed in the current iteration.
There is no history buffer and no margin parameter. The scaling factor is derived directly from the current tensor.
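A worked instance of the formula, using an illustrative amax of 3.5 (chosen so the arithmetic is exact) and the two format maxima given above:

```python
# Scale computation for both FP8 formats; amax_current is assumed
# for illustration, not taken from any real workload.
E4M3_MAX = 448.0    # max representable value in E4M3
E5M2_MAX = 57344.0  # max representable value in E5M2

amax_current = 3.5  # hypothetical amax from the current iteration

scale_e4m3 = E4M3_MAX / amax_current  # 448 / 3.5 = 128
scale_e5m2 = E5M2_MAX / amax_current  # 57344 / 3.5 = 16384
```

The same tensor gets a much larger scale under E5M2 because that format trades mantissa bits for a wider exponent range.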
Precision Advantage
Current scaling achieves higher effective precision than delayed scaling because:
- No staleness: The scaling factor exactly matches the current tensor distribution. With delayed scaling, the tensor distribution may have shifted since the amax was recorded, causing either overflow (if values grew) or underutilization of the FP8 range (if values shrank).
- Tighter dynamic range mapping: Without a history buffer applying a conservative "max over history" policy, the scaling factor can more tightly map the tensor's actual range into the FP8 representable range.
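The staleness effect can be shown numerically (a toy comparison with assumed values, not a simulation of TE's delayed-scaling recipe): if a tensor's amax grows from 2.0 to 4.0 between iterations, a scale derived from the old amax pushes the new maximum past FP8_MAX and forces saturation, while the current scale maps it exactly to the edge of the range.

```python
FP8_MAX = 448.0  # E4M3 maximum magnitude

amax_prev, amax_now = 2.0, 4.0  # distribution grew between iterations

delayed_scale = FP8_MAX / amax_prev  # stale: based on last iteration
current_scale = FP8_MAX / amax_now   # fresh: based on this iteration

# The stale scale overflows the FP8 range on the new maximum:
overflows = amax_now * delayed_scale > FP8_MAX   # 4.0 * 224 = 896 > 448
fits      = amax_now * current_scale <= FP8_MAX  # 4.0 * 112 = 448
```

Had the distribution shrunk instead, the stale scale would not overflow but would leave part of the FP8 range unused, wasting precision in the opposite direction.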
Power-of-2 Scales
Current scaling optionally supports power-of-2 scaling factors, where the computed scale is rounded to the nearest power of 2. This is beneficial because:
- Power-of-2 multiplication can be implemented as a bit shift, which is faster on some hardware paths.
- It is exact: multiplying by a power of 2 changes only the floating-point exponent, so applying and later inverting the scale introduces no mantissa rounding error.
- The cost is at most a factor-of-2 reduction in effective dynamic range utilization, which is typically acceptable.
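A power-of-2 variant can be sketched as follows (an assumption-laden illustration: this rounds the scale down to a power of 2, which guarantees no overflow; TE's actual rounding policy may differ):

```python
import math

FP8_MAX = 448.0  # E4M3 maximum magnitude

def pow2_scale(amax):
    """Round the current scale down to a power of 2 (sketch only).

    Flooring the exponent guarantees amax * scale <= FP8_MAX, at the
    cost of using as little as half of the representable range.
    """
    exact = FP8_MAX / amax
    return 2.0 ** math.floor(math.log2(exact))

s = pow2_scale(3.0)  # exact scale would be ~149.33; floored to 128
# 3.0 * 128 = 384 <= 448: no overflow, range utilization 384/448
```

This matches the factor-of-2 bound above: the floored scale is at least half the exact scale, so at least half of the FP8 range is always utilized.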
Comparison with Delayed Scaling
| Aspect | Current Scaling | Delayed Scaling |
|---|---|---|
| Amax Source | Current tensor (this iteration) | Historical (previous iterations) |
| Staleness | None | One-step lag |
| Overhead | Extra amax reduction per tensor | None beyond normal forward pass |
| History Buffer | Not required | Required (configurable length) |
| Margin Parameter | Not applicable | Configurable |
| Best For | Accuracy-sensitive workloads, Blackwell+ GPUs | Throughput-sensitive workloads, Hopper GPUs |
Related Pages
- Implementation:NVIDIA_TransformerEngine_Float8CurrentScaling_Recipe -- The concrete recipe class implementing current scaling configuration.
- Principle:NVIDIA_TransformerEngine_FP8_Quantization -- The parent principle describing FP8 quantization in TransformerEngine.
- Principle:NVIDIA_TransformerEngine_FP8_Delayed_Scaling -- The alternative scaling strategy using historical amax values.