Principle:NVIDIA TransformerEngine FP8 Current Scaling
| Field | Value |
|---|---|
| Page Type | Principle |
| Repository | NVIDIA TransformerEngine |
| Domains | Deep_Learning, Quantization |
| Sources | TransformerEngine, FP8 Formats for Deep Learning |
| Implemented By | Implementation:NVIDIA_TransformerEngine_Float8CurrentScaling_Recipe |
Overview
Computing FP8 scaling factors from the current iteration's absolute maximum for more precise quantization.
Description
Current scaling computes the amax of each tensor in the current forward pass and uses it to determine the scaling factor immediately. This provides more accurate scaling at the cost of an additional reduction operation per tensor per iteration.
Unlike delayed scaling, which derives scaling factors from a history buffer of past amax values, current scaling operates on the live tensor data. The process for each quantized tensor is:
- Compute the absolute maximum (amax) of the tensor.
- Derive the scaling factor: scale = FP8_MAX / amax.
- Apply the scaling factor and quantize the tensor to FP8.
This eliminates the one-step staleness inherent in delayed scaling, ensuring that the scaling factor is always optimally matched to the current tensor distribution. The trade-off is the additional overhead of the amax reduction operation, which requires a full pass over the tensor data before quantization can proceed.
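The three steps above can be sketched in plain NumPy (a minimal illustration of the scheme, not the TransformerEngine kernel; the actual cast to an FP8 dtype is replaced here by clipping to the E4M3 range):

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # maximum representable magnitude in E4M3

def current_scale_quantize(x):
    """Sketch of per-tensor current scaling (not the TE implementation).

    Returns the scaled tensor (ready for an FP8 cast) and the scale,
    so original values can be recovered as x_scaled / scale.
    """
    amax = np.abs(x).max()       # step 1: amax of the live tensor
    scale = FP8_E4M3_MAX / amax  # step 2: scale from the current amax
    # step 3: apply the scale; clipping stands in for the FP8 cast
    x_scaled = np.clip(x * scale, -FP8_E4M3_MAX, FP8_E4M3_MAX)
    return x_scaled, scale

x = np.array([0.5, -3.5, 2.0])
x_scaled, scale = current_scale_quantize(x)
# amax = 3.5, so scale = 448 / 3.5 = 128 and the largest-magnitude
# element maps exactly to -448, the edge of the E4M3 range
```

Because the scale is derived from this tensor's own amax, the representable range is fully used with no headroom wasted on stale history.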
Usage
Use current scaling when:
- Training accuracy is more important than throughput: Current scaling produces tighter quantization with less wasted dynamic range.
- Delayed scaling produces unstable training: If the model exhibits loss spikes or divergence with delayed scaling, current scaling can provide more stable behavior.
- Running on Blackwell+ GPUs: On newer GPU architectures, the overhead of per-tensor amax computation is minimal due to hardware improvements, making current scaling the preferred default.
- Fine-tuning sensitive models: When fine-tuning models where small quantization errors can compound, current scaling provides better fidelity.
Avoid current scaling when:
- The per-tensor amax overhead measurably reduces throughput (primarily on Hopper GPUs).
- Delayed scaling already produces satisfactory training quality.
Theoretical Basis
Scaling Factor Computation
The scaling factor for current scaling is computed as:
scale = FP8_MAX / amax_current
where:
- FP8_MAX is the maximum representable value in the target FP8 format (448 for E4M3, 57344 for E5M2).
- amax_current is the absolute maximum value of the tensor being quantized, computed in the current iteration.
There is no history buffer and no margin parameter. The scaling factor is derived directly from the current tensor.
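A worked instance of the formula, using an illustrative amax of 3.5 (chosen so the arithmetic is exact) and the two format maxima given above:

```python
# Scale computation for both FP8 formats; amax_current is assumed
# for illustration, not taken from any real workload.
E4M3_MAX = 448.0    # max representable value in E4M3
E5M2_MAX = 57344.0  # max representable value in E5M2

amax_current = 3.5  # hypothetical amax from the current iteration

scale_e4m3 = E4M3_MAX / amax_current  # 448 / 3.5 = 128
scale_e5m2 = E5M2_MAX / amax_current  # 57344 / 3.5 = 16384
```

The same tensor gets a much larger scale under E5M2 because that format trades mantissa bits for a wider exponent range.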
Precision Advantage
Current scaling achieves higher effective precision than delayed scaling because:
- No staleness: The scaling factor exactly matches the current tensor distribution. With delayed scaling, the tensor distribution may have shifted since the amax was recorded, causing either overflow (if values grew) or underutilization of the FP8 range (if values shrank).
- Tighter dynamic range mapping: Without a history buffer applying a conservative "max over history" policy, the scaling factor can more tightly map the tensor's actual range into the FP8 representable range.
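The staleness effect can be shown numerically (a toy comparison with assumed values, not a simulation of TE's delayed-scaling recipe): if a tensor's amax grows from 2.0 to 4.0 between iterations, a scale derived from the old amax pushes the new maximum past FP8_MAX and forces saturation, while the current scale maps it exactly to the edge of the range.

```python
FP8_MAX = 448.0  # E4M3 maximum magnitude

amax_prev, amax_now = 2.0, 4.0  # distribution grew between iterations

delayed_scale = FP8_MAX / amax_prev  # stale: based on last iteration
current_scale = FP8_MAX / amax_now   # fresh: based on this iteration

# The stale scale overflows the FP8 range on the new maximum:
overflows = amax_now * delayed_scale > FP8_MAX   # 4.0 * 224 = 896 > 448
fits      = amax_now * current_scale <= FP8_MAX  # 4.0 * 112 = 448
```

Had the distribution shrunk instead, the stale scale would not overflow but would leave part of the FP8 range unused, wasting precision in the opposite direction.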
Power-of-2 Scales
Current scaling optionally supports power-of-2 scaling factors, where the computed scale is rounded to the nearest power of 2. This is beneficial because:
- Power-of-2 multiplication can be implemented as a bit shift, which is faster on some hardware paths.
- It is exact: multiplying by a power of 2 changes only the floating-point exponent, so applying and later inverting the scale introduces no mantissa rounding error.
- The cost is at most a factor-of-2 reduction in effective dynamic range utilization, which is typically acceptable.
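A power-of-2 variant can be sketched as follows (an assumption-laden illustration: this rounds the scale down to a power of 2, which guarantees no overflow; TE's actual rounding policy may differ):

```python
import math

FP8_MAX = 448.0  # E4M3 maximum magnitude

def pow2_scale(amax):
    """Round the current scale down to a power of 2 (sketch only).

    Flooring the exponent guarantees amax * scale <= FP8_MAX, at the
    cost of using as little as half of the representable range.
    """
    exact = FP8_MAX / amax
    return 2.0 ** math.floor(math.log2(exact))

s = pow2_scale(3.0)  # exact scale would be ~149.33; floored to 128
# 3.0 * 128 = 384 <= 448: no overflow, range utilization 384/448
```

This matches the factor-of-2 bound above: the floored scale is at least half the exact scale, so at least half of the FP8 range is always utilized.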
Comparison with Delayed Scaling
| Aspect | Current Scaling | Delayed Scaling |
|---|---|---|
| Amax Source | Current tensor (this iteration) | Historical (previous iterations) |
| Staleness | None | One-step lag |
| Overhead | Extra amax reduction per tensor | None beyond normal forward pass |
| History Buffer | Not required | Required (configurable length) |
| Margin Parameter | Not applicable | Configurable |
| Best For | Accuracy-sensitive workloads, Blackwell+ GPUs | Throughput-sensitive workloads, Hopper GPUs |
Related Pages
- Implementation:NVIDIA_TransformerEngine_Float8CurrentScaling_Recipe -- The concrete recipe class implementing current scaling configuration.
- Principle:NVIDIA_TransformerEngine_FP8_Quantization -- The parent principle describing FP8 quantization in TransformerEngine.
- Principle:NVIDIA_TransformerEngine_FP8_Delayed_Scaling -- The alternative scaling strategy using historical amax values.