
Principle:NVIDIA TransformerEngine FP8 Current Scaling

From Leeroopedia


Field | Value
Page Type | Principle
Repository | NVIDIA TransformerEngine
Domains | Deep_Learning, Quantization
Sources | TransformerEngine, FP8 Formats for Deep Learning
Implemented By | Implementation:NVIDIA_TransformerEngine_Float8CurrentScaling_Recipe

Overview

Computing FP8 scaling factors from the current iteration's absolute maximum for more precise quantization.

Description

Current scaling computes the absolute maximum (amax) of each tensor in the current forward pass and uses it to derive the scaling factor immediately. This yields more accurate scaling at the cost of one additional reduction operation per tensor per iteration.

Unlike delayed scaling, which derives scaling factors from a history buffer of past amax values, current scaling operates on the live tensor data. The process for each quantized tensor is:

  1. Compute the absolute maximum (amax) of the tensor.
  2. Derive the scaling factor: scale = FP8_MAX / amax.
  3. Apply the scaling factor and quantize the tensor to FP8.

This eliminates the one-step staleness inherent in delayed scaling, ensuring that the scaling factor is always optimally matched to the current tensor distribution. The trade-off is the additional overhead of the amax reduction operation, which requires a full pass over the tensor data before quantization can proceed.
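The three steps above can be sketched in a few lines of NumPy. This is a toy model, not TransformerEngine's implementation: `quantize_current_scaling` is a hypothetical helper, and clipping to the representable range stands in for an actual hardware FP8 cast.

```python
import numpy as np

E4M3_MAX = 448.0  # largest representable value in FP8 E4M3

def quantize_current_scaling(x, fp8_max=E4M3_MAX):
    # 1. Compute the absolute maximum of the live tensor.
    amax = np.abs(x).max()
    # 2. Derive the scaling factor so that amax maps onto the FP8 max.
    scale = fp8_max / amax if amax > 0 else 1.0
    # 3. Apply the scale and quantize; clipping stands in for the FP8 cast.
    x_fp8 = np.clip(x * scale, -fp8_max, fp8_max)
    return x_fp8, scale

x = np.array([1.0, -2.0, 0.5])
x_fp8, scale = quantize_current_scaling(x)  # scale = 448 / 2 = 224
```

Because the scale is derived from the very tensor being quantized, the largest element always lands exactly on the FP8 maximum, with none of the representable range wasted.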

Usage

Use current scaling when:

  • Training accuracy is more important than throughput: Current scaling produces tighter quantization with less wasted dynamic range.
  • Delayed scaling produces unstable training: If the model exhibits loss spikes or divergence with delayed scaling, current scaling can provide more stable behavior.
  • Running on Blackwell+ GPUs: On newer GPU architectures, the overhead of per-tensor amax computation is minimal due to hardware improvements, making current scaling the preferred default.
  • Fine-tuning sensitive models: When fine-tuning models where small quantization errors can compound, current scaling provides better fidelity.

Avoid current scaling when:

  • The per-tensor amax overhead measurably reduces throughput (primarily on Hopper GPUs).
  • Delayed scaling already produces satisfactory training quality.

Theoretical Basis

Scaling Factor Computation

The scaling factor for current scaling is computed as:

scale = FP8_MAX / amax_current

where:

  • FP8_MAX is the maximum representable value in the target FP8 format (448 for E4M3, 57344 for E5M2).
  • amax_current is the absolute maximum value of the tensor being quantized, computed in the current iteration.

There is no history buffer and no margin parameter. The scaling factor is derived directly from the current tensor.
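A worked instance of the formula, using an arbitrary example amax of 3.5:

```python
E4M3_MAX = 448.0    # max representable value in E4M3
E5M2_MAX = 57344.0  # max representable value in E5M2

amax = 3.5  # absolute maximum observed in the current iteration

scale_e4m3 = E4M3_MAX / amax  # 448 / 3.5 = 128.0
scale_e5m2 = E5M2_MAX / amax  # 57344 / 3.5 = 16384.0
```

In practice an amax of exactly zero needs a guard (for example, falling back to a scale of 1), since the division is otherwise undefined.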

Precision Advantage

Current scaling achieves higher effective precision than delayed scaling because:

  • No staleness: The scaling factor exactly matches the current tensor distribution. With delayed scaling, the tensor distribution may have shifted since the amax was recorded, causing either overflow (if values grew) or underutilization of the FP8 range (if values shrank).
  • Tighter dynamic range mapping: Without a history buffer applying a conservative "max over history" policy, the scaling factor can more tightly map the tensor's actual range into the FP8 representable range.
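The staleness effect is easy to demonstrate numerically. In the sketch below (illustrative names; clipping again stands in for the FP8 cast, and mantissa rounding is ignored so only range error remains), a tensor whose values grew 4x since the previous iteration is quantized once with the stale scale and once with the current one:

```python
import numpy as np

FP8_MAX = 448.0  # E4M3 maximum

def fake_fp8_roundtrip(x, scale):
    # Scale, clip to the representable range, and unscale; only
    # clipping (range) error is modeled here.
    return np.clip(x * scale, -FP8_MAX, FP8_MAX) / scale

prev = np.array([1.0, -0.5, 0.25])  # tensor at iteration t-1
curr = prev * 4.0                   # values grew at iteration t

stale_scale = FP8_MAX / np.abs(prev).max()  # delayed: from the old amax
fresh_scale = FP8_MAX / np.abs(curr).max()  # current: from the live amax

err_stale = np.abs(fake_fp8_roundtrip(curr, stale_scale) - curr).max()  # 3.0
err_fresh = np.abs(fake_fp8_roundtrip(curr, fresh_scale) - curr).max()  # 0.0
```

With the stale scale, every element above the old amax is clipped at the FP8 maximum; the current scale maps the grown range exactly, incurring no clipping error.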

Power-of-2 Scales

Current scaling optionally supports power-of-2 scaling factors, where the computed scale is rounded to the nearest power of 2. This is beneficial because:

  • Power-of-2 multiplication can be implemented as a bit shift, which is faster on some hardware paths.
  • Multiplying by a power of 2 changes only a value's floating-point exponent, so applying the scale introduces no mantissa rounding error.
  • The cost is at most a factor-of-2 reduction in effective dynamic range utilization, which is typically acceptable.
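A minimal sketch of one plausible rounding policy: rounding the scale down to a power of 2, an assumption chosen here so that amax still fits in the representable range (`pow2_scale` is a hypothetical helper, not the library's API).

```python
import math

E4M3_MAX = 448.0

def pow2_scale(amax, fp8_max=E4M3_MAX):
    raw = fp8_max / amax
    # Round down to the nearest power of 2 so amax * scale <= fp8_max.
    return 2.0 ** math.floor(math.log2(raw))

scale = pow2_scale(3.0)          # raw scale ~149.3, rounded down to 128.0
assert 3.0 * scale <= E4M3_MAX   # amax still maps inside the FP8 range
```

Rounding down rather than to the nearest power avoids the case where the rounded-up scale pushes amax past the FP8 maximum and forces clipping.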

Comparison with Delayed Scaling

Aspect | Current Scaling | Delayed Scaling
Amax source | Current tensor (this iteration) | Historical (previous iterations)
Staleness | None | One-step lag
Overhead | Extra amax reduction per tensor | None beyond the normal forward pass
History buffer | Not required | Required (configurable length)
Margin parameter | Not applicable | Configurable
Best for | Accuracy-sensitive workloads; Blackwell+ GPUs | Throughput-sensitive workloads; Hopper GPUs
