# Implementation:NVIDIA TransformerEngine DelayedScaling Recipe
| Field | Value |
|---|---|
| Page Type | Implementation |
| Repository | NVIDIA TransformerEngine |
| Source File | `transformer_engine/common/recipe/__init__.py` (L121-220) |
| Import | `from transformer_engine.common.recipe import DelayedScaling` |
| Implements | Principle:NVIDIA_TransformerEngine_FP8_Delayed_Scaling |
| Requires Environment | Environment:NVIDIA_TransformerEngine_CUDA_Toolkit_Requirements |
## Overview
Concrete FP8 recipe configuration for delayed scaling provided by TransformerEngine.
## Description
`DelayedScaling` is a frozen Python dataclass that configures the delayed scaling FP8 quantization strategy. It specifies the FP8 format (HYBRID by default), the amax history length, the amax computation algorithm, and a safety margin. This recipe is passed to `te.autocast(recipe=...)` to enable FP8 training with delayed scaling.
As a frozen dataclass, DelayedScaling instances are immutable after construction. All configuration must be provided at instantiation time. The class inherits from Recipe, the base class for all TransformerEngine FP8 recipes.
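The frozen-dataclass semantics described above can be demonstrated with a small, self-contained sketch. A local stand-in class is used here (hypothetical, mirroring two of `DelayedScaling`'s fields) so the snippet runs without TransformerEngine installed; the real class behaves the same way because it is declared with `@dataclass(frozen=True)`:

```python
from dataclasses import dataclass, FrozenInstanceError

# Hypothetical stand-in mirroring DelayedScaling's frozen-dataclass behavior.
@dataclass(frozen=True)
class RecipeSketch:
    margin: int = 0
    amax_history_len: int = 1024

r = RecipeSketch(amax_history_len=512)

# All configuration must be supplied at construction time; later
# assignment to any field raises FrozenInstanceError.
try:
    r.margin = 2
except FrozenInstanceError:
    print("recipe is immutable after construction")
```

To change a setting, construct a new instance (e.g. via `dataclasses.replace`) rather than mutating an existing one.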
## Usage
Use DelayedScaling as the default recipe for FP8 training. It is the recommended starting point for most workloads:
- Instantiate with desired parameters (or use defaults for a standard configuration).
- Pass the instance to `te.autocast(recipe=recipe)`.
- The recipe is read by TE modules during their forward pass to determine scaling behavior.
## Code Reference
### Source Location
| Attribute | Detail |
|---|---|
| File | `transformer_engine/common/recipe/__init__.py` |
| Class | `DelayedScaling` |
| Lines | L121-220 |
| Base Class | `Recipe` |
### Signature
```python
@dataclass(frozen=True)
class DelayedScaling(Recipe):
    margin: int = 0
    fp8_format: Format = Format.HYBRID
    amax_history_len: int = 1024
    amax_compute_algo: str = "max"
    reduce_amax: bool = True
    fp8_dpa: bool = False
    fp8_mha: bool = False
```
### Key Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `margin` | `int` | `0` | Safety margin for scaling factor computation. The scaling factor is divided by 2^margin, reducing the effective FP8 range to provide headroom against overflow. |
| `fp8_format` | `Format` | `Format.HYBRID` | The FP8 format to use. HYBRID uses E4M3 for forward and E5M2 for backward. E4M3 and E5M2 use the same format for both passes. |
| `amax_history_len` | `int` | `1024` | Length of the amax history buffer. Larger values produce more stable but less responsive scaling factors. |
| `amax_compute_algo` | `str` | `"max"` | Algorithm to compute the effective amax from history. Options: `"max"` (maximum over history) or `"most_recent"` (last recorded value). |
| `reduce_amax` | `bool` | `True` | Whether to synchronize (all-reduce) amax values across distributed ranks. Set to `True` for data-parallel or tensor-parallel training. |
| `fp8_dpa` | `bool` | `False` | Enable FP8 execution for dot-product attention (the QK^T computation). |
| `fp8_mha` | `bool` | `False` | Enable FP8 execution for the full multi-head attention block, including the attention output projection. |
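To make the interplay of `margin`, `amax_history_len`, and `amax_compute_algo` concrete, here is a minimal sketch of how a delayed-scaling factor could be derived from these fields. It follows the formula implied by the parameter descriptions (scale = fp8_max / amax, then divided by 2^margin); the constant 448.0 is the largest E4M3 magnitude. This is an illustrative model, not TE's internal implementation:

```python
E4M3_MAX = 448.0  # largest representable E4M3 magnitude

def compute_scale(amax_history, amax_compute_algo="max", margin=0,
                  fp8_max=E4M3_MAX):
    """Sketch of delayed-scaling factor computation from recipe fields."""
    if amax_compute_algo == "max":
        amax = max(amax_history)      # maximum over the rolling history
    else:                             # "most_recent"
        amax = amax_history[-1]       # last recorded value
    # Dividing by 2**margin shrinks the scale, leaving headroom
    # below the FP8 maximum to guard against overflow.
    return fp8_max / (amax * 2.0 ** margin)

history = [1.5, 3.0, 2.2]             # rolling amax buffer (length <= amax_history_len)
print(compute_scale(history))                    # "max": uses amax = 3.0
print(compute_scale(history, "most_recent", 1))  # uses amax = 2.2, halved by margin=1
```

The `"max"` algorithm reacts to the largest value anywhere in the history window, so a single spike keeps the scale conservative until it ages out of the buffer; `"most_recent"` tracks the latest statistics more aggressively.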
## I/O Contract
### Input
| Input | Type | Description |
|---|---|---|
| `margin` | `int` | Scaling factor safety margin. |
| `fp8_format` | `Format` | FP8 format selection (HYBRID, E4M3, or E5M2). |
| `amax_history_len` | `int` | Size of the rolling amax history buffer. |
| `amax_compute_algo` | `str` | Algorithm for deriving the effective amax from history. |
| `reduce_amax` | `bool` | Whether to all-reduce amax across distributed ranks. |
| `fp8_dpa` | `bool` | Enable FP8 dot-product attention. |
| `fp8_mha` | `bool` | Enable FP8 multi-head attention. |
### Output
| Output | Type | Description |
|---|---|---|
| Recipe object | `DelayedScaling` | An immutable recipe instance passed to `te.autocast(recipe=...)`. Consumed by TE modules to configure FP8 scaling behavior during the forward and backward passes. |
## Usage Examples
### Default Configuration
```python
from transformer_engine.common.recipe import DelayedScaling

# All defaults: HYBRID format, history length 1024, "max" algorithm
recipe = DelayedScaling()
```
### Custom History and Algorithm
```python
from transformer_engine.common.recipe import DelayedScaling, Format

recipe = DelayedScaling(
    fp8_format=Format.HYBRID,
    amax_history_len=1024,
    amax_compute_algo="max",
)
```
### With Safety Margin
```python
from transformer_engine.common.recipe import DelayedScaling, Format

# Use margin=2 for extra headroom against overflow
recipe = DelayedScaling(
    fp8_format=Format.HYBRID,
    margin=2,
    amax_history_len=512,
    amax_compute_algo="most_recent",
)
```
### With FP8 Attention Enabled
```python
from transformer_engine.common.recipe import DelayedScaling

recipe = DelayedScaling(
    fp8_dpa=True,
    fp8_mha=True,
)
```
### Full Training Loop Example
```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import DelayedScaling, Format

recipe = DelayedScaling(
    fp8_format=Format.HYBRID,
    amax_history_len=1024,
    amax_compute_algo="max",
)

model = te.TransformerLayer(
    hidden_size=1024,
    ffn_hidden_size=4096,
    num_attention_heads=16,
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# dataloader and loss_fn are assumed to be defined elsewhere
for batch in dataloader:
    optimizer.zero_grad()
    with te.autocast(enabled=True, recipe=recipe):
        output = model(batch["input"])
        loss = loss_fn(output, batch["target"])
    loss.backward()
    optimizer.step()
```
## Related Pages
- Principle:NVIDIA_TransformerEngine_FP8_Delayed_Scaling -- The principle describing delayed scaling for FP8.
- Implementation:NVIDIA_TransformerEngine_TE_Autocast -- The context manager that consumes this recipe.
- Implementation:NVIDIA_TransformerEngine_Float8CurrentScaling_Recipe -- The alternative current scaling recipe.
- Environment:NVIDIA_TransformerEngine_CUDA_Toolkit_Requirements
- Environment:NVIDIA_TransformerEngine_GPU_Compute_Capability
- Heuristic:NVIDIA_TransformerEngine_FP8_Recipe_Auto_Selection