# Heuristic: NVIDIA TransformerEngine FP8 Checkpoint Compatibility
| Knowledge Sources | |
|---|---|
| Domains | Debugging, LLMs, Checkpointing |
| Last Updated | 2026-02-07 21:00 GMT |
## Overview
A guide to loading FP8 checkpoints across TransformerEngine (TE) versions (1.5 through 1.12+), covering how the location of FP8 metadata keys changed between releases and what remapping each migration path requires.
## Description
TransformerEngine's FP8 attention metadata (scaling factors and amax histories) is stored in checkpoint state dicts under `_extra_state` keys. The location of these keys has shifted across versions as FP8 attention support expanded. Checkpoints created with TE 1.6-1.7 store metadata at `core_attention.fused_attention._extra_state`, while TE >= 1.8 stores it at `core_attention._extra_state`. TE >= 1.12 can automatically load from both locations, but intermediate versions require manual key remapping.
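Which layout a given checkpoint uses can be detected directly from its key suffixes. A minimal sketch, assuming `state_dict` is the checkpoint's plain key-to-tensor mapping (the return labels are illustrative, not a TE API):

```python
# Suffixes under which TransformerEngine stores FP8 attention metadata.
OLD_SUFFIX = "core_attention.fused_attention._extra_state"  # TE 1.6-1.7
NEW_SUFFIX = "core_attention._extra_state"                  # TE >= 1.8

def detect_fp8_layout(state_dict):
    """Classify a checkpoint's FP8 metadata layout from its key suffixes."""
    if any(k.endswith(OLD_SUFFIX) for k in state_dict):
        return "te_1.6-1.7"
    if any(k.endswith(NEW_SUFFIX) for k in state_dict):
        return "te_1.8+"
    return "no_fp8_metadata"  # TE <= 1.5, or FP8 was not enabled
```

Matching on suffixes rather than full keys keeps the check independent of per-layer prefixes such as `decoder.layers.0.self_attention.`.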
## Usage
Use this heuristic when loading pre-trained checkpoints that were saved with a different TransformerEngine version, or when encountering unexpected key errors during checkpoint loading. This is especially relevant when upgrading TE in production training pipelines.
## The Insight (Rule of Thumb)
- Action: Check the TE version that created the checkpoint and apply the appropriate state dict key mapping.
- Value:
  - TE <= 1.5 checkpoints: No FP8 metadata. TE initializes to defaults (1s for scales, 0s for amaxes). Safe to load.
  - TE 1.6-1.7 checkpoints on TE >= 1.8, < 1.12: Manual key remapping required (see below).
  - TE 1.6-1.7 checkpoints on TE >= 1.12: Automatic migration. No action needed.
  - TE >= 1.8 checkpoints: Load normally on TE >= 1.8.
- Trade-off: Missing FP8 metadata means the model starts with default scaling factors. This causes a brief accuracy warmup period (a few iterations) but is not harmful.
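The rules above can be condensed into a small dispatch helper. A sketch, with versions given as `(major, minor)` tuples and action names chosen here for illustration:

```python
def migration_action(ckpt_te, runtime_te):
    """Action needed to load a checkpoint saved by TE version ckpt_te
    on runtime TE version runtime_te (both (major, minor) tuples).
    Assumes runtime_te >= (1, 8); action names are illustrative."""
    if ckpt_te <= (1, 5):
        return "load_with_defaults"  # no FP8 metadata; TE fills in defaults
    if ckpt_te <= (1, 7):
        # Old key location: TE >= 1.12 migrates automatically,
        # earlier runtimes need a manual key remap.
        return "auto_migrate" if runtime_te >= (1, 12) else "manual_remap"
    return "direct"  # >= 1.8 layout loads as-is on >= 1.8 runtimes
```

Tuple comparison gives the right ordering for these two-component versions, e.g. `(1, 7) < (1, 12)`.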
## Migration Code
For loading TE 1.6-1.7 checkpoints on TE 1.8-1.11:

```python
# Remap the FP8 metadata key for each attention module in the checkpoint.
# Keys carry per-layer prefixes (e.g. "decoder.layers.0.self_attention."),
# so match on the suffix rather than a fixed key.
old_suffix = "core_attention.fused_attention._extra_state"
new_suffix = "core_attention._extra_state"
for key in list(state_dict.keys()):
    if key.endswith(old_suffix):
        state_dict[key[: -len(old_suffix)] + new_suffix] = state_dict.pop(key)
```
## Reasoning
The key location change reflects the evolution of FP8 attention in TransformerEngine. TE 1.6 introduced FP8 attention with a single fused backend, storing metadata under the `fused_attention` submodule. As multiple backends were added (FlashAttention, cuDNN, unfused), the metadata was moved to the `core_attention` level to be backend-agnostic. TE 1.12 added backward-compatible loading to automatically detect and handle old key locations.
## Code Evidence
From `docs/faq.rst` (FP8 checkpoint compatibility table):
| TE Version | Save Location | Loads From |
|---|---|---|
| <= 1.5 | No FP8 metadata | N/A |
| 1.6-1.7 | `core_attention.fused_attention._extra_state` | <= 1.5 (defaults), 1.6-1.7 (direct) |
| 1.8-1.11 | `core_attention._extra_state` | <= 1.5 (defaults), 1.6-1.7 (manual remap), >= 1.8 (direct) |
| >= 1.12 | `core_attention._extra_state` | <= 1.5 (defaults), >= 1.6 (automatic) |