
Heuristic:NVIDIA TransformerEngine FP8 Checkpoint Compatibility

From Leeroopedia

Knowledge Sources
Domains Debugging, LLMs, Checkpointing
Last Updated 2026-02-07 21:00 GMT

Overview

FP8 checkpoint migration guide for loading checkpoints across different TransformerEngine versions (1.5 through 1.12+), covering metadata key location changes.

Description

TransformerEngine's FP8 attention metadata (scaling factors and amax histories) is stored in checkpoint state dicts under `_extra_state` keys. The location of these keys has shifted across versions as FP8 attention support expanded. Checkpoints created with TE 1.6-1.7 store metadata at `core_attention.fused_attention._extra_state`, while TE >= 1.8 stores it at `core_attention._extra_state`. TE >= 1.12 can automatically load from both locations, but intermediate versions require manual key remapping.
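Before deciding on a migration path, it helps to check which key layout a checkpoint actually uses. A minimal sketch, assuming a plain state dict with dotted layer prefixes (the two suffix strings are the locations described above; the helper name and prefixes are hypothetical):

```python
# Hypothetical helper: detect which FP8 metadata key layout a
# checkpoint state dict uses. Only the two suffix strings come from
# the compatibility description; everything else is illustrative.

OLD_SUFFIX = "core_attention.fused_attention._extra_state"  # TE 1.6-1.7
NEW_SUFFIX = "core_attention._extra_state"                  # TE >= 1.8

def detect_fp8_key_layout(state_dict):
    """Return 'old', 'new', or 'none' depending on where (or whether)
    FP8 attention metadata appears in the state dict."""
    # Check the old layout first: old keys also contain "_extra_state",
    # but their suffix differs from the new one, so the order is safe.
    if any(k.endswith(OLD_SUFFIX) for k in state_dict):
        return "old"
    if any(k.endswith(NEW_SUFFIX) for k in state_dict):
        return "new"
    return "none"  # e.g. a TE <= 1.5 checkpoint with no FP8 metadata
```

The result tells you which row of the compatibility table applies, independent of any version string stored alongside the checkpoint.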

Usage

Use this heuristic when loading pre-trained checkpoints that were saved with a different TransformerEngine version, or when encountering unexpected key errors during checkpoint loading. This is especially relevant when upgrading TE in production training pipelines.

The Insight (Rule of Thumb)

  • Action: Check the TE version that created the checkpoint and apply the appropriate state dict key mapping.
  • Value:
    • TE <= 1.5 checkpoints: No FP8 metadata. TE initializes to defaults (1s for scales, 0s for amaxes). Safe to load.
    • TE 1.6-1.7 checkpoints on TE >= 1.8, < 1.12: Manual key remapping required (see below).
    • TE 1.6-1.7 checkpoints on TE >= 1.12: Automatic migration. No action needed.
    • TE >= 1.8 checkpoints: Loads normally on TE >= 1.8.
  • Trade-off: Missing FP8 metadata means the model starts with default scaling factors. This causes a brief accuracy warmup period (a few iterations) but is not harmful.
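The rules above can be collapsed into a small dispatch table. A sketch, assuming versions are compared as `(major, minor)` tuples (the function name and action strings are hypothetical labels, not TransformerEngine API):

```python
# Hypothetical dispatch: map (saved-with, running) TE versions to the
# migration action listed in the rule of thumb above.

def fp8_checkpoint_action(saved_te, running_te):
    """Return the action for loading a checkpoint saved with
    `saved_te` on a runtime with `running_te` (both (major, minor))."""
    if saved_te <= (1, 5):
        return "load-with-defaults"  # no FP8 metadata; TE initializes defaults
    if saved_te <= (1, 7):
        if running_te >= (1, 12):
            return "automatic"       # TE >= 1.12 migrates old keys itself
        if running_te >= (1, 8):
            return "manual-remap"    # remap keys before loading (see below)
        return "direct"              # same key layout, loads as-is
    return "direct"                  # TE >= 1.8 layout on TE >= 1.8
```

For example, a TE 1.6 checkpoint loaded on TE 1.10 falls into the `manual-remap` case, while the same checkpoint on TE 1.12 needs no action.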

Migration Code

For loading TE 1.6-1.7 checkpoints on TE 1.8-1.11:

# Remap the FP8 metadata key for MultiheadAttention modules
state_dict["core_attention._extra_state"] = \
    state_dict["core_attention.fused_attention._extra_state"]
del state_dict["core_attention.fused_attention._extra_state"]
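The snippet above remaps a single module's key. In a full transformer checkpoint there is one such key per attention module, each under a layer-specific prefix, so in practice the remap is applied across the whole state dict. A sketch, assuming a flat dict with dotted keys (the prefixes in real checkpoints depend on the model; only the two suffix strings come from the guide above):

```python
# Hypothetical generalization of the manual remap: move every
# old-style FP8 metadata key (TE 1.6-1.7 location) to the TE >= 1.8
# location, preserving each key's layer prefix.

OLD = "core_attention.fused_attention._extra_state"
NEW = "core_attention._extra_state"

def remap_fp8_extra_state(state_dict):
    """Remap all FP8 attention metadata keys in place and return the dict."""
    # Snapshot matching keys first so we can mutate the dict safely.
    for key in [k for k in state_dict if k.endswith(OLD)]:
        prefix = key[: -len(OLD)]          # e.g. "decoder.layers.0.self_attention."
        state_dict[prefix + NEW] = state_dict.pop(key)
    return state_dict
```

Run this once on the loaded state dict before calling the model's `load_state_dict`; keys that do not match the old suffix are left untouched.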

Reasoning

The key location change reflects the evolution of FP8 attention in TransformerEngine. TE 1.6 introduced FP8 attention with a single fused backend, storing metadata under the `fused_attention` submodule. As multiple backends were added (FlashAttention, cuDNN, unfused), the metadata was moved to the `core_attention` level to be backend-agnostic. TE 1.12 added backward-compatible loading to automatically detect and handle old key locations.

Code Evidence

From `docs/faq.rst` (FP8 checkpoint compatibility table):

TE Version | Save Location                                 | Loads From
<= 1.5     | No FP8 metadata                               | N/A
1.6-1.7    | `core_attention.fused_attention._extra_state` | <= 1.5 (defaults), 1.6-1.7 (direct)
1.8-1.11   | `core_attention._extra_state`                 | <= 1.5 (defaults), 1.6-1.7 (manual remap), >= 1.8 (direct)
>= 1.12    | `core_attention._extra_state`                 | <= 1.5 (defaults), >= 1.6 (automatic)
