Principle:NVIDIA TransformerEngine HF To TE Weight Mapping

From Leeroopedia
Revision as of 18:23, 16 February 2026 by Admin (talk | contribs) (Auto-imported from principles/NVIDIA_TransformerEngine_HF_To_TE_Weight_Mapping.md)

Overview

Mapping pretrained HuggingFace model weights to TransformerEngine's fused module parameter layout.

Description

TransformerEngine's fused modules (LayerNormLinear, LayerNormMLP) group parameters that HuggingFace stores in separate submodules into a single module, and in the case of the MLP gate and up projections into a single tensor, so that the operations can run as fused kernels. When loading a pretrained HuggingFace LLaMA checkpoint into a TE-accelerated model, the weights must be mapped correctly from HF's parameter naming and layout conventions to TE's fused parameter structure.

The key mapping challenges are:

  • Fused LayerNorm + QKV: HF stores input_layernorm.weight and separate q_proj.weight, k_proj.weight, v_proj.weight as independent parameters. TE's layernorm_qkv module holds the layer norm weight together with the Q/K/V projection weights, exposed as layer_norm_weight, query_weight, key_weight, and value_weight.
  • Fused LayerNorm + MLP: HF stores post_attention_layernorm.weight, gate_proj.weight, up_proj.weight, and down_proj.weight separately. TE's layernorm_mlp module holds the layer norm weight, a single fc1_weight tensor that combines the gate and up projections, and fc2_weight for the down projection.
  • Concatenated Gate/Up Projections: Within fc1_weight, the gate projection occupies the first intermediate_size rows and the up projection occupies the remaining intermediate_size rows.
  • Different Naming Conventions: Parameter paths differ even where the tensor is copied unchanged. For example, HF uses self_attn.q_proj.weight while TE uses self_attention.layernorm_qkv.query_weight.

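Non-decoder-layer parameters (token embeddings, final layer norm, LM head) use the same names in both HF and TE and can be loaded directly via load_state_dict(strict=False). The strict=False flag is needed because the HF decoder-layer keys in the same checkpoint have no counterpart in the TE model; they are skipped by that call and handled by the explicit mapping instead.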

Theoretical Basis

The complete weight mapping for each decoder layer (model.layers.N.) is:

| HF Parameter Name | TE Parameter Name | Notes |
|---|---|---|
| input_layernorm.weight | self_attention.layernorm_qkv.layer_norm_weight | RMSNorm weight before self-attention |
| self_attn.q_proj.weight | self_attention.layernorm_qkv.query_weight | Query projection weight |
| self_attn.k_proj.weight | self_attention.layernorm_qkv.key_weight | Key projection weight |
| self_attn.v_proj.weight | self_attention.layernorm_qkv.value_weight | Value projection weight |
| self_attn.o_proj.weight | self_attention.proj.weight | Output projection weight |
| post_attention_layernorm.weight | layernorm_mlp.layer_norm_weight | RMSNorm weight before MLP |
| mlp.gate_proj.weight | layernorm_mlp.fc1_weight[:intermediate_size] | First half of fused fc1 (gate projection) |
| mlp.up_proj.weight | layernorm_mlp.fc1_weight[intermediate_size:] | Second half of fused fc1 (up projection) |
| mlp.down_proj.weight | layernorm_mlp.fc2_weight | Down projection weight |
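
The name translation in the table above can be sketched as a plain lookup. This is an illustrative sketch only: translate_key and HF_TO_TE_SUFFIX are hypothetical names that do not come from TransformerEngine, and the function returns only the destination key, so the caller still has to decide which half of fc1_weight a gate or up projection goes into.

```python
import re

# Per-layer suffix mapping, copied from the table above.
# gate_proj and up_proj both point at fc1_weight; the row slice
# within that tensor has to be chosen separately by the caller.
HF_TO_TE_SUFFIX = {
    "input_layernorm.weight": "self_attention.layernorm_qkv.layer_norm_weight",
    "self_attn.q_proj.weight": "self_attention.layernorm_qkv.query_weight",
    "self_attn.k_proj.weight": "self_attention.layernorm_qkv.key_weight",
    "self_attn.v_proj.weight": "self_attention.layernorm_qkv.value_weight",
    "self_attn.o_proj.weight": "self_attention.proj.weight",
    "post_attention_layernorm.weight": "layernorm_mlp.layer_norm_weight",
    "mlp.gate_proj.weight": "layernorm_mlp.fc1_weight",
    "mlp.up_proj.weight": "layernorm_mlp.fc1_weight",
    "mlp.down_proj.weight": "layernorm_mlp.fc2_weight",
}

# Matches "model.layers.<N>." followed by the per-layer suffix.
LAYER_KEY = re.compile(r"^(model\.layers\.\d+\.)(.+)$")


def translate_key(hf_key):
    """Return the TE key for an HF key (hypothetical helper).

    Keys outside model.layers.N. are returned unchanged, since the
    text says those names are identical in HF and TE. Per-layer keys
    that the table does not cover return None.
    """
    match = LAYER_KEY.match(hf_key)
    if match is None:
        return hf_key
    prefix, suffix = match.groups()
    te_suffix = HF_TO_TE_SUFFIX.get(suffix)
    if te_suffix is None:
        return None
    return prefix + te_suffix


print(translate_key("model.layers.3.self_attn.q_proj.weight"))
print(translate_key("model.embed_tokens.weight"))
```

Returning None for uncovered per-layer keys, rather than raising, is one possible design choice; it lets a caller log and skip such entries while iterating over a shard.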

The fc1_weight concatenation layout is critical:

# TE's fc1_weight has shape (2 * intermediate_size, hidden_size)
# First intermediate_size rows: gate_proj (SiLU activation path)
# Last intermediate_size rows: up_proj (linear path)
fc1_weight[:intermediate_size] = gate_proj.weight  # SwiGLU gate
fc1_weight[intermediate_size:] = up_proj.weight     # SwiGLU up

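This layout matches TE's SwiGLU implementation, which splits the fc1 output (equivalently, the rows of fc1_weight) at intermediate_size to compute SiLU(gate(x)) * up(x). Swapping the two halves would apply SiLU to the up projection instead and silently produce wrong activations.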
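
To see why the row order matters, the following toy check compares the unfused computation with the fused layout, using nested Python lists as stand-ins for tensors. It is a sketch of the arithmetic only and does not call TransformerEngine; the sizes and weight values are made up for illustration.

```python
import math


def matvec(weight, x):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def silu(z):
    """SiLU activation: z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))


def swiglu_from_fused(fc1_weight, x, intermediate_size):
    """Split the fc1 output at intermediate_size: first half gate, second half up."""
    out = matvec(fc1_weight, x)
    gate_out, up_out = out[:intermediate_size], out[intermediate_size:]
    return [silu(g) * u for g, u in zip(gate_out, up_out)]


# Toy sizes and weights (illustrative values only).
hidden_size, intermediate_size = 2, 3
gate_proj = [[0.5, -1.0], [1.0, 2.0], [-0.5, 0.25]]
up_proj = [[1.0, 0.0], [0.0, 1.0], [2.0, -1.0]]
x = [1.0, 2.0]

# Unfused reference: SiLU(gate(x)) * up(x).
reference = [
    silu(g) * u for g, u in zip(matvec(gate_proj, x), matvec(up_proj, x))
]

# Fused layout: gate rows first, then up rows.
fc1_weight = gate_proj + up_proj
fused = swiglu_from_fused(fc1_weight, x, intermediate_size)

# Wrong layout: up rows first, then gate rows.
swapped = swiglu_from_fused(up_proj + gate_proj, x, intermediate_size)

print(fused == reference)
print(swapped == reference)
```

With the gate rows first the fused result equals the reference; with the halves swapped it does not, even though the tensor shape is identical in both cases.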

Important: The gate and up projection weights may reside in different checkpoint shards, so the mapping function must handle partial loading: each weight is mapped independently, guarded by a check that its key exists in the current shard.
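
A minimal sketch of that per-weight existence check, restricted to the two fc1 halves, might look like the following. The function name map_mlp_shard is hypothetical, and nested lists stand in for tensors so the example runs without any dependencies; this is not the actual TE loading code.

```python
def map_mlp_shard(te_state, hf_shard, layer_idx, intermediate_size):
    """Copy the gate/up weights found in hf_shard into te_state, in place.

    Hypothetical helper: each weight is handled independently, so a
    shard holding only one of the two leaves the other half untouched.
    """
    hf_prefix = f"model.layers.{layer_idx}.mlp."
    fc1_key = f"model.layers.{layer_idx}.layernorm_mlp.fc1_weight"

    gate_key = hf_prefix + "gate_proj.weight"
    if gate_key in hf_shard:
        # Gate projection fills the first intermediate_size rows.
        te_state[fc1_key][:intermediate_size] = hf_shard[gate_key]

    up_key = hf_prefix + "up_proj.weight"
    if up_key in hf_shard:
        # Up projection fills the remaining rows.
        te_state[fc1_key][intermediate_size:] = hf_shard[up_key]


# Toy setup: intermediate_size = 2, hidden_size = 2, layer 0 only.
fc1_key = "model.layers.0.layernorm_mlp.fc1_weight"
te_state = {fc1_key: [[0.0, 0.0] for _ in range(4)]}

# The two halves arrive in different shards.
shard_a = {"model.layers.0.mlp.gate_proj.weight": [[1.0, 1.0], [2.0, 2.0]]}
shard_b = {"model.layers.0.mlp.up_proj.weight": [[3.0, 3.0], [4.0, 4.0]]}

# Process one shard at a time; order does not matter.
for shard in (shard_b, shard_a):
    map_mlp_shard(te_state, shard, layer_idx=0, intermediate_size=2)

print(te_state[fc1_key])
```

Because each branch writes only its own row range, processing the shards in either order produces the same fused tensor.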

Usage

Use this principle when loading pretrained HuggingFace LLaMA checkpoints into TE-accelerated models. The weight mapping is applied:

  • During TELlamaForCausalLM.from_pretrained_local() for each checkpoint shard
  • When converting HF checkpoints to TE format for deployment
  • When debugging weight loading issues between HF and TE model formats

The mapping function operates in-place on the TE model's state dict and handles sharded checkpoints by processing one shard at a time.
