Principle:NVIDIA TransformerEngine HF To TE Weight Mapping

Overview

Mapping pretrained HuggingFace model weights to TransformerEngine's fused module parameter layout.

Description

TransformerEngine's fused modules (LayerNormLinear, LayerNormMLP) combine multiple HuggingFace parameters into single tensors for kernel fusion efficiency. When loading a pretrained HuggingFace LLaMA checkpoint into a TE-accelerated model, the weights must be correctly mapped from HF's parameter naming and layout conventions to TE's fused parameter structure.

The key mapping challenges are:

Fused LayerNorm + QKV: HF stores input_layernorm.weight and separate q_proj.weight, k_proj.weight, v_proj.weight as independent parameters. TE's layernorm_qkv module combines the layer norm weight with the Q/K/V projection weights.
Fused LayerNorm + MLP: HF stores post_attention_layernorm.weight, gate_proj.weight, up_proj.weight, and down_proj.weight separately. TE's layernorm_mlp module fuses the layer norm weight and combines gate and up projections into a single fc1_weight tensor.
Concatenated Gate/Up Projections: TE concatenates gate_proj and up_proj into a single fc1_weight tensor. The gate projection occupies the first intermediate_size rows, and the up projection occupies the remaining rows.
Different Naming Conventions: HF uses self_attn.q_proj.weight while TE uses self_attention.layernorm_qkv.query_weight.

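Non-decoder-layer parameters (token embeddings, final layer norm, LM head) use the same names in both HF and TE and can be loaded directly via load_state_dict(strict=False). The strict=False flag is needed because the HF decoder-layer keys have no same-named counterpart in the TE model and would otherwise raise a key-mismatch error.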
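A minimal sketch of how checkpoint keys could be partitioned before loading, assuming decoder-layer keys start with `model.layers.` as stated in this page. The helper name `split_checkpoint_keys` and the sample key list are illustrative assumptions, not part of TE or HF.

```python
# Prefix of per-layer keys, as used in this page ("model.layers.N.").
DECODER_PREFIX = "model.layers."


def split_checkpoint_keys(keys):
    """Partition key names into (decoder_layer_keys, shared_keys).

    Hypothetical helper: shared keys keep their names in both formats,
    decoder-layer keys need the HF-to-TE remapping.
    """
    decoder, shared = [], []
    for key in keys:
        if key.startswith(DECODER_PREFIX):
            decoder.append(key)
        else:
            shared.append(key)
    return decoder, shared


# Illustrative key names only; a real shard contains many more entries.
sample_keys = [
    "model.embed_tokens.weight",
    "model.layers.0.self_attn.q_proj.weight",
    "model.layers.0.mlp.gate_proj.weight",
    "model.norm.weight",
    "lm_head.weight",
]
decoder_keys, shared_keys = split_checkpoint_keys(sample_keys)
print("needs remapping:", decoder_keys)
print("loads by name:  ", shared_keys)
```

Keys in the second group are the ones that load_state_dict(strict=False) can match by name; keys in the first group have to go through the per-layer mapping.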

Theoretical Basis

The complete weight mapping for each decoder layer (model.layers.N.) is:

HF Parameter Name	TE Parameter Name	Notes
`input_layernorm.weight`	`self_attention.layernorm_qkv.layer_norm_weight`	RMSNorm weight before self-attention
`self_attn.q_proj.weight`	`self_attention.layernorm_qkv.query_weight`	Query projection weight
`self_attn.k_proj.weight`	`self_attention.layernorm_qkv.key_weight`	Key projection weight
`self_attn.v_proj.weight`	`self_attention.layernorm_qkv.value_weight`	Value projection weight
`self_attn.o_proj.weight`	`self_attention.proj.weight`	Output projection weight
`post_attention_layernorm.weight`	`layernorm_mlp.layer_norm_weight`	RMSNorm weight before MLP
`mlp.gate_proj.weight`	`layernorm_mlp.fc1_weight[:intermediate_size]`	First half of fused fc1 (gate projection)
`mlp.up_proj.weight`	`layernorm_mlp.fc1_weight[intermediate_size:]`	Second half of fused fc1 (up projection)
`mlp.down_proj.weight`	`layernorm_mlp.fc2_weight`	Down projection weight
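The table above can be transcribed into a lookup, sketched here in plain Python. The names `HF_TO_TE_SUFFIX`, `HF_TO_TE_FC1_HALF`, and `te_target` are hypothetical and chosen for illustration; only the parameter-name strings come from the table.

```python
import re

# Whole-tensor copies: HF suffix -> TE suffix (from the table above).
HF_TO_TE_SUFFIX = {
    "input_layernorm.weight": "self_attention.layernorm_qkv.layer_norm_weight",
    "self_attn.q_proj.weight": "self_attention.layernorm_qkv.query_weight",
    "self_attn.k_proj.weight": "self_attention.layernorm_qkv.key_weight",
    "self_attn.v_proj.weight": "self_attention.layernorm_qkv.value_weight",
    "self_attn.o_proj.weight": "self_attention.proj.weight",
    "post_attention_layernorm.weight": "layernorm_mlp.layer_norm_weight",
    "mlp.down_proj.weight": "layernorm_mlp.fc2_weight",
}

# Slice copies: both land in fc1_weight, in different row ranges.
TE_FC1_SUFFIX = "layernorm_mlp.fc1_weight"
HF_TO_TE_FC1_HALF = {
    "mlp.gate_proj.weight": "first",   # rows [:intermediate_size]
    "mlp.up_proj.weight": "second",    # rows [intermediate_size:]
}

# Splits "model.layers.N." from the rest of the key.
LAYER_KEY = re.compile(r"^(model\.layers\.\d+\.)(.+)$")


def te_target(hf_key):
    """Return (te_key, half) for a decoder-layer HF key, or None if unmapped.

    half is None for whole-tensor copies and "first"/"second" for the
    two row ranges of fc1_weight.
    """
    match = LAYER_KEY.match(hf_key)
    if match is None:
        return None
    prefix, suffix = match.groups()
    if suffix in HF_TO_TE_SUFFIX:
        return prefix + HF_TO_TE_SUFFIX[suffix], None
    if suffix in HF_TO_TE_FC1_HALF:
        return prefix + TE_FC1_SUFFIX, HF_TO_TE_FC1_HALF[suffix]
    return None


print(te_target("model.layers.3.self_attn.q_proj.weight"))
print(te_target("model.layers.0.mlp.up_proj.weight"))
```

Keeping the gate/up entries in a separate dictionary makes it explicit that they are slice writes into a shared tensor rather than simple renames.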

The fc1_weight concatenation layout is critical:

# TE's fc1_weight has shape (2 * intermediate_size, hidden_size)
# First intermediate_size rows: gate_proj (SiLU activation path)
# Last intermediate_size rows: up_proj (linear path)
fc1_weight[:intermediate_size] = gate_proj.weight  # SwiGLU gate
fc1_weight[intermediate_size:] = up_proj.weight     # SwiGLU up

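This layout matches TE's SwiGLU implementation, which splits the fc1 output at intermediate_size and computes SiLU(gate(x)) * up(x). Writing the two projections in the opposite order would apply the activation to the wrong half without raising any error.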
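The equivalence between the fused and unfused paths can be checked numerically. The sketch below uses plain Python lists as stand-ins for tensors so it runs without PyTorch; the tiny sizes and weight values are made up for illustration, and it models only the split-then-SiLU arithmetic described above, not TE's actual kernels.

```python
import math


def silu(v):
    """SiLU activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))


def matvec(weight, x):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]


hidden_size, intermediate_size = 2, 3
# Illustrative weights, each of shape (intermediate_size, hidden_size).
gate_proj = [[0.5, -1.0], [1.0, 2.0], [0.0, 0.25]]
up_proj = [[1.0, 1.0], [-0.5, 0.5], [2.0, 0.0]]
x = [1.0, -2.0]

# Unfused path: two separate projections, then SiLU(gate) * up.
reference = [
    silu(g) * u for g, u in zip(matvec(gate_proj, x), matvec(up_proj, x))
]

# Fused path: gate rows first, up rows second, one matrix-vector product.
fc1_weight = gate_proj + up_proj  # list concatenation stacks the rows
fc1_out = matvec(fc1_weight, x)
gate_out = fc1_out[:intermediate_size]
up_out = fc1_out[intermediate_size:]
fused = [silu(g) * u for g, u in zip(gate_out, up_out)]

print(all(math.isclose(a, b) for a, b in zip(fused, reference)))
```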

Important: The gate and up projection weights may reside in different checkpoint shards, so the mapping function must handle partial loading. Each weight is mapped independently, guarded by a check that its key exists in the current shard.
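A sketch of the per-key existence checks, restricted to the gate/up case for brevity (the remaining table entries would follow the same pattern). Nested lists stand in for tensors so the example runs without PyTorch; `copy_rows` and `map_shard` are hypothetical names, not functions from TE.

```python
def copy_rows(dst, src, start=0):
    """Overwrite dst[start:start+len(src)] in place.

    Stand-in for an in-place tensor slice copy.
    """
    dst[start:start + len(src)] = src


def map_shard(hf_shard, te_state, num_layers, intermediate_size):
    """Copy whatever gate/up weights this shard holds into te_state."""
    for n in range(num_layers):
        prefix = f"model.layers.{n}."
        fc1_key = prefix + "layernorm_mlp.fc1_weight"
        gate_key = prefix + "mlp.gate_proj.weight"
        up_key = prefix + "mlp.up_proj.weight"

        # Each half is checked on its own: the other one may arrive
        # in a later shard, or may already have been written.
        if gate_key in hf_shard:
            copy_rows(te_state[fc1_key], hf_shard[gate_key], 0)
        if up_key in hf_shard:
            copy_rows(te_state[fc1_key], hf_shard[up_key], intermediate_size)


# Toy setup: one layer, intermediate_size=2, hidden_size=2.
te_state = {
    "model.layers.0.layernorm_mlp.fc1_weight": [[0.0, 0.0] for _ in range(4)],
}
shard_a = {"model.layers.0.mlp.gate_proj.weight": [[1.0, 1.0], [2.0, 2.0]]}
shard_b = {"model.layers.0.mlp.up_proj.weight": [[3.0, 3.0], [4.0, 4.0]]}

for shard in (shard_a, shard_b):  # one shard at a time
    map_shard(shard, te_state, num_layers=1, intermediate_size=2)

print(te_state["model.layers.0.layernorm_mlp.fc1_weight"])
```

Because each write targets a fixed row range, the result does not depend on the order in which the shards are processed.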

Usage

Use this principle when loading pretrained HuggingFace LLaMA checkpoints into TE-accelerated models. The weight mapping is applied:

During TELlamaForCausalLM.from_pretrained_local() for each checkpoint shard
When converting HF checkpoints to TE format for deployment
When debugging weight loading issues between HF and TE model formats

The mapping function operates in place on the TE model's state dict and handles sharded checkpoints by processing one shard at a time.

Sources

TransformerEngine
