Principle:NVIDIA TransformerEngine HF To TE Weight Mapping
Overview
Mapping pretrained HuggingFace model weights to TransformerEngine's fused module parameter layout.
Description
TransformerEngine's fused modules (LayerNormLinear, LayerNormMLP) gather several HuggingFace parameters under a single module, and in the MLP case into a single tensor, so that the layer norm and the following projections can run as fused kernels. When loading a pretrained HuggingFace LLaMA checkpoint into a TE-accelerated model, the weights must be correctly mapped from HF's parameter naming and layout conventions to TE's fused parameter structure.
The key mapping challenges are:
- Fused LayerNorm + QKV: HF stores `input_layernorm.weight` and separate `q_proj.weight`, `k_proj.weight`, `v_proj.weight` as independent parameters. TE's `layernorm_qkv` module combines the layer norm weight with the Q/K/V projection weights.
- Fused LayerNorm + MLP: HF stores `post_attention_layernorm.weight`, `gate_proj.weight`, `up_proj.weight`, and `down_proj.weight` separately. TE's `layernorm_mlp` module fuses the layer norm weight and combines gate and up projections into a single `fc1_weight` tensor.
- Concatenated Gate/Up Projections: TE concatenates `gate_proj` and `up_proj` into a single `fc1_weight` tensor. The gate projection occupies the first `intermediate_size` rows, and the up projection occupies the remaining rows.
- Different Naming Conventions: HF uses `self_attn.q_proj.weight` while TE uses `self_attention.layernorm_qkv.query_weight`.
Non-decoder-layer parameters (token embeddings, final layer norm, LM head) use the same names in both HF and TE and can be loaded directly via load_state_dict(strict=False).
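As a rough illustration of that split, the sketch below partitions checkpoint keys by the `model.layers.` prefix. The helper name `split_decoder_keys` is hypothetical and not part of TE or HF; the example keys follow the usual HF LLaMA naming.

```python
def split_decoder_keys(keys, layer_prefix="model.layers."):
    """Partition parameter names into decoder-layer keys and all other keys."""
    decoder, other = [], []
    for key in keys:
        (decoder if key.startswith(layer_prefix) else other).append(key)
    return decoder, other


hf_keys = [
    "model.embed_tokens.weight",
    "model.layers.0.input_layernorm.weight",
    "model.layers.0.self_attn.q_proj.weight",
    "model.norm.weight",
    "lm_head.weight",
]

# decoder_keys need the HF -> TE mapping; direct_keys keep their names.
decoder_keys, direct_keys = split_decoder_keys(hf_keys)
print(direct_keys)
```

Only the keys in `decoder_keys` need renaming or slicing; the rest can be passed through unchanged.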
Theoretical Basis
The complete weight mapping for each decoder layer (model.layers.N.) is:
| HF Parameter Name | TE Parameter Name | Notes |
|---|---|---|
| `input_layernorm.weight` | `self_attention.layernorm_qkv.layer_norm_weight` | RMSNorm weight before self-attention |
| `self_attn.q_proj.weight` | `self_attention.layernorm_qkv.query_weight` | Query projection weight |
| `self_attn.k_proj.weight` | `self_attention.layernorm_qkv.key_weight` | Key projection weight |
| `self_attn.v_proj.weight` | `self_attention.layernorm_qkv.value_weight` | Value projection weight |
| `self_attn.o_proj.weight` | `self_attention.proj.weight` | Output projection weight |
| `post_attention_layernorm.weight` | `layernorm_mlp.layer_norm_weight` | RMSNorm weight before MLP |
| `mlp.gate_proj.weight` | `layernorm_mlp.fc1_weight[:intermediate_size]` | First half of fused fc1 (gate projection) |
| `mlp.up_proj.weight` | `layernorm_mlp.fc1_weight[intermediate_size:]` | Second half of fused fc1 (up projection) |
| `mlp.down_proj.weight` | `layernorm_mlp.fc2_weight` | Down projection weight |
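The name translation in the table can be sketched as a lookup on the per-layer suffix. This is a minimal illustration, not the function TE ships: `translate_key`, `HF_TO_TE_SUFFIX`, and `FC1_SLICES` are hypothetical names, and the gate/up entries are kept separate because they target row slices of `fc1_weight` rather than whole tensors.

```python
import re

# Whole-tensor renames, copied from the mapping table.
HF_TO_TE_SUFFIX = {
    "input_layernorm.weight": "self_attention.layernorm_qkv.layer_norm_weight",
    "self_attn.q_proj.weight": "self_attention.layernorm_qkv.query_weight",
    "self_attn.k_proj.weight": "self_attention.layernorm_qkv.key_weight",
    "self_attn.v_proj.weight": "self_attention.layernorm_qkv.value_weight",
    "self_attn.o_proj.weight": "self_attention.proj.weight",
    "post_attention_layernorm.weight": "layernorm_mlp.layer_norm_weight",
    "mlp.down_proj.weight": "layernorm_mlp.fc2_weight",
}

# HF suffixes that land in one half of the fused fc1_weight tensor.
FC1_SLICES = {"mlp.gate_proj.weight": "gate", "mlp.up_proj.weight": "up"}

_LAYER_RE = re.compile(r"^(model\.layers\.\d+\.)(.+)$")


def translate_key(hf_key):
    """Return (te_key, fc1_part) for a decoder-layer key, or None if unmapped.

    fc1_part is None for whole-tensor copies and "gate" or "up" for the
    two halves of fc1_weight.
    """
    match = _LAYER_RE.match(hf_key)
    if match is None:
        return None  # not a decoder-layer parameter
    prefix, suffix = match.groups()
    if suffix in HF_TO_TE_SUFFIX:
        return prefix + HF_TO_TE_SUFFIX[suffix], None
    if suffix in FC1_SLICES:
        return prefix + "layernorm_mlp.fc1_weight", FC1_SLICES[suffix]
    return None


print(translate_key("model.layers.3.self_attn.q_proj.weight"))
print(translate_key("model.layers.0.mlp.up_proj.weight"))
```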
The fc1_weight concatenation layout is critical:
    # TE's fc1_weight has shape (2 * intermediate_size, hidden_size)
    # First intermediate_size rows: gate_proj (SiLU activation path)
    # Last intermediate_size rows: up_proj (linear path)
    fc1_weight[:intermediate_size] = gate_proj.weight  # SwiGLU gate
    fc1_weight[intermediate_size:] = up_proj.weight    # SwiGLU up
This layout matches TE's SwiGLU implementation, which splits fc1_weight at intermediate_size to compute SiLU(gate(x)) * up(x).
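To see why the row order matters, the following sketch checks numerically that the gate-first layout reproduces SiLU(gate(x)) * up(x), while the swapped order does not. NumPy arrays stand in for the real tensors here (an assumption made so the example runs without a GPU or TE installed), and the sizes are arbitrary toy values.

```python
import numpy as np


def silu(z):
    """SiLU activation: z * sigmoid(z)."""
    return z / (1.0 + np.exp(-z))


rng = np.random.default_rng(0)
hidden_size, intermediate_size = 8, 16  # toy sizes, not real LLaMA dimensions
gate_w = rng.standard_normal((intermediate_size, hidden_size))
up_w = rng.standard_normal((intermediate_size, hidden_size))
x = rng.standard_normal((4, hidden_size))

# Unfused reference: two separate projections, as in the HF MLP.
reference = silu(x @ gate_w.T) * (x @ up_w.T)

# Fused layout: gate rows first, up rows second.
fc1_weight = np.concatenate([gate_w, up_w], axis=0)
fc1_out = x @ fc1_weight.T  # shape (4, 2 * intermediate_size)
fused = silu(fc1_out[:, :intermediate_size]) * fc1_out[:, intermediate_size:]

# Swapped layout: computes SiLU(up(x)) * gate(x) instead.
swapped_out = x @ np.concatenate([up_w, gate_w], axis=0).T
wrong = silu(swapped_out[:, :intermediate_size]) * swapped_out[:, intermediate_size:]

print(np.allclose(reference, fused), np.allclose(reference, wrong))
```

A swapped layout loads without any shape error, so a check of this kind against the unfused computation is one way to catch it.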
Important: The gate and up projection weights may reside in different checkpoint shards, so the mapping function must handle partial loading: each weight is mapped independently and only if its key is present in the shard being processed.
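A minimal sketch of such existence checks for the fused `fc1_weight` follows. The helper `copy_mlp_fc1` is hypothetical, plain dicts of NumPy arrays stand in for the HF shard and the TE state dict, and the constant fill values exist only to make the two halves easy to tell apart.

```python
import numpy as np


def copy_mlp_fc1(hf_shard, te_state, layer_prefix, intermediate_size):
    """Copy whichever of gate_proj / up_proj this shard holds into fc1_weight.

    Writes in place into te_state and returns the list of halves it filled.
    """
    fc1 = te_state[layer_prefix + "layernorm_mlp.fc1_weight"]
    gate_key = layer_prefix + "mlp.gate_proj.weight"
    up_key = layer_prefix + "mlp.up_proj.weight"
    copied = []
    if gate_key in hf_shard:  # gate may live in a different shard than up
        fc1[:intermediate_size] = hf_shard[gate_key]
        copied.append("gate")
    if up_key in hf_shard:
        fc1[intermediate_size:] = hf_shard[up_key]
        copied.append("up")
    return copied


hidden_size, intermediate_size = 4, 6  # toy sizes
prefix = "model.layers.0."
te_state = {
    prefix + "layernorm_mlp.fc1_weight": np.zeros((2 * intermediate_size, hidden_size))
}
shard_a = {prefix + "mlp.gate_proj.weight": np.full((intermediate_size, hidden_size), 1.0)}
shard_b = {prefix + "mlp.up_proj.weight": np.full((intermediate_size, hidden_size), 2.0)}

# Each shard fills only the half it actually contains.
filled = [copy_mlp_fc1(s, te_state, prefix, intermediate_size) for s in (shard_a, shard_b)]
print(filled)
```

Because each half is written through a slice of the existing tensor, processing the second shard leaves the rows filled by the first one untouched.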
Usage
Use this principle when loading pretrained HuggingFace LLaMA checkpoints into TE-accelerated models. The weight mapping is applied:
- During `TELlamaForCausalLM.from_pretrained_local()` for each checkpoint shard
- When converting HF checkpoints to TE format for deployment
- When debugging weight loading issues between HF and TE model formats
The mapping function operates in-place on the TE model's state dict and handles sharded checkpoints by processing one shard at a time.
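For debugging, it can help to see which parameters each shard holds. Sharded HF checkpoints typically ship an index JSON whose `weight_map` maps parameter names to shard files; the sketch below groups such a map by file. The index contents and shard filenames are made up for illustration, and `group_by_shard` is a hypothetical helper.

```python
import json
from collections import defaultdict

# Illustrative index content; a real index lists every parameter.
index_json = """
{
  "weight_map": {
    "model.layers.15.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
    "model.layers.15.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
    "model.layers.15.mlp.down_proj.weight": "model-00002-of-00002.safetensors"
  }
}
"""


def group_by_shard(weight_map):
    """Invert a name -> file map into file -> list of parameter names."""
    shards = defaultdict(list)
    for name, filename in weight_map.items():
        shards[filename].append(name)
    return dict(shards)


shards = group_by_shard(json.loads(index_json)["weight_map"])
for filename in sorted(shards):
    # A real loader would read this file, map its tensors, then release it
    # before moving on to the next shard.
    print(filename, shards[filename])
```

In this made-up index, layer 15's gate and up projections fall in different files, which is the situation the per-weight existence checks are meant to handle.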