Principle:NVIDIA TransformerEngine TE Model Initialization

Overview

Constructing a HuggingFace model with TransformerEngine decoder layers using class factory and monkey-patching patterns.

Description

This principle describes how to create a HuggingFace LlamaForCausalLM instance where all decoder layers are TE TransformerLayer instances. The approach combines two Python patterns:

__new__ class factory: Python's __new__ method is used to create and return a LlamaForCausalLM instance (rather than a TELlamaForCausalLM instance), bypassing normal __init__ dispatch.
Context manager monkey-patching: The replace_decoder context manager (see Principle:NVIDIA_TransformerEngine_Layer_Monkey_Patching) temporarily swaps the decoder layer class during model construction.

The resulting object is a genuine LlamaForCausalLM instance with full access to all HuggingFace APIs:

model.generate() for text generation
model.save_pretrained() for checkpoint saving
model.parameters() for optimizer construction
Integration with HuggingFace Trainer and FSDP/DeepSpeed

The only difference from a standard LlamaForCausalLM is that the decoder layers are TELlamaDecoderLayer instances instead of LlamaDecoderLayer instances.

Additionally, the model provides a from_pretrained_local class method that:

Creates the TE-enhanced model from a config
Loads pretrained weights from a sharded HF checkpoint
Maps HF weight names to TE weight names using the weight mapping function (see Principle:NVIDIA_TransformerEngine_HF_To_TE_Weight_Mapping)
Handles memory management by processing one shard at a time and releasing memory after each

Theoretical Basis

The __new__ + context manager pattern allows constructing a standard HF model class while injecting custom layer implementations. This works because:

__new__ in Python controls object creation (not initialization). By defining __new__ to return a LlamaForCausalLM instance, the class acts as a factory rather than a traditional class.
The replace_decoder context manager ensures that during LlamaForCausalLM.__init__, the LlamaDecoderLayer class reference points to TELlamaDecoderLayer.
Since __new__ returns an instance of a different class (LlamaForCausalLM), Python does not call TELlamaForCausalLM.__init__ -- the full initialization is handled by LlamaForCausalLM.__init__ with the patched decoder class.

For weight loading, the from_pretrained_local method processes sharded checkpoints iteratively:

for shard_file in resolved_archive_file:
    state_dict = load_state_dict(shard_file)
    replace_params(state_dict, vanilla_model.state_dict(), config)  # TE-specific weights
    vanilla_model.load_state_dict(state_dict, strict=False)          # HF-compatible weights
    del state_dict
    gc.collect()

The strict=False parameter allows partial state dict loading since each shard contains only a subset of the model's parameters, and the TE-specific parameter names differ from what load_state_dict expects.

Usage

Use this principle after defining a TELlamaDecoderLayer to create the full model with TE layers. The typical workflow is:

Define TELlamaDecoderLayer (the TE decoder layer wrapper)
Define replace_decoder (the monkey-patching context manager)
Define TELlamaForCausalLM using __new__ + context manager
Instantiate with TELlamaForCausalLM(config) for random initialization, or use TELlamaForCausalLM.from_pretrained_local(path, config=config) for pretrained weights

Sources

TransformerEngine

Domains

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment