Principle:CarperAI Trlx Checkpoint Conversion
| Knowledge Sources | |
|---|---|
| Domains | Model_Conversion, NLP, Megatron |
| Last Updated | 2026-02-07 16:00 GMT |
Overview
A technique for converting model weight checkpoints from one framework's format to another while correctly handling tensor-parallel sharding.
Description
Training frameworks such as HuggingFace Transformers, NeMo/Megatron, and DeepSpeed each use their own checkpoint format, with different parameter naming conventions and tensor layouts. Checkpoint conversion maps weights between these formats while handling tensor model parallelism (TP), which requires slicing weight matrices along specific dimensions so that each GPU rank holds one shard. Key challenges include correctly partitioning attention (Q/K/V), MLP, and embedding layers, and generating the target framework's configuration metadata.
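To illustrate why attention weights need special care, the following is a minimal sketch, not taken from any framework's converter. It assumes a fused QKV weight stored as `[in, out]` with Q, K, and V concatenated along the output axis; the function name `shard_fused_qkv` is hypothetical, and real checkpoints may instead interleave Q, K, and V per attention head, in which case the regrouping differs.

```python
import numpy as np


def shard_fused_qkv(fused, tp_rank, total_tp):
    """Slice Q, K and V separately, then re-fuse this rank's pieces.

    Assumes `fused` has shape [hidden, 3 * hidden] laid out as [Q | K | V].
    """
    q, k, v = np.split(fused, 3, axis=1)
    parts = [np.split(p, total_tp, axis=1)[tp_rank] for p in (q, k, v)]
    return np.concatenate(parts, axis=1)


hidden = 4
# Fill Q, K, V with 1s, 2s, 3s so the origin of each column is visible.
fused = np.concatenate(
    [np.full((hidden, hidden), value) for value in (1.0, 2.0, 3.0)], axis=1
)

correct = shard_fused_qkv(fused, tp_rank=0, total_tp=2)
naive = fused[:, : fused.shape[1] // 2]  # plain column slice of the fused tensor

print(correct[0])  # [1. 1. 2. 2. 3. 3.]  -> half of Q, half of K, half of V
print(naive[0])    # [1. 1. 1. 1. 2. 2.]  -> all of Q, half of K, no V
```

Under this layout assumption, the naive slice gives rank 0 the whole Q projection and none of V, which is why fused attention weights are usually split per projection before sharding.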
Usage
Use this principle when migrating models between training frameworks. Common scenarios include converting HuggingFace models to NeMo format for large-scale distributed training, or converting trained models back to HuggingFace format for inference and deployment.
Theoretical Basis
The conversion follows a weight-mapping protocol:
- Name Mapping: Map source parameter names to target naming conventions.
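- Tensor Partitioning: For TP rank $r$ of $N$ total ranks, each rank keeps the slice $W_r = W[:,\ r \cdot d/N : (r+1) \cdot d/N]$ for column-parallel layers, where $d$ is the size of the partitioned dimension, or the corresponding slice of the transpose (that is, a block of rows) for row-parallel layers.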
- Config Generation: Generate the target framework's configuration YAML/JSON from source model attributes.
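The tensor-partitioning step can be sketched with NumPy as follows. This is an illustrative sketch under one assumption: weights are stored as `[in, out]`, so column-parallel slicing acts on axis 1 and row-parallel slicing on axis 0. PyTorch's `nn.Linear` stores weights as `[out, in]`, so the axes would be swapped there. The helper names `shard_bounds`, `slice_columns`, and `slice_rows` are hypothetical.

```python
import numpy as np


def shard_bounds(dim_size, tp_rank, total_tp):
    """Return the [start, stop) index range owned by tp_rank along one dimension."""
    if dim_size % total_tp != 0:
        raise ValueError("partitioned dimension must be divisible by the TP size")
    shard = dim_size // total_tp
    return tp_rank * shard, (tp_rank + 1) * shard


def slice_columns(weight, tp_rank, total_tp):
    """Column-parallel: keep every row, take this rank's block of columns."""
    start, stop = shard_bounds(weight.shape[1], tp_rank, total_tp)
    return weight[:, start:stop]


def slice_rows(weight, tp_rank, total_tp):
    """Row-parallel: keep every column, take this rank's block of rows."""
    start, stop = shard_bounds(weight.shape[0], tp_rank, total_tp)
    return weight[start:stop, :]


W = np.arange(24, dtype=np.float32).reshape(4, 6)
total_tp = 2
col_shards = [slice_columns(W, r, total_tp) for r in range(total_tp)]
row_shards = [slice_rows(W, r, total_tp) for r in range(total_tp)]

# Concatenating the shards along the partitioned axis restores the original.
assert np.array_equal(np.concatenate(col_shards, axis=1), W)
assert np.array_equal(np.concatenate(row_shards, axis=0), W)
```

The round-trip assertions are a cheap check worth keeping in a real converter: if concatenating the shards does not reproduce the source tensor, the partitioning axis or the bounds are wrong.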
Pseudo-code Logic:
# Abstract algorithm (NOT real implementation)
for tp_rank in range(total_tp):
    nemo_state = {}
    for source_name, target_name in name_mapping:
        weight = source_model[source_name]
        if is_column_parallel(target_name):
            weight = slice_columns(weight, tp_rank, total_tp)
        elif is_row_parallel(target_name):
            weight = slice_rows(weight, tp_rank, total_tp)
        nemo_state[target_name] = weight
    save(nemo_state, f"tp_rank_{tp_rank}/model_weights.pt")
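A runnable variant of the abstract loop above might look like the sketch below. Everything named here is an assumption made for illustration: the parameter names in `NAME_MAPPING`, the config keys in `build_target_config`, and the `[in, out]` weight layout. It also returns the per-rank state dicts in memory instead of writing one file per rank.

```python
import json

import numpy as np

# Hypothetical names for illustration; real frameworks use their own conventions.
NAME_MAPPING = {
    "src.mlp.up.weight": "tgt.mlp.up.weight",
    "src.mlp.down.weight": "tgt.mlp.down.weight",
    "src.norm.weight": "tgt.norm.weight",
}
COLUMN_PARALLEL = {"tgt.mlp.up.weight"}
ROW_PARALLEL = {"tgt.mlp.down.weight"}


def convert(source_state, total_tp):
    """Return one target state dict per TP rank (weights stored as [in, out])."""
    shards = []
    for tp_rank in range(total_tp):
        target_state = {}
        for source_name, target_name in NAME_MAPPING.items():
            weight = source_state[source_name]
            if target_name in COLUMN_PARALLEL:
                weight = np.split(weight, total_tp, axis=1)[tp_rank]
            elif target_name in ROW_PARALLEL:
                weight = np.split(weight, total_tp, axis=0)[tp_rank]
            # Anything else (here, the norm weight) is replicated on every rank.
            target_state[target_name] = weight
        shards.append(target_state)
    return shards


def build_target_config(source_config, total_tp):
    """Derive target config metadata; the key names are assumptions."""
    return {
        "hidden_size": source_config["hidden_size"],
        "ffn_hidden_size": source_config["ffn_hidden_size"],
        "tensor_model_parallel_size": total_tp,
    }


def relu(a):
    return np.maximum(a, 0.0)


rng = np.random.default_rng(0)
source_config = {"hidden_size": 4, "ffn_hidden_size": 8}
source_state = {
    "src.mlp.up.weight": rng.standard_normal((4, 8)),
    "src.mlp.down.weight": rng.standard_normal((8, 4)),
    "src.norm.weight": np.ones(4),
}
shards = convert(source_state, total_tp=2)
config_json = json.dumps(build_target_config(source_config, 2), indent=2)

# Sanity check: summing the per-rank partial outputs reproduces the unsharded MLP.
x = rng.standard_normal((3, 4))
full = relu(x @ source_state["src.mlp.up.weight"]) @ source_state["src.mlp.down.weight"]
partial = sum(relu(x @ s["tgt.mlp.up.weight"]) @ s["tgt.mlp.down.weight"] for s in shards)
assert np.allclose(full, partial)
```

Pairing a column-parallel layer with a row-parallel one, as in this toy MLP, means each rank can apply the elementwise activation to its own shard, and the full output is recovered by summing the per-rank results.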