Principle: Alibaba ROLL MCoreAdapter Checkpoint Conversion
| Knowledge Sources | |
|---|---|
| Domains | Checkpointing, CLI_Tools |
| Last Updated | 2026-02-07 20:00 GMT |
Overview
Bidirectional checkpoint format conversion between a distributed sharded format and a standard consolidated format, enabling interoperability between parallel training and single-device inference frameworks.
Description
Distributed training frameworks shard model weights across GPUs according to the parallelism configuration (tensor parallel, pipeline parallel, expert parallel). The resulting checkpoint consists of multiple shard files, each containing a subset of the weights in a layout that depends on the specific parallelism degrees. Standard inference frameworks expect a single consolidated checkpoint with all weights in their original unsharded form.
This principle defines a checkpoint conversion tool that operates in two directions:
- MCA to HuggingFace: Reads a sharded checkpoint (with an mca_config.json describing the parallelism layout), gathers the sharded weight tensors, applies inverse transformations (concatenating tensor-parallel shards, merging pipeline-parallel stages), renames keys according to a model-specific template, and writes the result as HuggingFace-compatible safetensors files. This direction also supports converting LoRA adapter checkpoints by extracting adapter weights from the sharded format.
- HuggingFace to MCA: Reads a standard HuggingFace checkpoint, applies the model-specific template to rename keys, splits weights according to the target parallelism configuration (slicing along output dimensions for column-parallel layers, slicing along input dimensions for row-parallel layers), and writes the result as a sharded MCA checkpoint.
The conversion tool operates as a standalone CLI utility that can run on a single machine without distributed initialization. It uses the same DistributingParallelArguments as the training system to specify the target parallelism layout.
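The model-specific key-renaming template used by both directions can be sketched as a small mapping expanded per layer. This is a minimal illustration: the key names below are hypothetical stand-ins, and the real templates are registered per model architecture.

```python
# Hypothetical template: "{i}" is a per-layer placeholder. Actual key names
# depend on the model architecture and its registered template.
TEMPLATE = {
    "embedding.word_embeddings.weight": "model.embed_tokens.weight",
    "decoder.layers.{i}.mlp.linear_fc2.weight": "model.layers.{i}.mlp.down_proj.weight",
}

def build_key_map(template: dict, num_layers: int) -> dict:
    """Expand a layer-templated mapping into concrete mca_key -> hf_key pairs."""
    mapping = {}
    for src, dst in template.items():
        if "{i}" in src:
            for i in range(num_layers):
                mapping[src.format(i=i)] = dst.format(i=i)
        else:
            mapping[src] = dst
    return mapping
```

Inverting the resulting dictionary gives the renaming for the opposite (HuggingFace-to-MCA) direction.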
Usage
Use this principle when:
- You have completed distributed training and need to export the model checkpoint for single-device inference or deployment.
- You want to start distributed training from a pre-existing HuggingFace checkpoint without running a distributed conversion job.
- Converting LoRA adapter weights between sharded and consolidated formats for adapter merging or serving.
Theoretical Basis
Conversion direction detection:
IF exists(mca_config.json) in checkpoint_path:
direction = MCA_TO_HF
ELSE:
direction = HF_TO_MCA
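In Python, the detection above reduces to a file-existence check on the marker file named in the source:

```python
import os

def detect_direction(checkpoint_path: str) -> str:
    # A sharded MCA checkpoint carries an mca_config.json describing its
    # parallelism layout; a HuggingFace checkpoint does not.
    if os.path.exists(os.path.join(checkpoint_path, "mca_config.json")):
        return "MCA_TO_HF"
    return "HF_TO_MCA"
```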
MCA to HuggingFace conversion steps:
1. Load mca_config.json to determine parallelism layout
2. FOR each parameter key in the model:
a. Load shards from all TP/PP/EP ranks
b. Apply inverse sharding:
column_parallel: W = concatenate(W_0, W_1, ..., W_{T-1}, dim=0)
row_parallel: W = concatenate(W_0, W_1, ..., W_{T-1}, dim=1)
c. Apply template key renaming: mca_key -> hf_key
3. Write consolidated checkpoint as safetensors
4. Copy HuggingFace config files (config.json, tokenizer, etc.)
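Step 2b, the inverse-sharding gather, can be illustrated with NumPy arrays standing in for the weight tensors (the actual tool operates on torch tensors; the axis logic is identical):

```python
import numpy as np

def gather_tp_shards(shards, parallel_mode):
    """Reassemble a full weight from its tensor-parallel shards.

    Column-parallel layers are split along the output dim (dim 0) and
    row-parallel layers along the input dim (dim 1), so the gather
    concatenates along that same axis."""
    axis = 0 if parallel_mode == "column" else 1
    return np.concatenate(shards, axis=axis)
```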
HuggingFace to MCA conversion steps:
1. Load HuggingFace config to determine model architecture
2. Look up model-specific template for key mapping
3. Parse DistributingParallelArguments for target layout
4. FOR each HuggingFace parameter:
a. Apply template key renaming: hf_key -> mca_key
b. Apply sharding:
column_parallel: W_r = W[r*(d/T):(r+1)*(d/T), :]
row_parallel: W_r = W[:, r*(k/T):(r+1)*(k/T)]
c. Write shard for each rank
5. Generate mca_config.json
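Step 4b is the exact inverse of the MCA-to-HuggingFace gather: column-parallel weights are sliced along the output dimension and row-parallel weights along the input dimension. A NumPy sketch (torch tensors behave the same way):

```python
import numpy as np

def shard_weight(W, tp_size, parallel_mode):
    """Split a full weight into tp_size tensor-parallel shards.

    Column-parallel: slice the output dim (dim 0);
    row-parallel: slice the input dim (dim 1).
    Assumes the sliced dimension divides evenly by tp_size."""
    axis = 0 if parallel_mode == "column" else 1
    return np.split(W, tp_size, axis=axis)
```

Concatenating the shards back along the same axis recovers W exactly, which is the round-trip property the two conversion directions rely on.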
Dtype conversion:
The conversion tool supports optional precision casting:
IF bf16: torch_dtype = torch.bfloat16
ELIF fp16: torch_dtype = torch.float16
ELSE: preserve original dtype
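The flag-to-dtype resolution can be sketched as a pure function returning a target dtype name, with None meaning "preserve the original dtype"; resolving the name to torch.bfloat16 or torch.float16 is then a lookup on the torch module. The IF/ELIF order implies bf16 takes precedence when both flags are set.

```python
def resolve_dtype(bf16: bool, fp16: bool):
    """Map the precision flags to a target dtype name.

    Returns None when neither flag is set, meaning the checkpoint's
    original dtype is preserved. bf16 wins if both are passed,
    mirroring the IF/ELIF ordering above."""
    if bf16:
        return "bfloat16"
    if fp16:
        return "float16"
    return None
```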