Principle: Alibaba ROLL MCoreAdapter Model Factory
| Knowledge Sources | |
|---|---|
| Domains | Model_Architecture, Distributed_Computing |
| Last Updated | 2026-02-07 20:00 GMT |
Overview
A factory pattern for constructing distributed transformer models with virtual pipeline parallelism, managing bidirectional weight conversion between checkpoint formats during loading and saving.
Description
In large-scale distributed training, a single logical model may be split across multiple GPUs using pipeline parallelism. With virtual pipeline parallelism, each GPU holds multiple non-contiguous chunks (stages) of the model rather than a single contiguous block, which reduces pipeline bubble overhead by enabling interleaved scheduling.
This principle describes the model construction and loading factory that handles three responsibilities:
- Virtual Pipeline Model Construction: A VirtualModels wrapper creates multiple model instances, one per virtual pipeline stage. Each instance is initialized with the correct pre_process and post_process flags indicating whether it contains the embedding layer (first stage) or the output/loss layer (last stage).
- Bidirectional Checkpoint Loading: The factory determines at load time whether the checkpoint is in the native sharded format or in HuggingFace format. For native checkpoints with matching parallelism, state dictionaries are loaded directly. For HuggingFace checkpoints, a ModelConverter performs on-the-fly weight reshaping, splitting, and renaming to match the distributed layout.
- Bidirectional Checkpoint Saving: Models can be saved in both the native distributed format (preserving the parallel layout for efficient resumption) and the HuggingFace format (gathering sharded weights back to a single consolidated format for interoperability).
The factory also handles weight tying (shared embeddings and output weights), vocabulary resizing when the tokenizer is larger than the model vocabulary, and PEFT (Parameter-Efficient Fine-Tuning) model wrapping for LoRA adapter state management.
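As a concrete illustration of the vocabulary-resizing responsibility, the sketch below pads an embedding table when the tokenizer vocabulary exceeds the model's. This is a minimal assumption-laden example (the function name and zero-initialization policy are illustrative, not the MCoreAdapter API; real implementations often initialize new rows from the mean of existing embeddings):

```python
import numpy as np

def resize_vocab(embedding: np.ndarray, tokenizer_vocab_size: int) -> np.ndarray:
    """Grow the embedding table when the tokenizer vocabulary is larger
    than the model's. Extra rows are zero-initialized in this sketch."""
    model_vocab, hidden = embedding.shape
    if tokenizer_vocab_size <= model_vocab:
        return embedding  # nothing to do
    extra = np.zeros((tokenizer_vocab_size - model_vocab, hidden),
                     dtype=embedding.dtype)
    return np.concatenate([embedding, extra], axis=0)
```

When embeddings and output weights are tied, the same resized tensor must be shared by both layers so the checkpoint remains consistent.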
Usage
Use this principle when:
- Building a model loading pipeline that must support both distributed-native and standard checkpoint formats.
- The training system uses virtual pipeline parallelism and needs to manage multiple model chunks per GPU with correct stage assignments.
- You need to save trained model weights in a format compatible with standard inference frameworks.
Theoretical Basis
Virtual pipeline model creation:
FOR i IN range(virtual_pipeline_parallel_size):
    set_virtual_pipeline_rank(i)
    pre_process = is_pipeline_first_stage(vp_stage=i)
    post_process = is_pipeline_last_stage(vp_stage=i)
    models[i] = GPTModel(config, pre_process, post_process, vp_stage=i)
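The loop above can be sketched as runnable Python. This is a simplified stand-in, not the real factory: `GPTModelStub` replaces the actual model class, and the global-stage mapping `stage = vp_index * pp_size + pp_rank` is the usual interleaved-schedule assignment, assumed here rather than taken from the source:

```python
from dataclasses import dataclass

@dataclass
class GPTModelStub:
    """Stand-in for one model chunk (one virtual pipeline stage)."""
    vp_stage: int
    pre_process: bool   # True only if this chunk owns the embedding layer
    post_process: bool  # True only if this chunk owns the output/loss layer

def build_virtual_models(pp_rank: int, pp_size: int, vp_size: int) -> list:
    """Create one model instance per virtual pipeline stage on this rank."""
    models = []
    for i in range(vp_size):
        # Interleaved scheduling: rank r holds global stages
        # r, r + pp_size, r + 2*pp_size, ...
        stage = i * pp_size + pp_rank
        models.append(GPTModelStub(
            vp_stage=i,
            pre_process=(stage == 0),
            post_process=(stage == vp_size * pp_size - 1),
        ))
    return models

chunks = build_virtual_models(pp_rank=0, pp_size=2, vp_size=2)
```

With 2 pipeline ranks and 2 virtual stages each, rank 0 holds global stages 0 and 2, so only its first chunk carries the embedding; the loss layer (stage 3) lives on rank 1's second chunk.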
Checkpoint loading decision:
IF exists(mca_config) AND distribute_config_match(saved, current):
    state_dict = load_sharded_checkpoint(path)
ELIF exists(hf_config):
    FOR each vp_stage:
        state_dict[stage] = converter.load_from_hf(path, vp_stage)
ELSE:
    RAISE error
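A minimal sketch of the format-detection step, assuming the native format is marked by an `mca_config.json` and the HuggingFace format by a `config.json` (file names are illustrative; the parallelism-match check on native checkpoints is omitted here):

```python
import os

def resolve_checkpoint_format(path: str) -> str:
    """Return 'native' for a sharded MCore-style checkpoint,
    'huggingface' for an HF-style one; raise otherwise."""
    if os.path.exists(os.path.join(path, "mca_config.json")):
        # Native sharded checkpoint: load state dicts directly
        # (a real loader would also verify parallelism settings match).
        return "native"
    if os.path.exists(os.path.join(path, "config.json")):
        # HF checkpoint: route through the ModelConverter per vp_stage.
        return "huggingface"
    raise ValueError(f"No recognizable checkpoint config under {path}")
```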
Weight conversion (HuggingFace to distributed):
For a linear layer with weight W of shape (d, k) under tensor parallelism of size T:
    column_parallel: W_shard = W[:, rank * (k/T) : (rank+1) * (k/T)]
    row_parallel:    W_shard = W[rank * (d/T) : (rank+1) * (d/T), :]
For pipeline parallelism with P stages and L total layers:
    layers_per_stage = L / P
    stage_layers = layers[rank * layers_per_stage : (rank+1) * layers_per_stage]
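The slicing rules above are straightforward to express in NumPy. This sketch assumes even divisibility (k and d by T, L by P), which real converters enforce or pad for:

```python
import numpy as np

def shard_column_parallel(W: np.ndarray, rank: int, T: int) -> np.ndarray:
    """Split the second (k) dimension of W into T equal shards."""
    k = W.shape[1]
    assert k % T == 0, "k must divide evenly by tensor-parallel size"
    return W[:, rank * (k // T):(rank + 1) * (k // T)]

def shard_row_parallel(W: np.ndarray, rank: int, T: int) -> np.ndarray:
    """Split the first (d) dimension of W into T equal shards."""
    d = W.shape[0]
    assert d % T == 0, "d must divide evenly by tensor-parallel size"
    return W[rank * (d // T):(rank + 1) * (d // T), :]

def stage_layers(layers: list, rank: int, P: int) -> list:
    """Assign a contiguous block of transformer layers to pipeline stage `rank`."""
    per = len(layers) // P
    return layers[rank * per:(rank + 1) * per]
```

The HF-to-distributed converter applies these slices per parameter while also renaming keys to the distributed layout; the reverse (save) direction concatenates shards along the same axes.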
Sharded state dictionary structure:
Single VP: {"model": model.sharded_state_dict()}
Multi VP: {"model0": models[0].sharded_state_dict(),
"model1": models[1].sharded_state_dict(), ...}