
Principle:Alibaba ROLL MCoreAdapter Model Configuration

From Leeroopedia


Knowledge Sources
Domains Configuration, Model_Architecture
Last Updated 2026-02-07 20:00 GMT

Overview

A unified configuration dataclass that bridges framework-specific model specifications with distributed parallelism parameters, enabling cross-framework model loading through a single configuration object.

Description

Training large language models across different frameworks (such as a custom distributed training engine and the HuggingFace ecosystem) requires translating between their respective configuration formats. Each framework encodes model hyperparameters differently: one may specify num_attention_heads and hidden_size directly, while another derives these from a TransformerConfig with fields for tensor parallelism, pipeline parallelism, and activation recomputation.

This principle defines a configuration layer that inherits from both the distributed engine's TransformerConfig and a PretrainedConfig base class. It serves three purposes:

  1. Format Translation: When loading from a HuggingFace checkpoint, the configuration is automatically converted via a template system that maps HuggingFace config keys to the internal representation. When loading from a native checkpoint, the JSON configuration is read directly.
  2. Parallelism Validation: The configuration enforces consistency constraints such as ensuring the number of layers is divisible by the product of pipeline and virtual pipeline sizes, enabling sequence parallelism when both tensor and expert parallelism are active, and selecting the correct MoE token dispatcher for variable-length sequences.
  3. Checkpoint Compatibility Checking: A distribute_config_match method verifies that a saved checkpoint was created with the same parallelism configuration (tensor parallel size, pipeline parallel size, expert parallel size, etc.) as the current training session, determining whether direct loading or format conversion is required.
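The dual-inheritance pattern behind these three purposes can be sketched in a few lines. The stub base classes below stand in for megatron.core's TransformerConfig and the adapter's local PretrainedConfig; the field names and the `from_dict` helper are illustrative assumptions, and the real McaModelConfig carries many more fields plus the template conversion machinery.

```python
class TransformerConfigStub:
    """Stand-in for megatron.core's TransformerConfig (illustrative fields)."""
    num_layers: int = 24
    hidden_size: int = 2048
    tensor_model_parallel_size: int = 1
    pipeline_model_parallel_size: int = 1


class PretrainedConfigStub:
    """Stand-in for the local PretrainedConfig base."""

    @classmethod
    def from_dict(cls, d):
        # Hypothetical helper: populate a config object from a plain dict,
        # e.g. one parsed from a checkpoint's JSON configuration.
        obj = cls()
        for key, value in d.items():
            setattr(obj, key, value)
        return obj


class McaModelConfig(TransformerConfigStub, PretrainedConfigStub):
    """Unified configuration visible to both frameworks (sketch only)."""


# Values loaded from a checkpoint override the transformer defaults.
cfg = McaModelConfig.from_dict(
    {"num_layers": 32, "tensor_model_parallel_size": 2}
)
```

Because the class inherits from both bases, a single object exposes the distributed engine's transformer fields and the HuggingFace-style loading interface at once.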

Usage

Use this principle when:

  • Designing a configuration system that must support loading models from multiple checkpoint formats with different parallelism layouts.
  • You need to validate that parallelism parameters (TP, PP, EP, CP) are self-consistent before initializing distributed process groups.
  • The system must decide at load time whether a checkpoint can be loaded directly or requires weight conversion.

Theoretical Basis

The configuration hierarchy follows a multiple-inheritance pattern:

TransformerConfig          PretrainedConfig
 (megatron.core)             (local base)
         \                       /
          \                     /
            McaModelConfig
                  |
           MLAMcaModelConfig
   (Multi-Latent Attention variant)

Key validation constraints:

1. Layer divisibility:

num_layers mod (pp_size * vpp_size) == 0
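The divisibility constraint can be enforced with a short validator. The function name and signature here are illustrative, not ROLL's actual API.

```python
def check_layer_divisibility(num_layers: int, pp_size: int,
                             vpp_size: int = 1) -> None:
    # Hypothetical validator: every (virtual) pipeline stage must own an
    # equal share of layers, so the layer count must divide evenly by the
    # product of pipeline and virtual-pipeline sizes.
    stages = pp_size * max(vpp_size, 1)
    if num_layers % stages != 0:
        raise ValueError(
            f"num_layers={num_layers} is not divisible by "
            f"pp_size*vpp_size={stages}"
        )
```

For example, 24 layers split across pp_size=4 and vpp_size=2 gives 3 layers per virtual stage and passes, while pp_size=5 would fail.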

2. Sequence parallelism requirement:

IF tensor_model_parallel_size > 1 AND expert_model_parallel_size > 1:
    sequence_parallel = True
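The rule above translates directly into a small resolver; the name and parameters are assumptions for illustration.

```python
def resolve_sequence_parallel(tp_size: int, ep_size: int,
                              requested: bool) -> bool:
    # Sketch of the constraint: when tensor and expert parallelism are
    # both active, sequence parallelism is forced on regardless of what
    # the user requested; otherwise the user's setting is kept.
    if tp_size > 1 and ep_size > 1:
        return True
    return requested
```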

3. Checkpoint compatibility check:

distribute_config_match(old, new) = (
    old.tp == new.tp AND
    old.pp == new.pp AND
    old.vpp == new.vpp AND
    old.ep == new.ep AND
    old.transformer_impl == new.transformer_impl
)

Configuration loading decision tree:

IF mca_config.json exists in checkpoint:
    config = load from JSON
    IF distribute_config_match(saved_config, current_config):
        load state_dict directly
    ELSE:
        convert from HuggingFace format
ELSE IF config.json (HuggingFace) exists:
    config = template.convert_hf_to_mca_config(hf_config)
    convert weights from HuggingFace format
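The decision tree can be encoded as a pure function over the three facts it branches on. The return labels are illustrative; the real loader inspects mca_config.json and config.json on disk and invokes the template converter itself.

```python
def decide_load_strategy(has_mca_config: bool, has_hf_config: bool,
                         configs_match: bool) -> str:
    # Sketch of the checkpoint-loading decision tree described above.
    if has_mca_config:
        # Native checkpoint: load directly only if the saved parallelism
        # layout matches the current session's layout.
        if configs_match:
            return "load_state_dict_directly"
        return "convert_from_hf"
    if has_hf_config:
        # HuggingFace checkpoint: always requires weight conversion.
        return "convert_weights_from_hf"
    raise FileNotFoundError(
        "checkpoint contains neither mca_config.json nor config.json"
    )
```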
