Implementation: Alibaba ROLL McaModelConfig
| Knowledge Sources | |
|---|---|
| Domains | Configuration, Model_Architecture |
| Last Updated | 2026-02-07 20:00 GMT |
Overview
Configuration dataclasses that bridge HuggingFace model configs with Megatron-Core TransformerConfig, enabling seamless loading and conversion between the two checkpoint formats.
Description
model_config.py defines a hierarchy of configuration dataclasses used throughout the MCoreAdapter framework:
PretrainedConfig (lines 33-202) is the base configuration class that provides:
- JSON serialization and deserialization (to_json_string, from_json_file)
- Checkpoint save/load (save_pretrained, from_pretrained)
- HuggingFace auto-map file management for custom model architectures
- Argument merging from DistributingParallelArguments with priority handling
- Automatic detection and loading from either MCA or HuggingFace checkpoint formats
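The format auto-detection in the last bullet can be pictured as a small helper. This is a minimal sketch assuming only the file names mentioned in this document (mca_config.json for MCA checkpoints, config.json for HuggingFace checkpoints); the actual control flow inside from_pretrained may differ:

import os

def detect_checkpoint_format(model_name_or_path: str) -> str:
    # Hypothetical helper: an MCA checkpoint ships mca_config.json,
    # a HuggingFace checkpoint ships config.json.
    if os.path.isfile(os.path.join(model_name_or_path, "mca_config.json")):
        return "mca"
    if os.path.isfile(os.path.join(model_name_or_path, "config.json")):
        return "huggingface"
    raise ValueError(f"No recognizable config file found in {model_name_or_path}")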
McaModelConfig (lines 205-374) inherits from both Megatron-Core's TransformerConfig and PretrainedConfig, combining distributed training configuration with model architecture configuration. Its __post_init__ method (lines 266-358) performs extensive validation and initialization:
- Configures activation functions (SwiGLU, squared ReLU)
- Sets precision dtype (fp16/bf16)
- Initializes Yarn RoPE parameters when position_embedding_type is yarn
- Configures recomputation defaults
- Enforces sequence parallelism constraints with tensor/expert parallelism
- Validates pipeline parallel layer divisibility
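Two of these rules are easy to show in isolation. The snippet below is a simplified, standalone sketch of the layer-divisibility and sequence-parallelism checks; the parameter names follow Megatron-Core's TransformerConfig, but the exact conditions and error handling (raise, warn, or silently adjust) live in __post_init__ itself:

def validate_parallel_settings(
    num_layers: int,
    pipeline_model_parallel_size: int,
    tensor_model_parallel_size: int,
    sequence_parallel: bool,
) -> None:
    # Hypothetical standalone version of two of the __post_init__ checks.
    # Pipeline parallelism needs the layer count to split evenly across stages.
    if num_layers % pipeline_model_parallel_size != 0:
        raise ValueError(
            f"num_layers={num_layers} is not divisible by "
            f"pipeline_model_parallel_size={pipeline_model_parallel_size}"
        )
    # Sequence parallelism is only meaningful alongside tensor parallelism.
    if sequence_parallel and tensor_model_parallel_size == 1:
        raise ValueError("sequence_parallel requires tensor_model_parallel_size > 1")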
MLAMcaModelConfig (lines 377-382) extends McaModelConfig with MLATransformerConfig for Multi-Latent Attention support.
Usage
Use McaModelConfig.from_pretrained() to load model configuration from either a Megatron-Core checkpoint (containing mca_config.json) or a HuggingFace checkpoint (containing config.json). Pass TrainingArguments to configure distributed parallelism settings that override checkpoint defaults.
Code Reference
Source Location
- Repository: Alibaba_ROLL
- File: mcore_adapter/src/mcore_adapter/models/model_config.py
- Lines: 1-382
Key Classes
PretrainedConfig
@dataclass
class PretrainedConfig:
name_or_path: Optional[str] = None
hf_model_type: Optional[str] = None
hf_config_json: Optional[str] = None
Key methods:
- from_pretrained(model_name_or_path, args) (lines 157-192): Class method that loads config from MCA JSON or HuggingFace checkpoint. If args is provided, merges parallelism settings and calls initialize_megatron(). Caches auto-map files to local disk on rank 0.
- save_pretrained(save_directory) (lines 106-112): Saves config as mca_config.json and copies any HuggingFace auto-map Python files.
- update_with_args(args, verbose) (lines 132-155): Merges DistributingParallelArguments fields into config, with args taking priority; logs when a value differs from the checkpoint value (a generic sketch of this merge follows the list).
- from_json_file(json_file_path) (lines 82-104): Loads config from JSON, filtering out unknown/deprecated keys with a warning.
- to_json_string() (lines 64-75): Serializes dataclass fields to JSON, skipping callables and enums.
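The priority rule described for update_with_args can be illustrated with a generic dataclass merge. This is a sketch of the described behavior, assuming args is a dataclass instance; it is not the adapter's exact implementation:

import dataclasses
import logging

logger = logging.getLogger(__name__)

def merge_args_into_config(config, args, verbose: bool = True):
    # Hypothetical stand-in for update_with_args: fields set on `args` win
    # over values loaded from the checkpoint, and overrides are logged.
    for field in dataclasses.fields(args):
        new_value = getattr(args, field.name)
        if new_value is None or not hasattr(config, field.name):
            continue
        old_value = getattr(config, field.name)
        if verbose and old_value != new_value:
            logger.info("overriding %s: %r -> %r", field.name, old_value, new_value)
        setattr(config, field.name, new_value)
    return config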
McaModelConfig
@dataclass
class McaModelConfig(TransformerConfig, PretrainedConfig):
position_embedding_type: Literal["learned_absolute", "rope", "mrope", "yarn", "none"] = "rope"
padded_vocab_size: Optional[int] = None
swiglu: bool = False
tie_embeddings_and_output_weights: bool = False
max_sequence_length: int = 0
rotary_base: int = 10000
transformer_impl: Literal["local", "transformer_engine"] = "transformer_engine"
# ... additional fields
Key methods:
- __post_init__() (lines 266-358): Validates and initializes all configuration. Sets activation functions, precision, recomputation defaults, sequence parallelism requirements, attention backends, MoE router dtypes, and pipeline layout.
- distribute_config_match(other) (lines 360-374): Checks whether two configs have compatible distributed settings (TP, PP, VP, EP sizes, transformer implementation, and pipeline split accounting).
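A simplified, standalone version of the compatibility check might compare only the parallel sizes and the transformer implementation. The attribute names below follow Megatron-Core's ModelParallelConfig naming and are an assumption here; the full rule, including the pipeline split accounting, is in distribute_config_match itself:

def distribute_settings_match(a, b) -> bool:
    # Hypothetical approximation of distribute_config_match: two configs are
    # considered compatible when their parallel layout and backend agree.
    keys = (
        "tensor_model_parallel_size",
        "pipeline_model_parallel_size",
        "virtual_pipeline_model_parallel_size",
        "expert_model_parallel_size",
        "transformer_impl",
    )
    return all(getattr(a, k, None) == getattr(b, k, None) for k in keys)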
MLAMcaModelConfig
@dataclass
class MLAMcaModelConfig(McaModelConfig, MLATransformerConfig):
multi_latent_attention: Optional[bool] = True
Extension for Multi-Latent Attention (MLA) models like DeepSeek. Combines McaModelConfig with Megatron-Core's MLATransformerConfig.
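The choice between McaModelConfig and MLAMcaModelConfig follows the model architecture. The dispatch below is purely hypothetical (both the selection rule and the model_type strings are assumptions, not the adapter's logic) and only illustrates why an MLA-specific config class exists:

from mcore_adapter.models.model_config import McaModelConfig, MLAMcaModelConfig

def pick_config_class(hf_model_type: str):
    # Hypothetical dispatch: MLA-style architectures (e.g. DeepSeek) get the
    # MLA config variant, everything else gets the base McaModelConfig.
    mla_model_types = {"deepseek_v2", "deepseek_v3"}  # assumed model_type strings
    return MLAMcaModelConfig if hf_model_type in mla_model_types else McaModelConfig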
Import
import dataclasses
import torch
import torch.nn.functional as F
from megatron.core.transformer import MLATransformerConfig, TransformerConfig
from transformers import AutoConfig
from mcore_adapter.models.model_config import McaModelConfig, MLAMcaModelConfig, PretrainedConfig
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| model_name_or_path | str | Yes | Path to checkpoint directory containing mca_config.json or config.json |
| args | TrainingArguments | No | Training arguments with parallelism settings to merge into config |
| position_embedding_type | str | No | Type of position embedding: rope, mrope, yarn, learned_absolute, or none |
| tensor_model_parallel_size | int | No | Degree of tensor parallelism |
| pipeline_model_parallel_size | int | No | Degree of pipeline parallelism |
| transformer_impl | str | No | transformer_engine or local |
Outputs
| Name | Type | Description |
|---|---|---|
| config | McaModelConfig | Fully initialized model configuration ready for model construction |
Usage Examples
from mcore_adapter.models.model_config import McaModelConfig
from mcore_adapter.training_args import TrainingArguments
# Load from HuggingFace checkpoint with custom parallelism
args = TrainingArguments(
tensor_model_parallel_size=4,
pipeline_model_parallel_size=2,
bf16=True,
)
config = McaModelConfig.from_pretrained("/path/to/model", args)
# Load from MCA checkpoint (auto-detected)
config = McaModelConfig.from_pretrained("/path/to/mca_checkpoint", args)
# Save config
config.save_pretrained("/path/to/output")
# Check compatibility between two configs
old_config = McaModelConfig.from_pretrained("/path/to/old_checkpoint")
is_compatible = config.distribute_config_match(old_config)