
Implementation:Alibaba ROLL McaModelConfig

From Leeroopedia


Knowledge Sources
Domains: Configuration, Model_Architecture
Last Updated: 2026-02-07 20:00 GMT

Overview

Configuration dataclasses that bridge HuggingFace model configs with Megatron-Core TransformerConfig, enabling loading and conversion between the two checkpoint formats.

Description

model_config.py defines a hierarchy of configuration dataclasses used throughout the MCoreAdapter framework:

PretrainedConfig (lines 33-202) is the base configuration class that provides:

  • JSON serialization and deserialization (to_json_string, from_json_file; a sketch follows this list)
  • Checkpoint save/load (save_pretrained, from_pretrained)
  • HuggingFace auto-map file management for custom model architectures
  • Argument merging from DistributingParallelArguments with priority handling
  • Automatic detection and loading from either MCA or HuggingFace checkpoint formats
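
As an illustration of the JSON round trip, the hedged sketch below serializes a config with to_json_string and reloads it with from_json_file. The field values are hypothetical, and from_json_file is assumed here to be a classmethod that takes a path to a JSON file, as described above.

import os
import tempfile

from mcore_adapter.models.model_config import PretrainedConfig

config = PretrainedConfig(name_or_path="/path/to/model", hf_model_type="llama")

with tempfile.TemporaryDirectory() as tmp_dir:
    json_path = os.path.join(tmp_dir, "mca_config.json")
    with open(json_path, "w") as f:
        f.write(config.to_json_string())       # serialize dataclass fields to JSON
    # Reload; unknown or deprecated keys would be filtered with a warning.
    restored = PretrainedConfig.from_json_file(json_path)
    assert restored.hf_model_type == "llama"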

McaModelConfig (lines 205-374) inherits from both Megatron-Core's TransformerConfig and PretrainedConfig, combining distributed training configuration with model architecture configuration. Its __post_init__ method (lines 266-344) performs extensive validation and initialization:

  • Configures activation functions (SwiGLU, squared ReLU)
  • Sets precision dtype (fp16/bf16)
  • Initializes Yarn RoPE parameters when position_embedding_type is yarn
  • Configures recomputation defaults
  • Enforces sequence parallelism constraints with tensor/expert parallelism
  • Validates pipeline parallel layer divisibility
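
For illustration, here is a hedged sketch of constructing an McaModelConfig directly so that __post_init__ runs. The field values are hypothetical, and the exact derived attributes (and whether distributed state or Transformer Engine must be available first, given the default transformer_impl) depend on the installed Megatron-Core version.

from mcore_adapter.models.model_config import McaModelConfig

# Hypothetical small model; __post_init__ runs automatically on construction.
config = McaModelConfig(
    num_layers=4,
    hidden_size=512,
    num_attention_heads=8,
    max_sequence_length=2048,
    swiglu=True,                      # expected to select a gated SiLU (SwiGLU) activation
    bf16=True,                        # expected to set the parameter dtype to bfloat16
    tensor_model_parallel_size=1,
    pipeline_model_parallel_size=1,
)
print(config.params_dtype)            # torch.bfloat16 if precision is applied as described
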

MLAMcaModelConfig (lines 377-382) extends McaModelConfig with MLATransformerConfig for Multi-Latent Attention support.

Usage

Use McaModelConfig.from_pretrained() to load model configuration from either a Megatron-Core checkpoint (containing mca_config.json) or a HuggingFace checkpoint (containing config.json). Pass TrainingArguments to configure distributed parallelism settings that override checkpoint defaults.

Code Reference

Source Location

mcore_adapter/models/model_config.py (importable as mcore_adapter.models.model_config)

Key Classes

PretrainedConfig

@dataclass
class PretrainedConfig:
    name_or_path: Optional[str] = None
    hf_model_type: Optional[str] = None
    hf_config_json: Optional[str] = None
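    # ... additional fields and methods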

Key methods:

  • from_pretrained(model_name_or_path, args) (lines 157-192): Class method that loads config from MCA JSON or HuggingFace checkpoint. If args is provided, merges parallelism settings and calls initialize_megatron(). Caches auto-map files to local disk on rank 0.
  • save_pretrained(save_directory) (lines 106-112): Saves config as mca_config.json and copies any HuggingFace auto-map Python files.
  • update_with_args(args, verbose) (lines 132-155): Merges DistributingParallelArguments fields into config, with args taking priority. Logs when values differ from checkpoint values; see the sketch after this list.
  • from_json_file(json_file_path) (lines 82-104): Loads config from JSON, filtering out unknown/deprecated keys with a warning.
  • to_json_string() (lines 64-75): Serializes dataclass fields to JSON, skipping callables and enums.
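
A hedged sketch of merging parallelism arguments into an existing config with update_with_args. The checkpoint path and argument values are hypothetical, and verbose is assumed to be accepted as a keyword, per the signature listed above.

from mcore_adapter.models.model_config import McaModelConfig
from mcore_adapter.training_args import TrainingArguments

config = McaModelConfig.from_pretrained("/path/to/mca_checkpoint")

# Merge overriding parallelism settings; values that differ from the checkpoint are logged.
args = TrainingArguments(tensor_model_parallel_size=2, pipeline_model_parallel_size=1)
config.update_with_args(args, verbose=True)
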

McaModelConfig

@dataclass
class McaModelConfig(TransformerConfig, PretrainedConfig):
    position_embedding_type: Literal["learned_absolute", "rope", "mrope", "yarn", "none"] = "rope"
    padded_vocab_size: Optional[int] = None
    swiglu: bool = False
    tie_embeddings_and_output_weights: bool = False
    max_sequence_length: int = 0
    rotary_base: int = 10000
    transformer_impl: Literal["local", "transformer_engine"] = "transformer_engine"
    # ... additional fields

Key methods:

  • __post_init__() (lines 266-358): Validates and initializes all configuration. Sets activation functions, precision, recomputation defaults, sequence parallelism requirements, attention backends, MoE router dtypes, and pipeline layout.
  • distribute_config_match(other) (lines 360-374): Checks whether two configs have compatible distributed settings (TP, PP, VP, EP sizes, transformer implementation, and pipeline split accounting).

MLAMcaModelConfig

@dataclass
class MLAMcaModelConfig(McaModelConfig, MLATransformerConfig):
    multi_latent_attention: Optional[bool] = True

Extension for Multi-Latent Attention (MLA) models like DeepSeek. Combines McaModelConfig with Megatron-Core's MLATransformerConfig.
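
A minimal sketch of loading an MLA model configuration; the checkpoint path is hypothetical, and loading otherwise works through the same from_pretrained path as McaModelConfig.

from mcore_adapter.models.model_config import MLAMcaModelConfig

# Loads like McaModelConfig; multi_latent_attention defaults to True.
mla_config = MLAMcaModelConfig.from_pretrained("/path/to/deepseek_checkpoint")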

Import

import dataclasses
import torch
import torch.nn.functional as F
from megatron.core.transformer import MLATransformerConfig, TransformerConfig
from transformers import AutoConfig
from mcore_adapter.models.model_config import McaModelConfig, MLAMcaModelConfig, PretrainedConfig

I/O Contract

Inputs

  • model_name_or_path (str, required): Path to a checkpoint directory containing mca_config.json or config.json
  • args (TrainingArguments, optional): Training arguments whose parallelism settings are merged into the config
  • position_embedding_type (str, optional): Type of position embedding: rope, mrope, yarn, learned_absolute, or none
  • tensor_model_parallel_size (int, optional): Degree of tensor parallelism
  • pipeline_model_parallel_size (int, optional): Degree of pipeline parallelism
  • transformer_impl (str, optional): transformer_engine or local

Outputs

  • config (McaModelConfig): Fully initialized model configuration, ready for model construction

Usage Examples

from mcore_adapter.models.model_config import McaModelConfig
from mcore_adapter.training_args import TrainingArguments

# Load from HuggingFace checkpoint with custom parallelism
args = TrainingArguments(
    tensor_model_parallel_size=4,
    pipeline_model_parallel_size=2,
    bf16=True,
)
config = McaModelConfig.from_pretrained("/path/to/model", args)

# Load from MCA checkpoint (auto-detected)
config = McaModelConfig.from_pretrained("/path/to/mca_checkpoint", args)

# Save config
config.save_pretrained("/path/to/output")

# Check compatibility between two configs
old_config = McaModelConfig.from_pretrained("/path/to/old_checkpoint")
is_compatible = config.distribute_config_match(old_config)
