Implementation: Alibaba ROLL Initialize Megatron
| Knowledge Sources | Details |
|---|---|
| Domains | Distributed_Computing, Initialization |
| Last Updated | 2026-02-07 20:00 GMT |
Overview
Initialization functions for setting up Megatron-Core's distributed environment, including process groups, model parallelism, and reproducible random seeds.
Description
This module provides the entry point for initializing Megatron-Core's distributed training infrastructure within the mcore_adapter framework. The initialize_megatron function serves as the main orchestrator, first checking whether model parallelism is already initialized and then delegating to _initialize_distributed for process group setup and _set_random_seed for reproducible seeding.
The _initialize_distributed function performs two key tasks: (1) initializing torch.distributed with the appropriate communication backend (e.g., NCCL for CUDA, HCCL for Ascend), using environment variables RANK and WORLD_SIZE for process identification, and (2) setting up Megatron-Core model parallelism groups via mpu.initialize_model_parallel, configuring tensor, pipeline, virtual pipeline, context, and expert parallelism dimensions from the TrainingArguments configuration.
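The two tasks above can be sketched in isolation. This is an illustrative sketch, not the module's actual code: the names choose_backend and process_identity are assumptions for exposition, showing how a communication backend might be mapped from the device type and how RANK and WORLD_SIZE identify each process.

```python
import os

# Hypothetical mapping from device type to torch.distributed backend,
# mirroring the pairing described above (NCCL for CUDA, HCCL for Ascend).
_DEVICE_BACKENDS = {
    "cuda": "nccl",   # NVIDIA GPUs
    "npu": "hccl",    # Huawei Ascend NPUs
    "cpu": "gloo",    # CPU fallback
}

def choose_backend(device_type: str) -> str:
    """Pick a communication backend for the given device type (sketch)."""
    return _DEVICE_BACKENDS.get(device_type, "gloo")

def process_identity() -> tuple:
    """Read this process's rank and the world size from the environment,
    as torch.distributed's env:// initialization expects."""
    rank = int(os.environ.get("RANK", "0"))
    world_size = int(os.environ.get("WORLD_SIZE", "1"))
    return rank, world_size
```

In the real function, these values would feed torch.distributed.init_process_group before mpu.initialize_model_parallel carves the world into parallel groups.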
The _set_random_seed function sets seeds across Python's random, NumPy, PyTorch, and Megatron's tensor parallel CUDA manual seed to ensure deterministic behavior across all components.
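A minimal sketch of that seeding pattern is shown below. The NumPy and PyTorch calls are guarded because those libraries may not be present everywhere; the Megatron tensor-parallel seeding step is noted in a comment rather than reproduced, since it requires an initialized CUDA RNG tracker.

```python
import random

def set_random_seed(seed: int) -> None:
    """Seed every RNG source for reproducibility (illustrative sketch)."""
    if seed <= 0:
        raise ValueError(f"seed must be a positive integer, got {seed}")
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)
        # The real module additionally seeds Megatron's tensor-parallel
        # CUDA RNG (e.g. tensor_parallel.model_parallel_cuda_manual_seed)
        # so that parallel ranks draw consistent dropout masks.
    except ImportError:
        pass
```

Calling this with the same seed twice makes subsequent draws from each seeded library repeat exactly.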
Usage
Call initialize_megatron at the start of training, before any model creation or data loading; it is typically invoked once per process in a distributed training job. The function is idempotent with respect to distributed initialization (it checks is_distribute_initialized first), but it always resets the random seeds.
Code Reference
Source Location
- Repository: Alibaba_ROLL
- File: mcore_adapter/src/mcore_adapter/initialize.py
- Lines: 1-71
Signature
def is_distribute_initialized() -> bool: ...
def _set_random_seed(seed_: int) -> None: ...
def initialize_megatron(args: TrainingArguments) -> None: ...
def _initialize_distributed(args: TrainingArguments) -> None: ...
Import
from mcore_adapter.initialize import initialize_megatron, is_distribute_initialized
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| args | TrainingArguments | Yes | Training configuration containing seed, device info, parallelism sizes (tensor_model_parallel_size, pipeline_model_parallel_size, virtual_pipeline_model_parallel_size, context_parallel_size, expert_model_parallel_size), ddp_backend, and ddp_timeout_delta |
| seed_ | int | Yes (for _set_random_seed) | Positive integer seed for reproducibility |
Outputs
| Name | Type | Description |
|---|---|---|
| (is_distribute_initialized) | bool | Whether Megatron model parallelism has been initialized |
| (initialize_megatron) | None | Side effects: initializes distributed process groups and sets random seeds |
| (_initialize_distributed) | None | Side effects: initializes torch.distributed and Megatron model parallel groups |
| (_set_random_seed) | None | Side effects: sets random seeds in random, numpy, torch, and tensor_parallel |
Usage Examples
from mcore_adapter.initialize import initialize_megatron
from mcore_adapter.training_args import TrainingArguments

# Create training arguments with parallelism config
args = TrainingArguments(
    seed=42,
    tensor_model_parallel_size=4,
    pipeline_model_parallel_size=2,
    expert_model_parallel_size=1,
)

# Initialize Megatron distributed environment
initialize_megatron(args)
# Now torch.distributed and Megatron model parallel groups are ready