
Implementation:Alibaba ROLL Initialize Megatron



Domains: Distributed_Computing, Initialization
Last Updated: 2026-02-07 20:00 GMT

Overview

Initialization functions for setting up Megatron-Core's distributed environment, including process groups, model parallelism, and reproducible random seeds.

Description

This module provides the entry point for initializing Megatron-Core's distributed training infrastructure within the mcore_adapter framework. The initialize_megatron function serves as the main orchestrator, first checking whether model parallelism is already initialized and then delegating to _initialize_distributed for process group setup and _set_random_seed for reproducible seeding.
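
A minimal sketch of this orchestration, using the names documented on this page (the exact function body is an assumption, not the verbatim source):

def initialize_megatron(args):
    # Skip process-group setup when model parallelism is already in place.
    if not is_distribute_initialized():
        _initialize_distributed(args)
    # Seeds are (re)set on every call, even when setup was skipped.
    _set_random_seed(args.seed)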

The _initialize_distributed function performs two key tasks: (1) initializing torch.distributed with the appropriate communication backend (e.g., NCCL for CUDA, HCCL for Ascend), using environment variables RANK and WORLD_SIZE for process identification, and (2) setting up Megatron-Core model parallelism groups via mpu.initialize_model_parallel, configuring tensor, pipeline, virtual pipeline, context, and expert parallelism dimensions from the TrainingArguments configuration.
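
The sketch below illustrates these two tasks. The argument names follow the I/O contract on this page; the control flow and the backend default are assumptions rather than the verbatim source:

import os

import torch
from megatron.core import parallel_state as mpu

def _initialize_distributed(args):
    # Task 1: bring up torch.distributed using the launcher-provided env vars.
    if not torch.distributed.is_initialized():
        torch.distributed.init_process_group(
            backend=args.ddp_backend or "nccl",  # NCCL for CUDA, HCCL for Ascend
            rank=int(os.environ["RANK"]),
            world_size=int(os.environ["WORLD_SIZE"]),
            timeout=args.ddp_timeout_delta,      # a datetime.timedelta
        )
    # Task 2: carve the world into Megatron-Core model-parallel groups.
    if not mpu.model_parallel_is_initialized():
        mpu.initialize_model_parallel(
            tensor_model_parallel_size=args.tensor_model_parallel_size,
            pipeline_model_parallel_size=args.pipeline_model_parallel_size,
            virtual_pipeline_model_parallel_size=args.virtual_pipeline_model_parallel_size,
            context_parallel_size=args.context_parallel_size,
            expert_model_parallel_size=args.expert_model_parallel_size,
        )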

The _set_random_seed function seeds Python's random module, NumPy, PyTorch, and Megatron's tensor-parallel CUDA RNG (model_parallel_cuda_manual_seed) to ensure deterministic behavior across all of these components.
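
A simplified sketch of that seeding (real implementations often derive per-rank offsets from the base seed; treat this as an assumption-laden outline, not the exact source):

import random

import numpy as np
import torch
from megatron.core import tensor_parallel

def _set_random_seed(seed_):
    if seed_ is None or seed_ <= 0:
        raise ValueError(f"Seed ({seed_}) must be a positive integer.")
    random.seed(seed_)      # Python stdlib RNG
    np.random.seed(seed_)   # NumPy RNG
    torch.manual_seed(seed_)  # PyTorch CPU (and default CUDA) RNG
    if torch.cuda.is_available():
        # Megatron's tracked CUDA RNG states for tensor parallelism.
        tensor_parallel.model_parallel_cuda_manual_seed(seed_)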

Usage

Call initialize_megatron at the start of training before any model creation or data loading. This is typically invoked once per process in a distributed training job. The function is idempotent with respect to distributed initialization (it checks is_distribute_initialized first) but will always reset random seeds.
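
The snippet below illustrates that behavior (with args as constructed in the usage example further down):

initialize_megatron(args)  # first call: sets up process groups and seeds
initialize_megatron(args)  # repeated call: skips group setup, only reseeds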

Code Reference

Source Location

Module: mcore_adapter.initialize (as given by the import below)

Signature

def is_distribute_initialized() -> bool: ...

def _set_random_seed(seed_: int) -> None: ...

def initialize_megatron(args: TrainingArguments) -> None: ...

def _initialize_distributed(args: TrainingArguments) -> None: ...

Import

from mcore_adapter.initialize import initialize_megatron, is_distribute_initialized

I/O Contract

Inputs

args (TrainingArguments, required)
    Training configuration containing the seed, device info, the parallelism sizes (tensor_model_parallel_size, pipeline_model_parallel_size, virtual_pipeline_model_parallel_size, context_parallel_size, expert_model_parallel_size), ddp_backend, and ddp_timeout_delta.
seed_ (int, required for _set_random_seed)
    Positive integer seed for reproducibility.

Outputs

is_distribute_initialized -> bool
    Whether Megatron model parallelism has been initialized.
initialize_megatron -> None
    Side effects: initializes distributed process groups and sets random seeds.
_initialize_distributed -> None
    Side effects: initializes torch.distributed and Megatron model parallel groups.
_set_random_seed -> None
    Side effects: sets random seeds in random, numpy, torch, and tensor_parallel.

Usage Examples

from mcore_adapter.initialize import initialize_megatron
from mcore_adapter.training_args import TrainingArguments

# Create training arguments with parallelism config
args = TrainingArguments(
    seed=42,
    tensor_model_parallel_size=4,
    pipeline_model_parallel_size=2,
    expert_model_parallel_size=1,
)

# Initialize Megatron distributed environment
initialize_megatron(args)
# Now torch.distributed and Megatron model parallel groups are ready
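
Because _initialize_distributed reads RANK and WORLD_SIZE from the environment, the script should be started with a distributed launcher such as torchrun, which exports both variables for every process. A small pre-flight check (a hypothetical guard, not part of the module):

import os

# Fail fast when launched without a distributed launcher;
# _initialize_distributed expects these variables to exist.
for var in ("RANK", "WORLD_SIZE"):
    assert var in os.environ, f"{var} is not set; launch with torchrun or similar"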
