
Principle:Alibaba ROLL MCoreAdapter Distributed Initialization

From Leeroopedia


Knowledge Sources
Domains: Distributed_Computing, Initialization
Last Updated: 2026-02-07 20:00 GMT

Overview

Distributed environment initialization that establishes process groups for multi-dimensional model parallelism and configures reproducible random number generation across all parallel ranks.

Description

Before distributed training can begin, the runtime must establish communication channels (process groups) that define how GPUs coordinate during training. In multi-dimensional parallelism, a single training job uses several overlapping process groups simultaneously:

  • Tensor model parallel group: GPUs that share the same layer but hold different weight shards
  • Pipeline model parallel group: GPUs that hold different sequential stages of the model
  • Context parallel group: GPUs that hold different sequence chunks for the same model
  • Expert model parallel group: GPUs that hold different MoE experts
  • Data parallel group: GPUs that hold complete model replicas and synchronize gradients
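The overlapping-group idea above can be sketched with plain rank arithmetic. The function below is an illustrative simplification to two dimensions (TP innermost, DP outermost), not the actual mpu implementation, which derives five-dimensional groups the same way:

```python
# Sketch: deriving overlapping process groups from rank arithmetic.
# Simplified to two dimensions (TP innermost, DP strided across TP groups);
# mpu.initialize_model_parallel applies the same idea across five dimensions.

def build_groups(world_size: int, tp: int):
    """Return (tp_groups, dp_groups) as lists of global-rank lists."""
    assert world_size % tp == 0, "world size must be divisible by TP"
    dp = world_size // tp
    # TP groups: consecutive ranks share the same layers, holding different shards
    tp_groups = [list(range(i * tp, (i + 1) * tp)) for i in range(dp)]
    # DP groups: ranks with the same position inside their TP group hold
    # corresponding replicas, so they are strided across the world
    dp_groups = [list(range(j, world_size, tp)) for j in range(tp)]
    return tp_groups, dp_groups

tp_groups, dp_groups = build_groups(world_size=8, tp=2)
print(tp_groups)  # [[0, 1], [2, 3], [4, 5], [6, 7]]
print(dp_groups)  # [[0, 2, 4, 6], [1, 3, 5, 7]]
```

Note that every rank belongs to exactly one group of each kind, which is what lets a single GPU participate in several collectives simultaneously.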

This principle describes the initialization sequence that:

  1. Torch Distributed Initialization: If not already initialized, sets the current device and initializes the torch.distributed process group with the platform-appropriate backend (NCCL for NVIDIA, HCCL for Ascend), using RANK and WORLD_SIZE from environment variables.
  2. Model Parallel Initialization: Calls mpu.initialize_model_parallel with the parallelism dimensions from the training arguments. This function creates all the necessary process groups by computing which ranks belong to which group based on the parallelism configuration. The total world size must equal TP×PP×CP×EP×DP.
  3. Random Seed Management: Sets seeds for Python's random module, NumPy, PyTorch, and the CUDA RNG tracker. The CUDA RNG tracker is particularly important because it manages per-layer random states for tensor-parallel dropout, ensuring that different tensor-parallel ranks use different random masks while remaining reproducible across restarts.
  4. Idempotent Initialization: Checks whether model parallelism is already initialized and skips re-initialization, allowing the function to be called safely from multiple code paths (e.g., from both model loading and trainer construction).
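The idempotency in step 4 reduces to a simple guard pattern, sketched below with a module-level flag. The names are illustrative only; in ROLL the actual checks are torch.distributed.is_initialized() and mpu.model_parallel_is_initialized():

```python
# Sketch of the idempotent-initialization guard (illustrative names; the
# real code checks torch.distributed.is_initialized() and
# mpu.model_parallel_is_initialized() rather than a local flag).

_initialized = False

def ensure_distributed_init(tp: int, pp: int, cp: int, ep: int, seed: int):
    """Safe to call from multiple code paths; only the first call does work."""
    global _initialized
    if _initialized:          # already set up -> skip re-initialization
        return False
    # ... init process group, model-parallel groups, and seeds here ...
    _initialized = True
    return True

assert ensure_distributed_init(2, 2, 1, 1, seed=42) is True   # first call initializes
assert ensure_distributed_init(2, 2, 1, 1, seed=42) is False  # later calls are no-ops
```

This is why both the model-loading path and the trainer constructor can call the same entry point without conflicting.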

Usage

Use this principle when:

  • Setting up a distributed training environment that uses any combination of tensor, pipeline, context, and expert parallelism.
  • You need reproducible training across restarts with correct per-rank random state management.
  • The initialization must be idempotent and safe to call multiple times from different components.

Theoretical Basis

Process group topology:

For a cluster of N=TP×PP×CP×EP×DP GPUs:

initialize_model_parallel(
    tensor_model_parallel_size  = TP,
    pipeline_model_parallel_size = PP,
    virtual_pipeline_model_parallel_size = VPP,
    context_parallel_size       = CP,
    expert_model_parallel_size  = EP
)

The data parallel size is computed implicitly:

DP = N / (TP × PP × CP × EP)
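A worked example of the implicit data-parallel size, using illustrative values, with a check of the world-size constraint stated above:

```python
# Worked example: data-parallel size is implied by the world size and the
# explicitly configured parallel dimensions (values are illustrative).
N, TP, PP, CP, EP = 64, 4, 2, 2, 1

assert N % (TP * PP * CP * EP) == 0, "world size must factor into the parallel dims"
DP = N // (TP * PP * CP * EP)
print(DP)  # 4

# Consistency check: total world size equals TP*PP*CP*EP*DP.
assert TP * PP * CP * EP * DP == N
```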

Initialization sequence:

1. IF NOT torch.distributed.is_initialized():
       current_platform.set_device(device)
       torch.distributed.init_process_group(
           backend = current_platform.communication_backend,
           rank    = RANK,
           world_size = WORLD_SIZE
       )

2. IF NOT mpu.model_parallel_is_initialized():
       mpu.initialize_model_parallel(TP, PP, VPP, CP, EP)

3. seed = training_args.seed
   random.seed(seed)
   numpy.random.seed(seed)
   torch.manual_seed(seed)
   tensor_parallel.model_parallel_cuda_manual_seed(seed)

Random state management:

The CUDA RNG tracker maintains separate random states indexed by name (e.g., model-parallel-rng). This ensures:

  • Different TP ranks use different random streams for dropout (so that when gathered, the full tensor has diverse dropout masks)
  • The same TP rank produces identical random streams across checkpoints (for reproducibility)

rng_state[r] = f(seed, r)   where r is the TP rank
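A minimal sketch of this per-rank seed derivation, using Python's random module as a stand-in for the CUDA RNG tracker. The fixed offset of 2718 matches Megatron-LM's tensor_parallel seeding convention, but treat it here as an assumption:

```python
import random

def model_parallel_seed(seed: int, tp_rank: int) -> int:
    # rng_state[r] = f(seed, r): a fixed offset plus the TP rank
    # (Megatron-LM uses an offset of 2718; assumed here for illustration).
    return seed + 2718 + tp_rank

# Different TP ranks -> different random streams, so gathered tensors
# see diverse dropout masks.
s0 = random.Random(model_parallel_seed(1234, 0)).random()
s1 = random.Random(model_parallel_seed(1234, 1)).random()
assert s0 != s1

# Same rank and base seed -> identical stream across restarts.
assert random.Random(model_parallel_seed(1234, 0)).random() == s0
```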
