Principle: Alibaba ROLL MCoreAdapter Distributed Training
| Knowledge Sources | |
|---|---|
| Domains | Training, Distributed_Computing |
| Last Updated | 2026-02-07 20:00 GMT |
Overview
An orchestration layer that combines a distributed parallel training engine's pipeline scheduling with a standard training framework's lifecycle management, enabling multi-dimensional parallelism with familiar training APIs.
Description
Training large language models at scale requires coordinating multiple forms of parallelism simultaneously: data parallelism (replicating the model across GPUs), tensor parallelism (splitting individual layers), pipeline parallelism (splitting the model into sequential stages), and expert parallelism (distributing MoE experts). Each form of parallelism has its own communication patterns, and the training loop must orchestrate forward and backward passes across all of them correctly.
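To make the interaction of these parallelism dimensions concrete, here is a minimal sketch of decomposing a flat global rank into per-dimension coordinates. The ordering (tensor-parallel fastest, then data-parallel, then pipeline-parallel) is an illustrative assumption, not necessarily the layout the actual adapter uses:

```python
def parallel_coords(rank, tp, dp, pp):
    """Map a flat global rank to (tp_rank, dp_rank, pp_rank) coordinates.

    Assumes world_size = tp * dp * pp and an illustrative ordering in
    which the tensor-parallel dimension varies fastest.
    """
    assert 0 <= rank < tp * dp * pp
    tp_rank = rank % tp
    dp_rank = (rank // tp) % dp
    pp_rank = rank // (tp * dp)
    return tp_rank, dp_rank, pp_rank
```

Each coordinate selects a different communication group: gradient all-reduce runs within the data-parallel group, activation/gradient point-to-point transfers run along the pipeline-parallel dimension.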
This principle describes a trainer that bridges two paradigms:
- Pipeline-Parallel Forward/Backward Scheduling: Instead of the standard single-GPU forward-backward-optimize cycle, the trainer uses a pipeline scheduling function that interleaves micro-batches across pipeline stages. This function handles the complex communication of activations and gradients between stages, including support for virtual pipeline parallelism (interleaved scheduling) that reduces pipeline bubble time.
- Distributed Data Parallel Wrapping: Each model chunk (one per virtual pipeline stage) is wrapped in a DistributedDataParallel (DDP) module that handles gradient all-reduce across data-parallel ranks. The trainer manages overlapped gradient reduction (computing communication while backward pass continues), overlapped parameter gathering (prefetching parameters for the next forward step), and distributed optimizer state sharding.
- Sequence Packing: For efficient GPU utilization, variable-length sequences can be packed into fixed-length batches. The trainer translates packed sequences into the format expected by the attention mechanism (THD format with cumulative sequence lengths), enabling cross-sequence attention masking through packed sequence parameters.
- Distributed Checkpointing: The trainer saves and loads optimizer state using a fully-parallel sharded strategy that distributes I/O across data-parallel ranks, minimizing checkpoint time at scale.
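The benefit of virtual pipeline parallelism mentioned above can be quantified with the commonly cited bubble-fraction approximation for 1F1B scheduling (from the Megatron-LM interleaved-schedule analysis; this sketch ignores communication overhead):

```python
def bubble_fraction(p, m, v=1):
    """Approximate pipeline bubble fraction for 1F1B scheduling.

    p: number of pipeline stages
    m: number of micro-batches
    v: virtual pipeline size (interleaved model chunks per stage)

    Approximation (p - 1) / (v * m); interleaving v chunks per stage
    cuts the bubble by a factor of v.
    """
    return (p - 1) / (v * m)
```

For example, with 8 pipeline stages and 32 micro-batches, interleaving two model chunks per stage roughly halves the idle fraction.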
Usage
Use this principle when:
- Training with any combination of tensor, pipeline, expert, and data parallelism using a pipeline-scheduled forward/backward pass.
- The training loop must support gradient accumulation across multiple micro-batches with pipeline scheduling.
- You need a training framework that supports sequence packing with Transformer Engine attention kernels.
- Checkpoint saving and loading must scale with the number of GPUs via sharded distributed checkpointing.
Theoretical Basis
Pipeline-parallel training step:
FOR each model_chunk in wrapped_models:
    model_chunk.zero_grad_buffer()
optimizer.zero_grad()
metrics = forward_backward_func(
    forward_step_func = inner_forward_step,
    data_iterator = data_iterator,
    model = wrapped_models,
    num_microbatches = gradient_accumulation_steps,
    seq_length = seq_length,
    micro_batch_size = per_device_batch_size,
    forward_only = False
)
update_successful, grad_norm = optimizer.step()
IF update_successful:
    lr_scheduler.step()
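The step above can be mimicked with a single-process toy (no real pipeline; the model, loss, and learning rate here are hypothetical): gradients are accumulated across micro-batches before one optimizer step, just as forward_backward_func accumulates before optimizer.step():

```python
def forward_backward(w, microbatches):
    """Toy analogue of forward_backward_func for loss = 0.5*(w - x)^2.

    Accumulates loss and gradient over all micro-batches, then averages,
    mirroring gradient accumulation across the pipeline schedule.
    """
    loss, grad = 0.0, 0.0
    for x in microbatches:
        loss += 0.5 * (w - x) ** 2
        grad += (w - x)
    n = len(microbatches)
    return loss / n, grad / n

def train_step(w, microbatches, lr=0.5):
    """One accumulated step: all micro-batches, then a single SGD update."""
    _, grad = forward_backward(w, microbatches)
    return w - lr * grad  # analogue of optimizer.step()
```

With w = 0 and micro-batches [1, 2, 3], the accumulated gradient is -2, so one step with lr = 0.5 moves w to 1.0.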
Loss computation with context parallelism:
When context parallelism splits the sequence across ranks, the loss is a masked mean over all shards:

    L = (Σ_r Σ_i ℓ_{r,i} · m_{r,i}) / (Σ_r Σ_i m_{r,i})

where m_{r,i} is the loss mask for token i on rank r and ℓ_{r,i} is that token's loss. Both numerator and denominator are all-reduced across the context-parallel group before division.
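This reduction can be sketched in pure Python, with each shard standing in for one context-parallel rank and the sums over shards playing the role of the all-reduce (an illustrative simulation, not the distributed implementation):

```python
def cp_masked_loss(loss_shards, mask_shards):
    """Masked mean loss over a sequence split across context-parallel shards.

    loss_shards[r][i]: per-token loss for token i on rank r.
    mask_shards[r][i]: loss mask (0 or 1) for token i on rank r.
    The sums over shards simulate the all-reduce of numerator and
    denominator before the final division.
    """
    num = sum(sum(l * m for l, m in zip(ls, ms))
              for ls, ms in zip(loss_shards, mask_shards))  # "all-reduced" numerator
    den = sum(sum(ms) for ms in mask_shards)                # "all-reduced" denominator
    return num / den
```

Summing numerator and denominator separately before dividing yields exactly the global masked mean, regardless of how many masked tokens land on each rank.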
Gradient synchronization with overlap:
config.no_sync_func = [model.no_sync for model in wrapped_models]
config.grad_sync_func = [model.start_grad_sync for model in wrapped_models]
config.finalize_model_grads_func = finalize_model_grads
config.grad_scale_func = optimizer.scale_loss
This enables overlapping gradient all-reduce with backward computation across pipeline stages.
Sequence packing format:
packed_seq_params = PackedSeqParams(
qkv_format = "thd",
cu_seqlens_q = cumulative_sequence_lengths,
cu_seqlens_kv = cumulative_sequence_lengths,
max_seqlen_q = max_sequence_length,
max_seqlen_kv = max_sequence_length
)