
Principle:Alibaba ROLL MCoreAdapter Training Configuration

From Leeroopedia


Knowledge Sources
Domains Configuration, Training, Distributed_Computing
Last Updated 2026-02-07 20:00 GMT

Overview

A hierarchical argument specification that merges distributed parallelism parameters with standard training hyperparameters into a single dataclass, enforcing cross-cutting constraints between parallelism and optimization settings.

Description

Large-scale distributed training requires configuring two distinct but interrelated sets of parameters: the distributed parallelism layout (how the model is partitioned across GPUs) and the training hyperparameters (learning rate, batch size, optimizer settings). These two sets interact in non-obvious ways: for example, enabling overlapped parameter gathering requires both the distributed optimizer and overlapped gradient reduction, and combining expert parallelism with tensor parallelism forces sequence parallelism on.

This principle defines a three-level configuration hierarchy:

  1. DistributingParallelArguments: The base level captures parallelism dimensions (tensor, pipeline, expert, context), activation recomputation strategy (full, selective, per-module), MoE routing configuration (dispatcher type, capacity factor, token drop policy), and FP8 training settings. It also supports an additional_configs escape hatch for passing arbitrary key-value pairs to the model config.
  2. MegatronArguments: Extends the parallelism base with distributed training options specific to the engine: distributed optimizer, overlapped gradient reduction, overlapped parameter gathering, DDP bucket sizing, optimizer CPU offloading, and sequence packing. It enforces that overlapped parameter gathering requires both distributed optimizer and overlapped gradient reduction.
  3. TrainingArguments: Combines the Megatron-specific arguments with a standard HuggingFace TrainingArguments class through multiple inheritance. This final class provides a unified interface where users can set both standard training parameters (learning rate, batch size, number of epochs) and distributed parallelism parameters in a single argument parser invocation.

The hierarchy enforces constraints at initialization time through __post_init__ validation, ensuring configurations are self-consistent before any GPU resources are allocated.
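The first two levels of this hierarchy can be sketched as plain dataclasses with a __post_init__ check. Field names follow this article; defaults and the exact ROLL class layout are assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class DistributingParallelArguments:
    # Parallelism dimensions across which the model is partitioned.
    tensor_model_parallel_size: int = 1
    pipeline_model_parallel_size: int = 1
    expert_model_parallel_size: int = 1
    context_parallel_size: int = 1
    recompute_granularity: Optional[str] = None   # {"full", "selective"}
    moe_token_dispatcher_type: str = "allgather"  # or "alltoall"
    # Escape hatch: arbitrary key-value pairs forwarded to the model config.
    additional_configs: dict = field(default_factory=dict)


@dataclass
class MegatronArguments(DistributingParallelArguments):
    use_distributed_optimizer: bool = False
    overlap_grad_reduce: bool = False
    overlap_param_gather: bool = False

    def __post_init__(self):
        # Cross-cutting constraint validated before any GPU resources
        # are allocated: overlapped parameter gathering requires both the
        # distributed optimizer and overlapped gradient reduction.
        if self.overlap_param_gather:
            assert self.use_distributed_optimizer, \
                "overlap_param_gather requires use_distributed_optimizer"
            assert self.overlap_grad_reduce, \
                "overlap_param_gather requires overlap_grad_reduce"
```

An inconsistent combination fails immediately at construction time, rather than partway through a distributed job.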

Usage

Use this principle when:

  • Building a training system that must expose both parallelism configuration and standard training hyperparameters through a single command-line interface.
  • You need initialization-time validation of interdependent distributed training parameters (e.g., overlap strategies require specific optimizer modes).
  • The configuration must be parseable by standard argument parsing frameworks (HfArgumentParser) while supporting Megatron-specific extensions.
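ROLL relies on transformers.HfArgumentParser to populate these dataclasses from the command line. The sketch below uses a minimal argparse-based stand-in (so it runs without the transformers dependency) to show how a single parser invocation can fill a dataclass that mixes standard training and parallelism fields; all names and defaults here are assumptions.

```python
import argparse
from dataclasses import dataclass, fields


@dataclass
class TrainingArguments:
    # Standard training hyperparameters and parallelism knobs live
    # side by side in one flat namespace.
    learning_rate: float = 1e-4
    per_device_train_batch_size: int = 1
    tensor_model_parallel_size: int = 1
    use_distributed_optimizer: bool = False


def parse(argv):
    """Build one CLI from the dataclass fields, HfArgumentParser-style."""
    parser = argparse.ArgumentParser()
    for f in fields(TrainingArguments):
        if f.type is bool:
            # Boolean fields become flags.
            parser.add_argument(f"--{f.name}", action="store_true",
                                default=f.default)
        else:
            parser.add_argument(f"--{f.name}", type=f.type, default=f.default)
    return TrainingArguments(**vars(parser.parse_args(argv)))


args = parse(["--learning_rate", "5e-5",
              "--tensor_model_parallel_size", "2",
              "--use_distributed_optimizer"])
```

The real HfArgumentParser additionally handles enums, optional types, and multiple dataclasses per invocation, but the flow is the same: one command line, one validated configuration object.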

Theoretical Basis

Configuration inheritance hierarchy:

DistributingParallelArguments
    |-- tensor_model_parallel_size
    |-- pipeline_model_parallel_size
    |-- expert_model_parallel_size
    |-- context_parallel_size
    |-- recompute_granularity: {full, selective}
    |-- moe_token_dispatcher_type: {allgather, alltoall}
    |-- fp8_recipe, fp8_param
    |
    v
MegatronArguments (extends above)
    |-- use_distributed_optimizer
    |-- overlap_grad_reduce
    |-- overlap_param_gather
    |-- optimizer_cpu_offload
    |-- sequence_packing
    |
    v
TrainingArguments (extends above + HFTrainingArguments)
    |-- learning_rate, batch_size, num_epochs, ...

Cross-parameter constraints enforced at __post_init__:

ASSERT overlap_param_gather IMPLIES use_distributed_optimizer
ASSERT overlap_param_gather IMPLIES overlap_grad_reduce
ASSERT variable_seq_lengths IMPLIES moe_token_dispatcher_type != "allgather"
IF bf16: accumulate_allreduce_grads_in_fp32 = True
IF pipeline_model_parallel_layout is set:
    virtual_pipeline_parallel_size = num_stages / pp_size

Total training batch size computation:

total_batch = micro_batch × grad_accum_steps × dp_size

where dp_size = world_size / (tp × pp × ep × cp)
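A worked instance of the two formulas above (the numbers are illustrative, not ROLL defaults):

```python
# 64 GPUs split across tensor, pipeline, expert, and context parallelism.
world_size = 64
tp, pp, ep, cp = 2, 4, 1, 1
micro_batch, grad_accum_steps = 2, 8

# Data-parallel size is whatever remains after the other dimensions.
dp_size = world_size // (tp * pp * ep * cp)            # 64 // 8 = 8

# Effective global batch per optimizer step.
total_batch = micro_batch * grad_accum_steps * dp_size  # 2 * 8 * 8 = 128
```

Note that world_size must be divisible by tp × pp × ep × cp, another constraint suited to __post_init__ validation.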
