Environment: Alibaba ROLL Megatron Training Environment
| Knowledge Sources | Details |
|---|---|
| Domains | Infrastructure, Distributed_Training |
| Last Updated | 2026-02-07 19:00 GMT |
Overview
NVIDIA Megatron-Core training backend environment with support for Tensor Parallelism, Pipeline Parallelism, Context Parallelism, and Expert Parallelism for large-scale model training.
Description
This environment provides the Megatron-Core distributed training backend for ROLL via the MCoreAdapter bridge package. Megatron-Core enables advanced model parallelism strategies including Tensor Parallelism (TP), Pipeline Parallelism (PP), Virtual Pipeline Parallelism (VPP), Context Parallelism (CP), Expert Parallelism (EP), and Sequence Parallelism (SP). The MCoreAdapter translates between HuggingFace model configs and Megatron-Core's TransformerConfig. Training supports distributed optimizer, gradient recomputation, overlapped gradient reduction, and MoE-specific optimizations.
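The parallelism degrees compose multiplicatively: the world size factors into TP × PP × CP × DP, so the data-parallel replica count is whatever remains after model parallelism. A minimal sketch of that arithmetic (the function name is illustrative, not a ROLL API):

```python
def data_parallel_size(world_size: int, tp: int, pp: int, cp: int = 1) -> int:
    """Data-parallel replica count left after model parallelism.

    Megatron-style parallelism partitions the world as
    world_size = TP * PP * CP * DP, so the product must divide evenly.
    """
    model_parallel = tp * pp * cp
    if world_size % model_parallel != 0:
        raise ValueError(
            f"world_size {world_size} is not divisible by TP*PP*CP = {model_parallel}"
        )
    return world_size // model_parallel

# Example: 64 GPUs with TP=4, PP=2, CP=2 leaves DP=4
print(data_parallel_size(64, tp=4, pp=2, cp=2))  # -> 4
```

Keeping this invariant in mind makes the batch-size divisibility errors below easier to diagnose.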
Usage
Use this environment when training with the megatron strategy backend. Required for sequence packing, context parallelism, and advanced parallelism configurations. The Megatron backend is the most performant training option for large-scale distributed training but is NVIDIA CUDA-only.
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| Hardware | NVIDIA GPU with CUDA | AMD ROCm and Ascend NPU not supported for Megatron |
| Multi-GPU | Recommended for TP/PP/CP | Single GPU possible but limited |
| Network | High-bandwidth interconnect | NVLink/InfiniBand for multi-node TP |
Dependencies
Python Packages
- `megatron-core` >= 0.13.0, < 0.14.0
- `transformers` >= 4.50.0
- `accelerate` >= 0.27.2
- `transformer-engine[pytorch]` == 2.2.0 (for torch 2.6.0)
- `flash-attn` (required for Ulysses context parallelism)
- `mcore_adapter` (ROLL's bridge package, installed via `./mcore_adapter`)
Environment Variables
- `NVTE_BWD_LAYERNORM_SM_MARGIN`: Transformer Engine SM margin (default `0`)
- `NVTE_TORCH_COMPILE`: Set to `0` to disable TE torch.compile (fixes BackendCompilerFailed error)
- `NVTE_FLASH_ATTN`: Set to `1` for Flash Attention with Context Parallel
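Transformer Engine reads these toggles from the process environment, so they must be in place before TE is initialized. A hedged sketch of setting them programmatically (values mirror the list above; in ROLL configs the same effect is typically achieved via `system_envs`):

```python
import os

# Set Transformer Engine toggles before transformer_engine is imported,
# since TE reads them at import/initialization time.
os.environ["NVTE_BWD_LAYERNORM_SM_MARGIN"] = "0"  # default SM margin
os.environ["NVTE_TORCH_COMPILE"] = "0"            # work around BackendCompilerFailed
os.environ["NVTE_FLASH_ATTN"] = "1"               # Flash Attention with Context Parallel
```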
Quick Install
```bash
# Install MCoreAdapter (includes the megatron-core dependency)
pip install -e ./mcore_adapter

# Install Transformer Engine (quote the spec so the shell does not glob the extras brackets)
pip install "transformer-engine[pytorch]==2.2.0"

# Install Flash Attention
pip install flash-attn
```
Code Evidence
MCoreAdapter requirements from `mcore_adapter/requirements.txt:1-3`:
```text
megatron-core>=0.13.0,<0.14.0
transformers>=4.50.0
accelerate>=0.27.2
```
Overlap param gather assertion from `roll/distributed/strategy/megatron_strategy.py:99`:
```python
assert not self.megatron_train_args.overlap_param_gather, "overlap_param_gather is not supported"
```
PyTorch version check for process group from `roll/utils/collective/pg_utils.py:60`:
```python
pg_options_param_name = "backend_options" if str(torch.__version__) >= "2.6" else "pg_options"
```
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `overlap_param_gather is not supported` | Unsupported Megatron option | Disable `overlap_param_gather` in config |
| `BackendCompilerFailed` | Transformer Engine compile issue | Set `NVTE_TORCH_COMPILE: '0'` in system_envs |
| Batch size assertion errors | Alignment mismatch | Ensure `rollout_batch_size * num_return_sequences` is divisible by `grad_accum * micro_batch * (world_size/TP/PP/CP)` |
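The divisibility constraint in the last row can be validated up front before launching a job. A minimal sketch (helper name and signature are illustrative, not a ROLL API):

```python
def check_batch_alignment(rollout_batch_size: int, num_return_sequences: int,
                          grad_accum: int, micro_batch: int,
                          world_size: int, tp: int, pp: int, cp: int = 1) -> bool:
    """Check that the global sample count splits evenly across gradient
    accumulation, micro-batches, and data-parallel ranks (world_size/TP/PP/CP)."""
    dp = world_size // (tp * pp * cp)
    total = rollout_batch_size * num_return_sequences
    divisor = grad_accum * micro_batch * dp
    return total % divisor == 0

# 128 prompts x 8 samples = 1024; grad_accum=4, micro_batch=2, DP=8 -> divisor 64
print(check_batch_alignment(128, 8, grad_accum=4, micro_batch=2,
                            world_size=32, tp=2, pp=2))  # -> True
```

Running such a check at config-validation time turns an opaque mid-training assertion into an immediate, actionable error.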
Compatibility Notes
- NVIDIA only: Megatron-Core requires NVIDIA CUDA. Not available on AMD ROCm or Ascend NPU.
- Sequence Packing: Only supported with megatron_strategy; requires alignment to `2 * CP_SIZE * TP_SIZE`.
- MoE Models: Use `moe_token_dispatcher_type`, `moe_grouped_gemm`, and `moe_layer_recompute` options.
- Checkpoint Format: Megatron format by default; convert to HuggingFace with `mcore_adapter/tools/convert.py`.
- FSDP2 Alternative: For non-NVIDIA or simpler setups, consider FSDP2 backend.
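The sequence-packing alignment noted above can be enforced by rounding packed lengths up to the `2 * CP_SIZE * TP_SIZE` boundary. A hedged sketch (function name is illustrative; ROLL handles this internally):

```python
def pad_to_packing_alignment(seq_len: int, cp_size: int, tp_size: int) -> int:
    """Round a packed sequence length up to the 2 * CP * TP boundary
    required by sequence packing under the megatron strategy."""
    align = 2 * cp_size * tp_size
    return ((seq_len + align - 1) // align) * align

# With CP=2, TP=4 the alignment is 16, so 1000 tokens pad to 1008
print(pad_to_packing_alignment(1000, cp_size=2, tp_size=4))  # -> 1008
```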