
Environment:Alibaba ROLL Megatron Training Environment

From Leeroopedia


Knowledge Sources
Domains: Infrastructure, Distributed_Training
Last Updated: 2026-02-07 19:00 GMT

Overview

NVIDIA Megatron-Core training backend environment with support for Tensor Parallelism, Pipeline Parallelism, Context Parallelism, and Expert Parallelism for large-scale model training.

Description

This environment provides the Megatron-Core distributed training backend for ROLL via the MCoreAdapter bridge package. Megatron-Core enables advanced model parallelism strategies including Tensor Parallelism (TP), Pipeline Parallelism (PP), Virtual Pipeline Parallelism (VPP), Context Parallelism (CP), Expert Parallelism (EP), and Sequence Parallelism (SP). The MCoreAdapter translates between HuggingFace model configs and Megatron-Core's TransformerConfig. Training supports distributed optimizer, gradient recomputation, overlapped gradient reduction, and MoE-specific optimizations.
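To illustrate the kind of translation the bridge performs, the sketch below maps a few standard HuggingFace config fields onto their Megatron-Core `TransformerConfig` counterparts. The function and its field choices are an illustrative assumption, not the actual MCoreAdapter API:

```python
# Illustrative sketch only: MCoreAdapter's real API differs. The HF-side keys
# follow common configs (e.g. LlamaConfig); the Megatron-side names follow
# TransformerConfig conventions.
def hf_to_mcore_kwargs(hf_config: dict) -> dict:
    return {
        "num_layers": hf_config["num_hidden_layers"],
        "hidden_size": hf_config["hidden_size"],
        "num_attention_heads": hf_config["num_attention_heads"],
        "ffn_hidden_size": hf_config["intermediate_size"],
        # Grouped-query attention: fall back to MHA when the key is absent.
        "num_query_groups": hf_config.get(
            "num_key_value_heads", hf_config["num_attention_heads"]
        ),
    }
```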

Usage

Use this environment when training with the `megatron` strategy backend. It is required for sequence packing, context parallelism, and other advanced parallelism configurations. The Megatron backend is the most performant option for large-scale distributed training, but it is NVIDIA CUDA-only.
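A minimal parallelism layout for such a run might look like the fragment below. The key names are assumptions for illustration, not ROLL's verified config schema; the data-parallel arithmetic, however, is the standard Megatron relationship:

```python
# Assumed key names for illustration; consult ROLL's actual config schema.
strategy_args = {
    "strategy_name": "megatron_train",
    "tensor_model_parallel_size": 2,
    "pipeline_model_parallel_size": 2,
    "context_parallel_size": 1,
    "expert_model_parallel_size": 1,
    "sequence_parallel": True,
}

# Data parallelism is what remains after model parallelism is carved out.
world_size = 8
dp_size = world_size // (
    strategy_args["tensor_model_parallel_size"]
    * strategy_args["pipeline_model_parallel_size"]
    * strategy_args["context_parallel_size"]
)
```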

System Requirements

  • Hardware: NVIDIA GPU with CUDA (AMD ROCm and Ascend NPU are not supported for Megatron)
  • Multi-GPU: recommended for TP/PP/CP (a single GPU is possible but limited)
  • Network: high-bandwidth interconnect (NVLink/InfiniBand for multi-node TP)

Dependencies

Python Packages

  • `megatron-core` >= 0.13.0, < 0.14.0
  • `transformers` >= 4.50.0
  • `accelerate` >= 0.27.2
  • `transformer-engine[pytorch]` == 2.2.0 (for torch 2.6.0)
  • `flash-attn` (required for Ulysses context parallelism)
  • `mcore_adapter` (ROLL's bridge package, installed via `./mcore_adapter`)

Environment Variables

  • `NVTE_BWD_LAYERNORM_SM_MARGIN`: Transformer Engine SM margin (default `0`)
  • `NVTE_TORCH_COMPILE`: Set to `0` to disable TE torch.compile (fixes BackendCompilerFailed error)
  • `NVTE_FLASH_ATTN`: Set to `1` for Flash Attention with Context Parallel
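These variables must be in the process environment before Transformer Engine is imported. One way to guarantee that (a sketch, not ROLL's own mechanism) is to set them at the top of the launcher script:

```python
import os

# Set TE-related variables before any `import transformer_engine` happens.
# Values are strings because environment variables are always text.
os.environ.setdefault("NVTE_TORCH_COMPILE", "0")         # avoid BackendCompilerFailed
os.environ.setdefault("NVTE_FLASH_ATTN", "1")            # Flash Attention with CP
os.environ.setdefault("NVTE_BWD_LAYERNORM_SM_MARGIN", "0")
```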

Quick Install

# Install MCoreAdapter (includes megatron-core dependency)
pip install -e ./mcore_adapter

# Install Transformer Engine
pip install "transformer-engine[pytorch]==2.2.0"  # quoted so shells like zsh do not glob the brackets

# Install Flash Attention (its docs recommend --no-build-isolation when building from source)
pip install flash-attn --no-build-isolation
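After installation, a quick sanity check can confirm the stack is importable before launching a job. This is a hedged sketch using the usual module names for these packages:

```python
import importlib.util

def installed(module_name: str) -> bool:
    """Return True if `module_name` is resolvable without fully importing it."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except ModuleNotFoundError:
        # find_spec on a dotted name raises if a parent package is missing.
        return False

for mod in ("megatron.core", "transformer_engine", "flash_attn", "mcore_adapter"):
    print(f"{mod}: {'ok' if installed(mod) else 'MISSING'}")
```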

Code Evidence

MCoreAdapter requirements from `mcore_adapter/requirements.txt:1-3`:

megatron-core>=0.13.0,<0.14.0
transformers>=4.50.0
accelerate>=0.27.2

Overlap param gather assertion from `roll/distributed/strategy/megatron_strategy.py:99`:

assert not self.megatron_train_args.overlap_param_gather, "overlap_param_gather is not supported"

PyTorch version check for process group from `roll/utils/collective/pg_utils.py:60`:

pg_options_param_name = "backend_options" if str(torch.__version__) >= "2.6" else "pg_options"
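Note that the string comparison above is lexicographic, so a hypothetical `"2.10"` would sort before `"2.6"`. A numeric variant (an illustration, not ROLL's code) avoids that pitfall:

```python
def pg_kwarg_name(torch_version: str) -> str:
    # PyTorch >= 2.6 renamed the `pg_options` keyword to `backend_options`.
    # Compare (major, minor) numerically; strips local suffixes like "+cu124".
    parts = torch_version.split("+")[0].split(".")
    major, minor = int(parts[0]), int(parts[1])
    return "backend_options" if (major, minor) >= (2, 6) else "pg_options"
```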

Common Errors

  • `overlap_param_gather is not supported`: unsupported Megatron option. Disable `overlap_param_gather` in the config.
  • `BackendCompilerFailed`: Transformer Engine torch.compile failure. Set `NVTE_TORCH_COMPILE: '0'` in `system_envs`.
  • Batch size assertion errors: alignment mismatch. Ensure `rollout_batch_size * num_return_sequences` is divisible by `grad_accum * micro_batch * (world_size / TP / PP / CP)`.
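The divisibility rule in the last row can be checked before launch. The helper below is an illustration of that arithmetic, not ROLL's own validation code:

```python
def batch_is_aligned(rollout_batch_size: int, num_return_sequences: int,
                     grad_accum: int, micro_batch: int,
                     world_size: int, tp: int, pp: int, cp: int) -> bool:
    # Data-parallel replicas remaining after model parallelism.
    dp = world_size // (tp * pp * cp)
    total_samples = rollout_batch_size * num_return_sequences
    samples_per_step = grad_accum * micro_batch * dp
    return total_samples % samples_per_step == 0
```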

Compatibility Notes

  • NVIDIA only: Megatron-Core requires NVIDIA CUDA. Not available on AMD ROCm or Ascend NPU.
  • Sequence Packing: Only supported with megatron_strategy; requires alignment to `2 * CP_SIZE * TP_SIZE`.
  • MoE Models: Use `moe_token_dispatcher_type`, `moe_grouped_gemm`, and `moe_layer_recompute` options.
  • Checkpoint Format: Megatron format by default; convert to HuggingFace with `mcore_adapter/tools/convert.py`.
  • FSDP2 Alternative: For non-NVIDIA or simpler setups, consider FSDP2 backend.
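The sequence-packing note above means packed lengths must round up to a multiple of `2 * CP_SIZE * TP_SIZE`. A sketch of that padding calculation (illustrative, not ROLL's implementation):

```python
def pad_to_packing_multiple(seq_len: int, cp_size: int, tp_size: int) -> int:
    # Sequence packing under the megatron strategy requires alignment
    # to 2 * CP_SIZE * TP_SIZE; round seq_len up to the next multiple.
    multiple = 2 * cp_size * tp_size
    return ((seq_len + multiple - 1) // multiple) * multiple
```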
