Environment: Alibaba ROLL Megatron Training Environment
| Knowledge Sources | Details |
|---|---|
| Domains | Infrastructure, Distributed_Training |
| Last Updated | 2026-02-07 19:00 GMT |
Overview
NVIDIA Megatron-Core training backend environment with support for Tensor Parallelism, Pipeline Parallelism, Context Parallelism, and Expert Parallelism for large-scale model training.
Description
This environment provides the Megatron-Core distributed training backend for ROLL via the MCoreAdapter bridge package. Megatron-Core enables advanced model parallelism strategies including Tensor Parallelism (TP), Pipeline Parallelism (PP), Virtual Pipeline Parallelism (VPP), Context Parallelism (CP), Expert Parallelism (EP), and Sequence Parallelism (SP). The MCoreAdapter translates between HuggingFace model configs and Megatron-Core's TransformerConfig. Training supports distributed optimizer, gradient recomputation, overlapped gradient reduction, and MoE-specific optimizations.
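The parallelism degrees compose multiplicatively: the world size factors into TP × PP × CP × DP, so the data-parallel replica count is whatever remains after model parallelism. A minimal sketch of that arithmetic (the function name is illustrative, not a ROLL API):

```python
def data_parallel_size(world_size: int, tp: int, pp: int, cp: int = 1) -> int:
    """Data-parallel replica count left after model parallelism.

    Megatron-style parallelism partitions the world as
    world_size = TP * PP * CP * DP, so the product must divide evenly.
    """
    model_parallel = tp * pp * cp
    if world_size % model_parallel != 0:
        raise ValueError(
            f"world_size {world_size} is not divisible by TP*PP*CP = {model_parallel}"
        )
    return world_size // model_parallel

# Example: 64 GPUs with TP=4, PP=2, CP=2 leaves DP=4
print(data_parallel_size(64, tp=4, pp=2, cp=2))  # -> 4
```

Keeping this invariant in mind makes the batch-size divisibility errors below easier to diagnose.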
Usage
Use this environment when training with the megatron strategy backend. Required for sequence packing, context parallelism, and advanced parallelism configurations. The Megatron backend is the most performant training option for large-scale distributed training but is NVIDIA CUDA-only.
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| Hardware | NVIDIA GPU with CUDA | AMD ROCm and Ascend NPU not supported for Megatron |
| Multi-GPU | Recommended for TP/PP/CP | Single GPU possible but limited |
| Network | High-bandwidth interconnect | NVLink/InfiniBand for multi-node TP |
Dependencies
Python Packages
- `megatron-core` >= 0.13.0, < 0.14.0
- `transformers` >= 4.50.0
- `accelerate` >= 0.27.2
- `transformer-engine[pytorch]` == 2.2.0 (for torch 2.6.0)
- `flash-attn` (required for Ulysses context parallelism)
- `mcore_adapter` (ROLL's bridge package, installed via `./mcore_adapter`)
Environment Variables
- `NVTE_BWD_LAYERNORM_SM_MARGIN`: Transformer Engine SM margin (default `0`)
- `NVTE_TORCH_COMPILE`: Set to `0` to disable TE torch.compile (fixes BackendCompilerFailed error)
- `NVTE_FLASH_ATTN`: Set to `1` for Flash Attention with Context Parallel
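Transformer Engine reads these toggles from the process environment, so they must be in place before TE is initialized. A hedged sketch of setting them programmatically (values mirror the list above; in ROLL configs the same effect is typically achieved via `system_envs`):

```python
import os

# Set Transformer Engine toggles before transformer_engine is imported,
# since TE reads them at import/initialization time.
os.environ["NVTE_BWD_LAYERNORM_SM_MARGIN"] = "0"  # default SM margin
os.environ["NVTE_TORCH_COMPILE"] = "0"            # work around BackendCompilerFailed
os.environ["NVTE_FLASH_ATTN"] = "1"               # Flash Attention with Context Parallel
```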
Quick Install
```bash
# Install MCoreAdapter (includes the megatron-core dependency)
pip install -e ./mcore_adapter

# Install Transformer Engine (quote the spec so the shell does not glob the extras brackets)
pip install "transformer-engine[pytorch]==2.2.0"

# Install Flash Attention
pip install flash-attn
```
Code Evidence
MCoreAdapter requirements from `mcore_adapter/requirements.txt:1-3`:
```text
megatron-core>=0.13.0,<0.14.0
transformers>=4.50.0
accelerate>=0.27.2
```
Overlap param gather assertion from `roll/distributed/strategy/megatron_strategy.py:99`:
```python
assert not self.megatron_train_args.overlap_param_gather, "overlap_param_gather is not supported"
```
PyTorch version check for process group from `roll/utils/collective/pg_utils.py:60`:
```python
pg_options_param_name = "backend_options" if str(torch.__version__) >= "2.6" else "pg_options"
```
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `overlap_param_gather is not supported` | Unsupported Megatron option | Disable `overlap_param_gather` in config |
| `BackendCompilerFailed` | Transformer Engine compile issue | Set `NVTE_TORCH_COMPILE: '0'` in system_envs |
| Batch size assertion errors | Alignment mismatch | Ensure `rollout_batch_size * num_return_sequences` is divisible by `grad_accum * micro_batch * (world_size/TP/PP/CP)` |
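The divisibility constraint in the last row can be validated up front before launching a job. A minimal sketch (helper name and signature are illustrative, not a ROLL API):

```python
def check_batch_alignment(rollout_batch_size: int, num_return_sequences: int,
                          grad_accum: int, micro_batch: int,
                          world_size: int, tp: int, pp: int, cp: int = 1) -> bool:
    """Check that the global sample count splits evenly across gradient
    accumulation, micro-batches, and data-parallel ranks (world_size/TP/PP/CP)."""
    dp = world_size // (tp * pp * cp)
    total = rollout_batch_size * num_return_sequences
    divisor = grad_accum * micro_batch * dp
    return total % divisor == 0

# 128 prompts x 8 samples = 1024; grad_accum=4, micro_batch=2, DP=8 -> divisor 64
print(check_batch_alignment(128, 8, grad_accum=4, micro_batch=2,
                            world_size=32, tp=2, pp=2))  # -> True
```

Running such a check at config-validation time turns an opaque mid-training assertion into an immediate, actionable error.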
Compatibility Notes
- NVIDIA only: Megatron-Core requires NVIDIA CUDA. Not available on AMD ROCm or Ascend NPU.
- Sequence Packing: Only supported with megatron_strategy; requires alignment to `2 * CP_SIZE * TP_SIZE`.
- MoE Models: Use `moe_token_dispatcher_type`, `moe_grouped_gemm`, and `moe_layer_recompute` options.
- Checkpoint Format: Megatron format by default; convert to HuggingFace with `mcore_adapter/tools/convert.py`.
- FSDP2 Alternative: For non-NVIDIA or simpler setups, consider FSDP2 backend.
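The sequence-packing alignment noted above can be enforced by rounding packed lengths up to the `2 * CP_SIZE * TP_SIZE` boundary. A hedged sketch (function name is illustrative; ROLL handles this internally):

```python
def pad_to_packing_alignment(seq_len: int, cp_size: int, tp_size: int) -> int:
    """Round a packed sequence length up to the 2 * CP * TP boundary
    required by sequence packing under the megatron strategy."""
    align = 2 * cp_size * tp_size
    return ((seq_len + align - 1) // align) * align

# With CP=2, TP=4 the alignment is 16, so 1000 tokens pad to 1008
print(pad_to_packing_alignment(1000, cp_size=2, tp_size=4))  # -> 1008
```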