
Environment: NVIDIA NeMo-Aligner NeMo Framework GPU Environment

From Leeroopedia


Knowledge Sources
Domains Infrastructure, Deep_Learning, Distributed_Training
Last Updated 2026-02-07 22:00 GMT

Overview

Multi-GPU NVIDIA environment with CUDA, Python 3.9/3.10, PyTorch (NGC 24.07), Megatron-Core, NeMo Toolkit, Transformer Engine, and NVIDIA Apex for distributed LLM alignment training.

Description

This is the primary runtime environment for all NeMo-Aligner training workflows (SFT, Reward Model, PPO, DPO, REINFORCE). It is built on the NVIDIA NGC PyTorch 24.07 base container and includes the full CUDA toolkit, NCCL for distributed communication, Megatron-Core for model parallelism (tensor, pipeline, and data parallel), the NeMo Toolkit for NLP model management, Transformer Engine for optimized transformer operations, and NVIDIA Apex for fused optimizers. All operations require CUDA-capable NVIDIA GPUs with NCCL backend for distributed training.

Usage

Use this environment for all NeMo-Aligner workflows: Supervised Fine-Tuning, Reward Model Training, PPO RLHF, DPO, and REINFORCE. It is the mandatory prerequisite for every Implementation page in this wiki. The recommended deployment method is the official NVIDIA Docker container.

System Requirements

| Category | Requirement | Notes |
|---|---|---|
| OS | Linux (Ubuntu recommended) | NGC container uses Ubuntu base |
| Hardware | NVIDIA CUDA GPUs | Minimum 2 GPUs for testing; 8 nodes x 8 GPUs for full-scale training |
| GPU Memory | 40GB+ VRAM per GPU recommended | A100 40GB/80GB or H100 preferred for large models |
| Disk | 100GB+ SSD | For model checkpoints, datasets, and TRT-LLM engine caches |
| Network | High-bandwidth interconnect (NVLink/InfiniBand) | Required for multi-node distributed training |
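
As a quick sanity check against the table, a host can be compared to the documented minimums with a small helper (`check_host` is an illustrative name, not part of NeMo-Aligner; thresholds are copied from the rows above):

```python
def check_host(num_gpus, vram_gb, disk_gb):
    """Compare a host against the documented minimums.

    Illustrative helper; thresholds come from the table above:
    >= 2 GPUs for testing, >= 40 GB VRAM per GPU recommended,
    >= 100 GB disk for checkpoints and TRT-LLM engine caches.
    """
    problems = []
    if num_gpus < 2:
        problems.append("need at least 2 GPUs for testing")
    if vram_gb < 40:
        problems.append("40GB+ VRAM per GPU recommended")
    if disk_gb < 100:
        problems.append("100GB+ SSD recommended")
    return problems

# A 2x A100-40GB box with 200GB of free disk passes:
assert check_host(2, 40, 200) == []
```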

Dependencies

System Packages

  • CUDA Toolkit (via NGC container)
  • NCCL (for distributed GPU communication)
  • Git LFS (for large file handling)
  • MPI (`mpi4py`, for unit tests)

Python Packages

  • `torch` (PyTorch, via NGC 24.07 container)
  • `megatron-core` >= 0.8
  • `nemo_toolkit[nlp]`
  • `transformer-engine` (specific commit: `7d576ed`)
  • `apex` (NVIDIA Apex with `--cpp_ext --cuda_ext --fast_layer_norm --distributed_adam --deprecated_fused_adam`)
  • `triton` == 3.1.0
  • `Jinja2` ~= 3.1.4
  • `jsonlines`
  • `protobuf` == 4.24.4
  • `pynvml` == 11.5.3
  • `omegaconf` (Hydra configuration)
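
The pins above can be collected into a requirements fragment (illustrative; `transformer-engine` and `apex` are built from source with the commit and flags listed, so they are omitted here, and `torch` comes from the NGC container):

```text
megatron-core>=0.8
nemo_toolkit[nlp]
triton==3.1.0
Jinja2~=3.1.4
jsonlines
protobuf==4.24.4
pynvml==11.5.3
omegaconf
```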

Container

  • Base image: `nvcr.io/nvidia/pytorch:24.07-py3`
  • Recommended: `nvcr.io/nvidia/nemo:24.07` or `nvcr.io/nvidia/nemo:24.09`

Credentials

The following environment variables configure distributed training. `LOCAL_RANK` is required; the others either have defaults, are optional, or are set automatically:

  • `LOCAL_RANK`: Local GPU rank (set automatically by torchrun/mpirun)
  • `MASTER_ADDR`: Master node address for distributed init (defaults to `localhost`)
  • `MASTER_PORT`: Master node port for distributed init (defaults to `6000`)
  • `CUDA_VISIBLE_DEVICES`: Controls which GPUs are visible to the process
  • `NCCL_ALGO`: NCCL collective algorithm (set to `Tree` for PPO stability)
  • `NVTE_FLASH_ATTN`: Set to `0` to disable flash attention (optional, for debugging)
  • `DISABLE_TORCH_DEVICE_SET`: Automatically set to `1` by NeMo-Aligner to prevent device reassignment in TRT-LLM
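
A minimal sketch of how a launcher resolves these variables, assuming the documented defaults of `localhost` and `6000` (`resolve_dist_env` is a hypothetical helper, not part of NeMo-Aligner):

```python
import os

def resolve_dist_env(env=None):
    # Hypothetical helper: reads the variables listed above, applying the
    # documented defaults for MASTER_ADDR and MASTER_PORT.
    env = os.environ if env is None else env
    return {
        # Raises KeyError if the launcher (torchrun/mpirun) did not set it.
        "local_rank": int(env["LOCAL_RANK"]),
        "master_addr": env.get("MASTER_ADDR", "localhost"),
        "master_port": int(env.get("MASTER_PORT", "6000")),
    }

# With only LOCAL_RANK set (as torchrun would do), the defaults apply:
cfg = resolve_dist_env({"LOCAL_RANK": "0"})
```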

Quick Install

# Recommended: Use the official NGC container
docker run --gpus all -it --rm \
    --shm-size=8g \
    --ulimit memlock=-1 \
    --ulimit stack=67108864 \
    nvcr.io/nvidia/nemo:24.07

# Or build from the NeMo-Aligner Dockerfile
git clone https://github.com/NVIDIA/NeMo-Aligner.git
cd NeMo-Aligner
docker buildx build -t aligner:latest .

Code Evidence

Distributed initialization requiring NCCL backend from `nemo_aligner/testing/utils.py:69-71`:

torch.distributed.init_process_group(
    backend="nccl", world_size=Utils.world_size, rank=Utils.rank, store=store
)

LOCAL_RANK environment variable requirement from `nemo_aligner/testing/utils.py:49`:

return int(os.environ["LOCAL_RANK"])

DISABLE_TORCH_DEVICE_SET auto-configuration from `nemo_aligner/__init__.py:17`:

os.environ["DISABLE_TORCH_DEVICE_SET"] = "1"

GPU memory monitoring from `nemo_aligner/utils/utils.py:522-525`:

def log_memory(prefix):
    pyt = torch.cuda.memory_allocated() / (1024 ** 3)
    el = (torch.cuda.mem_get_info()[1] - torch.cuda.mem_get_info()[0]) / (1024 ** 3)
    logging.info(f"Mem Usage (GB) | {prefix} | pytorch:{pyt} total_occupied:{el}")
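
The GiB conversion in `log_memory` can be checked without a GPU; a sketch of the same arithmetic (`bytes_to_gb` is an illustrative name, and the sample values are made up):

```python
def bytes_to_gb(n):
    # Same conversion log_memory uses: bytes -> GiB via 1024**3.
    return n / (1024 ** 3)

# torch.cuda.mem_get_info() returns (free, total); occupied = total - free.
free, total = 30 * 1024 ** 3, 40 * 1024 ** 3
occupied_gb = bytes_to_gb(total - free)  # 10.0
```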

Dockerfile pynvml pin (breaking change in 12.0.0) from `Dockerfile:73-75`:

# TODO: This pinning of pynvml is only needed while on TRTLLM v13 since pynvml>=11.5.0
# but pynvml==12.0.0 contains a breaking change. The last known working version is 11.5.3
RUN pip install pynvml==11.5.3

Triton version pin from `Dockerfile:110-113`:

# TODO: While we are on Pytorch 24.07, we need to downgrade triton since 3.2.0 introduced
# a breaking change. This un-pinned requirement comes from mamba-ssm.
RUN pip install triton==3.1.0

Common Errors

| Error Message | Cause | Solution |
|---|---|---|
| `RuntimeError: LOCAL_RANK not set` | Missing distributed training env var | Launch with `torchrun` or set `LOCAL_RANK` manually |
| `NCCL error: unhandled system error` | NCCL communication failure | Set `export NCCL_ALGO=Tree` and verify network connectivity |
| `pynvml.NVMLError` | pynvml version incompatibility | Pin `pynvml==11.5.3` (12.0.0 has breaking changes) |
| `triton` compilation errors | triton 3.2.0 breaking change | Pin `triton==3.1.0` |
| `ImportError: megatron.core` | Megatron-Core not installed | Install from source: `pip install -e .` in the Megatron-LM directory |

Compatibility Notes

  • Container Required: The recommended deployment is via NVIDIA NGC containers. Building from source requires careful version pinning of all dependencies.
  • Multi-Node: Default configs assume 8 nodes x 8 GPUs (64 GPUs). Scale down by adjusting `trainer.num_nodes` and `trainer.devices` in YAML configs.
  • Precision: BF16 with `megatron_amp_O2=True` is the standard precision mode. FP16 is not commonly used.
  • Python Version: Officially supports Python 3.9 and 3.10.
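
For example, scaling the default 64-GPU layout down to a single 2-GPU node means overriding these trainer fields (illustrative YAML fragment; the same keys can be overridden on the Hydra command line):

```yaml
trainer:
  num_nodes: 1   # down from the default 8
  devices: 2     # down from the default 8 GPUs per node
```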
