
Environment: NVIDIA NeMo-Aligner NeMo Framework GPU Environment

From Leeroopedia


Knowledge Sources
Domains Infrastructure, Deep_Learning, Distributed_Training
Last Updated 2026-02-07 22:00 GMT

Overview

Multi-GPU NVIDIA environment with CUDA, Python 3.9/3.10, PyTorch (NGC 24.07), Megatron-Core, NeMo Toolkit, Transformer Engine, and NVIDIA Apex for distributed LLM alignment training.

Description

This is the primary runtime environment for all NeMo-Aligner training workflows (SFT, Reward Model, PPO, DPO, REINFORCE). It is built on the NVIDIA NGC PyTorch 24.07 base container and includes the full CUDA toolkit, NCCL for distributed communication, Megatron-Core for model parallelism (tensor, pipeline, and data parallel), the NeMo Toolkit for NLP model management, Transformer Engine for optimized transformer operations, and NVIDIA Apex for fused optimizers. All operations require CUDA-capable NVIDIA GPUs with NCCL backend for distributed training.

Usage

Use this environment for all NeMo-Aligner workflows: Supervised Fine-Tuning, Reward Model Training, PPO RLHF, DPO, and REINFORCE. It is the mandatory prerequisite for every Implementation page in this wiki. The recommended deployment method is the official NVIDIA Docker container.

System Requirements

| Category | Requirement | Notes |
|---|---|---|
| OS | Linux (Ubuntu recommended) | NGC container uses Ubuntu base |
| Hardware | NVIDIA CUDA GPUs | Minimum 2 GPUs for testing; 8 nodes x 8 GPUs for full-scale training |
| GPU Memory | 40GB+ VRAM per GPU recommended | A100 40GB/80GB or H100 preferred for large models |
| Disk | 100GB+ SSD | For model checkpoints, datasets, and TRT-LLM engine caches |
| Network | High-bandwidth interconnect (NVLink/InfiniBand) | Required for multi-node distributed training |
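
As a quick sanity check against the table, a host can be compared to the documented minimums with a small helper (`check_host` is an illustrative name, not part of NeMo-Aligner; thresholds are copied from the rows above):

```python
def check_host(num_gpus, vram_gb, disk_gb):
    """Compare a host against the documented minimums.

    Illustrative helper; thresholds come from the table above:
    >= 2 GPUs for testing, >= 40 GB VRAM per GPU recommended,
    >= 100 GB disk for checkpoints and TRT-LLM engine caches.
    """
    problems = []
    if num_gpus < 2:
        problems.append("need at least 2 GPUs for testing")
    if vram_gb < 40:
        problems.append("40GB+ VRAM per GPU recommended")
    if disk_gb < 100:
        problems.append("100GB+ SSD recommended")
    return problems

# A 2x A100-40GB box with 200GB of free disk passes:
assert check_host(2, 40, 200) == []
```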

Dependencies

System Packages

  • CUDA Toolkit (via NGC container)
  • NCCL (for distributed GPU communication)
  • Git LFS (for large file handling)
  • MPI (`mpi4py`, for unit tests)

Python Packages

  • `torch` (PyTorch, via NGC 24.07 container)
  • `megatron-core` >= 0.8
  • `nemo_toolkit[nlp]`
  • `transformer-engine` (specific commit: `7d576ed`)
  • `apex` (NVIDIA Apex with `--cpp_ext --cuda_ext --fast_layer_norm --distributed_adam --deprecated_fused_adam`)
  • `triton` == 3.1.0
  • `Jinja2` ~= 3.1.4
  • `jsonlines`
  • `protobuf` == 4.24.4
  • `pynvml` == 11.5.3
  • `omegaconf` (Hydra configuration)
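
The pins above can be collected into a requirements fragment (illustrative; `transformer-engine` and `apex` are built from source with the commit and flags listed, so they are omitted here, and `torch` comes from the NGC container):

```text
megatron-core>=0.8
nemo_toolkit[nlp]
triton==3.1.0
Jinja2~=3.1.4
jsonlines
protobuf==4.24.4
pynvml==11.5.3
omegaconf
```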

Container

  • Base image: `nvcr.io/nvidia/pytorch:24.07-py3`
  • Recommended: `nvcr.io/nvidia/nemo:24.07` or `nvcr.io/nvidia/nemo:24.09`

Credentials

The following environment variables configure distributed training. `LOCAL_RANK` is required; the others either have defaults, are optional, or are set automatically:

  • `LOCAL_RANK`: Local GPU rank (set automatically by torchrun/mpirun)
  • `MASTER_ADDR`: Master node address for distributed init (defaults to `localhost`)
  • `MASTER_PORT`: Master node port for distributed init (defaults to `6000`)
  • `CUDA_VISIBLE_DEVICES`: Controls which GPUs are visible to the process
  • `NCCL_ALGO`: NCCL collective algorithm (set to `Tree` for PPO stability)
  • `NVTE_FLASH_ATTN`: Set to `0` to disable flash attention (optional, for debugging)
  • `DISABLE_TORCH_DEVICE_SET`: Automatically set to `1` by NeMo-Aligner to prevent device reassignment in TRT-LLM
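
A minimal sketch of how a launcher resolves these variables, assuming the documented defaults of `localhost` and `6000` (`resolve_dist_env` is a hypothetical helper, not part of NeMo-Aligner):

```python
import os

def resolve_dist_env(env=None):
    # Hypothetical helper: reads the variables listed above, applying the
    # documented defaults for MASTER_ADDR and MASTER_PORT.
    env = os.environ if env is None else env
    return {
        # Raises KeyError if the launcher (torchrun/mpirun) did not set it.
        "local_rank": int(env["LOCAL_RANK"]),
        "master_addr": env.get("MASTER_ADDR", "localhost"),
        "master_port": int(env.get("MASTER_PORT", "6000")),
    }

# With only LOCAL_RANK set (as torchrun would do), the defaults apply:
cfg = resolve_dist_env({"LOCAL_RANK": "0"})
```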

Quick Install

# Recommended: Use the official NGC container
docker run --gpus all -it --rm \
    --shm-size=8g \
    --ulimit memlock=-1 \
    --ulimit stack=67108864 \
    nvcr.io/nvidia/nemo:24.07

# Or build from the NeMo-Aligner Dockerfile
git clone https://github.com/NVIDIA/NeMo-Aligner.git
cd NeMo-Aligner
docker buildx build -t aligner:latest .

Code Evidence

Distributed initialization requiring NCCL backend from `nemo_aligner/testing/utils.py:69-71`:

torch.distributed.init_process_group(
    backend="nccl", world_size=Utils.world_size, rank=Utils.rank, store=store
)

LOCAL_RANK environment variable requirement from `nemo_aligner/testing/utils.py:49`:

return int(os.environ["LOCAL_RANK"])

DISABLE_TORCH_DEVICE_SET auto-configuration from `nemo_aligner/__init__.py:17`:

os.environ["DISABLE_TORCH_DEVICE_SET"] = "1"

GPU memory monitoring from `nemo_aligner/utils/utils.py:522-525`:

def log_memory(prefix):
    pyt = torch.cuda.memory_allocated() / (1024 ** 3)
    el = (torch.cuda.mem_get_info()[1] - torch.cuda.mem_get_info()[0]) / (1024 ** 3)
    logging.info(f"Mem Usage (GB) | {prefix} | pytorch:{pyt} total_occupied:{el}")
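
The GiB conversion in `log_memory` can be checked without a GPU; a sketch of the same arithmetic (`bytes_to_gb` is an illustrative name, and the sample values are made up):

```python
def bytes_to_gb(n):
    # Same conversion log_memory uses: bytes -> GiB via 1024**3.
    return n / (1024 ** 3)

# torch.cuda.mem_get_info() returns (free, total); occupied = total - free.
free, total = 30 * 1024 ** 3, 40 * 1024 ** 3
occupied_gb = bytes_to_gb(total - free)  # 10.0
```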

Dockerfile pynvml pin (breaking change in 12.0.0) from `Dockerfile:73-75`:

# TODO: This pinning of pynvml is only needed while on TRTLLM v13 since pynvml>=11.5.0
# but pynvml==12.0.0 contains a breaking change. The last known working version is 11.5.3
RUN pip install pynvml==11.5.3

Triton version pin from `Dockerfile:110-113`:

# TODO: While we are on Pytorch 24.07, we need to downgrade triton since 3.2.0 introduced
# a breaking change. This un-pinned requirement comes from mamba-ssm.
RUN pip install triton==3.1.0

Common Errors

| Error Message | Cause | Solution |
|---|---|---|
| `RuntimeError: LOCAL_RANK not set` | Missing distributed training env var | Launch with `torchrun` or set `LOCAL_RANK` manually |
| `NCCL error: unhandled system error` | NCCL communication failure | Set `export NCCL_ALGO=Tree` and verify network connectivity |
| `pynvml.NVMLError` | pynvml version incompatibility | Pin `pynvml==11.5.3` (12.0.0 has breaking changes) |
| `triton` compilation errors | triton 3.2.0 breaking change | Pin `triton==3.1.0` |
| `ImportError: megatron.core` | Megatron-Core not installed | Install from source: `pip install -e .` in the Megatron-LM directory |

Compatibility Notes

  • Container Required: The recommended deployment is via NVIDIA NGC containers. Building from source requires careful version pinning of all dependencies.
  • Multi-Node: Default configs assume 8 nodes x 8 GPUs (64 GPUs). Scale down by adjusting `trainer.num_nodes` and `trainer.devices` in YAML configs.
  • Precision: BF16 with `megatron_amp_O2=True` is the standard precision mode. FP16 is not commonly used.
  • Python Version: Officially supports Python 3.9 and 3.10.
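
For example, scaling the default 64-GPU layout down to a single 2-GPU node means overriding these trainer fields (illustrative YAML fragment; the same keys can be overridden on the Hydra command line):

```yaml
trainer:
  num_nodes: 1   # down from the default 8
  devices: 2     # down from the default 8 GPUs per node
```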
