
Environment: CarperAI Trlx NeMo Megatron

From Leeroopedia


Knowledge Sources
Domains: Infrastructure, Distributed_Training, NLP
Last Updated: 2026-02-07 18:00 GMT

Overview

NVIDIA NeMo Toolkit r1.15.0 with Megatron-LM and Apex for large-scale model-parallel RLHF training on multi-node GPU clusters.

Description

This environment provides the runtime context for NeMo-based trainers and models in trlx. It extends the base Python/Accelerate environment with NVIDIA's NeMo Toolkit for Megatron-style model parallelism (tensor parallel, pipeline parallel), NVIDIA Apex for fused kernels and mixed-precision training, and the Megatron batch sampler for distributed data loading. NeMo trainers are registered as optional backends; if NeMo is not installed, stub trainers are registered in their place that raise ImportError when invoked.

Usage

Use this environment when running NeMo-based PPO, ILQL, or SFT training at scale (1.3B to 65B+ parameters) with tensor and pipeline parallelism. It is required for all NeMo model variants (NeMoPPOModel, NeMoILQLModel, NeMoSFTModel) and their corresponding trainers. This is a separate environment from the Accelerate-based stack and typically runs on HPC clusters with SLURM.
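As a sketch of how one of these trainers is selected, the snippet below mirrors trlx's YAML config layout as a plain Python dict. The trainer names come from the model variants listed above; the checkpoint path and `megatron_cfg` filename are hypothetical placeholders, not values from the trlx repository.

```python
# Minimal sketch of a trlx-style config selecting a NeMo trainer.
# The paths below are illustrative placeholders.
config = {
    "train": {
        "trainer": "NeMoILQLTrainer",  # or NeMoPPOTrainer / NeMoSFTTrainer
        "trainer_kwargs": {
            # Directory of an un-tarred .nemo checkpoint (see Compatibility Notes)
            "pretrained_model": "/checkpoints/megatron_gpt/",
            # NeMo/Megatron settings: tensor/pipeline parallel sizes, precision, ...
            "megatron_cfg": "megatron_20b.yaml",
        },
    },
}
```

In the real config the same fields live in a YAML file; the dict form is shown only to make the field layout explicit.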

System Requirements

Category | Requirement | Notes
OS | Linux | HPC cluster with SLURM recommended
Python | 3.9 - 3.10 | NeMo r1.15.0 compatibility
Hardware | Multi-GPU NVIDIA (A100/H100 recommended) | Tensor parallelism requires NVLink
CUDA | 11.7 - 11.8 | Required by Apex and NeMo
Disk | 100GB+ | Large model checkpoints in `.nemo` format
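The Python and CUDA rows above can be verified programmatically. A minimal sketch using only the standard library (the torch probe is optional and guarded, since torch may not be installed yet):

```python
import sys

def python_ok(version=sys.version_info):
    """True if the interpreter is in the 3.9 - 3.10 range NeMo r1.15.0 supports."""
    return (3, 9) <= (version[0], version[1]) <= (3, 10)

def cuda_version():
    """CUDA version torch was built against, or None if torch is unavailable."""
    try:
        import torch
        return torch.version.cuda  # e.g. "11.7"
    except ImportError:
        return None
```

Running `python_ok()` before installing NeMo catches the most common setup mistake (a too-new interpreter) early.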

Dependencies

System Packages

  • CUDA Toolkit 11.7 or 11.8
  • NVIDIA drivers compatible with CUDA 11.x
  • NCCL for multi-node communication

Python Packages (Core)

  • `nemo_toolkit[all]` pinned to `r1.15.0` (NVIDIA NeMo; installed from source, see below)
  • `apex` (NVIDIA Apex with CUDA extensions)
  • `torch` >= 1.13.0 (with CUDA)
  • `transformers` >= 4.27.1
  • `einops` >= 0.4.1
  • `wandb` >= 0.13.5
  • `omegaconf` (NeMo configuration)
  • `pytorch-lightning` (NeMo training backend)
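The version floors above can be checked against the active environment with `importlib.metadata`. A small sketch, assuming the PyPI names listed; the naive tuple comparison ignores pre-release tags and local build suffixes like `+cu117`:

```python
from importlib import metadata

# Minimum versions from the list above (PyPI package names)
PINS = {
    "torch": "1.13.0",
    "transformers": "4.27.1",
    "einops": "0.4.1",
    "wandb": "0.13.5",
}

def parse_version(v):
    """Turn '1.13.0+cu117' into (1, 13, 0) for a plain tuple comparison."""
    parts = []
    for piece in v.split("+")[0].split("."):
        if not piece.isdigit():
            break  # stop at pre-release tags like '1rc1'
        parts.append(int(piece))
    return tuple(parts)

def unsatisfied(pins=PINS):
    """Names of pinned packages that are missing or older than required."""
    bad = []
    for name, minimum in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            bad.append(name)
            continue
        if parse_version(installed) < parse_version(minimum):
            bad.append(name)
    return bad
```

An empty return value from `unsatisfied()` means the core Python pins are met; NeMo and Apex still need the source installs described below.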

Build from Source (Required)

Apex must be built from source with CUDA extensions:

git clone https://github.com/NVIDIA/apex/
cd apex
pip install -v --disable-pip-version-check --no-cache-dir \
  --global-option="--cpp_ext" \
  --global-option="--cuda_ext" \
  --global-option="--fast_layer_norm" \
  --global-option="--distributed_adam" \
  --global-option="--deprecated_fused_adam" ./

NeMo must be installed from source at the correct version:

git clone https://github.com/NVIDIA/NeMo/
cd NeMo
git checkout r1.15.0
pip install '.[all]'

Credentials

  • `WANDB_API_KEY`: Weights & Biases API key for experiment tracking
  • `HF_TOKEN`: HuggingFace API token for accessing gated models and checkpoints
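Since a missing credential typically surfaces only mid-run (at the first W&B log or gated-model download), it can help to fail fast at startup. A minimal sketch, not part of trlx itself:

```python
import os

REQUIRED = ["WANDB_API_KEY", "HF_TOKEN"]

def require_env(names=REQUIRED):
    """Raise a clear error when a required credential is unset or empty."""
    missing = [n for n in names if not os.environ.get(n)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
```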

Distributed Training Variables

  • `WORLD_SIZE`: Total number of processes (set by SLURM/torchrun)
  • `LOCAL_RANK`: Local GPU rank (auto-set)
  • `RANK`: Global process rank (auto-set)
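When these variables are set by SLURM or torchrun, user code usually just reads them. A small sketch showing the conventional single-process defaults used when launching outside a distributed launcher:

```python
import os

def dist_info():
    """Read launcher-provided rank variables, with single-process fallbacks."""
    return {
        "world_size": int(os.environ.get("WORLD_SIZE", "1")),
        "rank": int(os.environ.get("RANK", "0")),
        "local_rank": int(os.environ.get("LOCAL_RANK", "0")),
    }
```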

Quick Install

# 1. Create conda environment
conda env create -f env.yaml

# 2. Install NeMo r1.15.0
git clone https://github.com/NVIDIA/NeMo/ && cd NeMo
git checkout r1.15.0
pip install '.[all]'

# 3. Build Apex from source
git clone https://github.com/NVIDIA/apex/ && cd apex
pip install -v --disable-pip-version-check --no-cache-dir \
  --global-option="--cpp_ext" --global-option="--cuda_ext" \
  --global-option="--fast_layer_norm" --global-option="--distributed_adam" \
  --global-option="--deprecated_fused_adam" ./

# 4. Install trlx
pip install git+https://github.com/CarperAI/trlx.git

Code Evidence

NeMo optional import handling from `trlx/utils/loading.py:14-28`:

try:
    from trlx.trainer.nemo_ilql_trainer import NeMoILQLTrainer
    from trlx.trainer.nemo_ppo_trainer import NeMoPPOTrainer
    from trlx.trainer.nemo_sft_trainer import NeMoSFTTrainer
except ImportError:
    # NeMo is absent: register stubs that raise a clear error when invoked
    def _trainers_unavailble(names: List[str]):
        def log_error(*args, **kwargs):
            raise ImportError("NeMo is not installed.")
        for name in names:
            register_trainer(name)(log_error)

    _trainers_unavailble(["NeMoILQLTrainer", "NeMoPPOTrainer", "NeMoSFTTrainer"])

NeMo model imports from `trlx/models/modeling_nemo_ppo.py:13-30`:

from apex.transformer import parallel_state, tensor_parallel
from apex.transformer.pipeline_parallel.utils import _reconfigure_microbatch_calculator
from nemo.collections.nlp.models.language_modeling.megatron_gpt_model import MegatronGPTModel
from nemo.collections.nlp.models.language_modeling.megatron_base_model import MegatronBaseModel

Common Errors

Error Message | Cause | Solution
`ImportError: NeMo is not installed` | NeMo toolkit not present | Install NeMo r1.15.0 from source (see Quick Install)
`ModuleNotFoundError: apex` | NVIDIA Apex not built | Build Apex from source with CUDA extensions
`megatron_legacy` config warning | Older NeMo checkpoint format | Set `megatron_legacy: True` in model config

Compatibility Notes

  • Version pinned: Only NeMo `r1.15.0` is supported. Later versions may have breaking API changes.
  • Separate environment: NeMo and Apex have strict version requirements that may conflict with the base Accelerate environment. A dedicated conda/virtual environment is recommended.
  • Pretrained models: NeMo `.nemo` checkpoints must be un-tarred before use. Set `train.trainer_kwargs.pretrained_model` to the extracted directory path.
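Because a `.nemo` checkpoint is a plain tar archive, the un-tar step above can be sketched with the standard library; the file and directory names here are illustrative:

```python
import tarfile
from pathlib import Path

def extract_nemo(nemo_file, dest):
    """Extract a .nemo checkpoint (a tar archive) into dest.

    The resulting directory is what train.trainer_kwargs.pretrained_model
    should point at. Assumes the archive is trusted.
    """
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(nemo_file) as tar:
        tar.extractall(dest)
    return dest
```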
