Heuristic: NVIDIA NeMo-Aligner Adam State Offloading Tip
| Knowledge Sources | |
|---|---|
| Domains | Optimization, Memory_Management, PPO |
| Last Updated | 2026-02-07 22:00 GMT |
Overview
Memory optimization technique that offloads distributed Adam optimizer states (parameter shards, exponential moving averages) to CPU during the generation/rollout phase to free GPU VRAM.
Description
In PPO and REINFORCE training, the generation (rollout) phase and the training phase alternate. During generation, the optimizer states are not needed but still occupy significant GPU memory. NeMo-Aligner provides a context manager `offload_distributed_adam()` that moves the Adam optimizer's state tensors (`params_shard`, `param_remainders_shard`, `exp_avg_shard`, `exp_avg_sq_shard`) to CPU memory before generation and moves them back to GPU before the training step. This can free several gigabytes of VRAM, allowing larger batch sizes or longer sequences during rollout.
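The offload-restore pattern can be sketched with stand-in objects, no NeMo or GPU required. `StateBucket`, `offload_states`, and the `"cpu"`/`"gpu"` strings below are illustrative stand-ins for the real Apex bucket objects and torch devices, not NeMo-Aligner API:

```python
from contextlib import contextmanager

# Stand-ins for illustration: real buckets hold tensors, and "cpu"/"gpu"
# strings here stand in for actual torch devices.
class StateBucket:
    def __init__(self):
        self.params_shard = "gpu"
        self.param_remainders_shard = "gpu"
        self.exp_avg_shard = "gpu"
        self.exp_avg_sq_shard = "gpu"

ATTRS = ["params_shard", "param_remainders_shard", "exp_avg_shard", "exp_avg_sq_shard"]

def move_bucket(bucket, device):
    for attr in ATTRS:
        setattr(bucket, attr, device)

@contextmanager
def offload_states(buckets):
    # Push every shard to the CPU for the duration of the block ...
    for b in buckets:
        move_bucket(b, "cpu")
    try:
        yield
    finally:
        # ... and restore them to the GPU even if generation raises.
        for b in buckets:
            move_bucket(b, "gpu")

buckets = [StateBucket() for _ in range(2)]
with offload_states(buckets):
    inside = buckets[0].exp_avg_shard  # "cpu" while generating
after = buckets[0].exp_avg_shard       # "gpu" once training resumes
```

The `try`/`finally` shape matters: the states are restored to the GPU even if the rollout raises, so the subsequent training step never sees CPU-resident optimizer state.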
Usage
Use this heuristic when running PPO or REINFORCE training and encountering OOM errors during the rollout/generation phase. Enable via `model.ppo.offload_adam_states=True` (or equivalent REINFORCE config). This is especially important for large models (7B+) where optimizer states can consume 2-4x the model parameter memory.
The Insight (Rule of Thumb)
- Action: Set `offload_adam_states: True` in the PPO/REINFORCE actor config.
- Value: Boolean flag.
- Trade-off: Adds CPU-GPU transfer overhead at the boundary between generation and training phases. The transfers are issued as non-blocking copies but are fenced with a `torch.cuda.synchronize()` barrier at each boundary.
- Memory Savings: Frees at least roughly 2x the model's parameter memory (Adam's two running averages per parameter), and more in practice, since the master parameter shards and parameter remainders are offloaded as well.
Reasoning
The distributed Adam optimizer maintains four state tensors per parameter bucket: the parameter shard itself, the parameter remainders (for mixed precision), the first moment (`exp_avg`), and the second moment (`exp_avg_sq`). For a 7B-parameter model with BF16 training, these states can occupy 28GB+ of VRAM. Since they are only needed during the backward pass and optimizer step, offloading them during the forward-only generation phase is a safe and effective memory optimization.
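Back-of-envelope arithmetic behind these figures (assumptions: FP32 Adam moments at 4 bytes per element, BF16 weights at 2 bytes, and an illustrative 8-way data-parallel sharding; master params and remainders add further savings on top):

```python
# Rough arithmetic behind the "optimizer states ~2-4x model memory" claim.
# Assumption: FP32 Adam moments (4 bytes/element), BF16 weights (2 bytes).
n_params = 7_000_000_000
weights_gb = n_params * 2 / 1e9          # BF16 model copy: 14 GB
moments_gb = n_params * 4 * 2 / 1e9      # exp_avg + exp_avg_sq: 56 GB total
ratio = moments_gb / weights_gb          # 4.0x the weight memory
# Distributed Adam shards these across data-parallel ranks, so the amount
# freed per GPU is moments_gb / dp_size (plus master params and remainders).
per_gpu_freed_gb = moments_gb / 8        # e.g. 7 GB per GPU at DP=8
```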
Code Evidence
Offloading context manager from `nemo_aligner/utils/utils.py:266-299`:
```python
from contextlib import contextmanager

import torch


def dist_adam_load_state_bucket_into_device(state_bucket, device):
    """put the state bucket onto a device"""
    attrs_to_offload = ["params_shard", "param_remainders_shard", "exp_avg_shard", "exp_avg_sq_shard"]

    for attr in attrs_to_offload:
        tensor = getattr(state_bucket, attr)
        if tensor is not None:
            # Non-blocking copy; callers must synchronize before relying on it.
            setattr(state_bucket, attr, tensor.to(device=device, non_blocking=True))


@contextmanager
def offload_distributed_adam(state_dict, force_clear_memory=False):
    """context manager to offload distributed adam states"""
    # Move every state bucket to the CPU before the generation phase.
    for state_bucket in state_dict["state"]["buckets"]:
        dist_adam_load_state_bucket_into_device(state_bucket, device="cpu")
    torch.cuda.synchronize()

    if force_clear_memory:
        clear_memory()  # module-level helper that empties the CUDA cache

    try:
        yield
    finally:
        # Restore the buckets to the GPU before training resumes.
        for state_bucket in state_dict["state"]["buckets"]:
            dist_adam_load_state_bucket_into_device(state_bucket, device=torch.cuda.current_device())
        torch.cuda.synchronize()
```
Config default from `examples/nlp/gpt/conf/gpt_ppo_actor.yaml:152`:

```yaml
offload_adam_states: True
```