Heuristic: NVIDIA NeMo-Aligner Adam State Offloading Tip
| Knowledge Sources | |
|---|---|
| Domains | Optimization, Memory_Management, PPO |
| Last Updated | 2026-02-07 22:00 GMT |
Overview
Memory optimization technique that offloads distributed Adam optimizer states (parameter shards, exponential moving averages) to CPU during the generation/rollout phase to free GPU VRAM.
Description
In PPO and REINFORCE training, the generation (rollout) phase and the training phase alternate. During generation, the optimizer states are not needed but still occupy significant GPU memory. NeMo-Aligner provides a context manager `offload_distributed_adam()` that moves the Adam optimizer's state tensors (`params_shard`, `param_remainders_shard`, `exp_avg_shard`, `exp_avg_sq_shard`) to CPU memory before generation and moves them back to GPU before the training step. This can free several gigabytes of VRAM, allowing larger batch sizes or longer sequences during rollout.
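The offload-restore pattern can be sketched with stand-in objects, no NeMo or GPU required. `StateBucket`, `offload_states`, and the `"cpu"`/`"gpu"` strings below are illustrative stand-ins for the real Apex bucket objects and torch devices, not NeMo-Aligner API:

```python
from contextlib import contextmanager

# Stand-ins for illustration: real buckets hold tensors, and "cpu"/"gpu"
# strings here stand in for actual torch devices.
class StateBucket:
    def __init__(self):
        self.params_shard = "gpu"
        self.param_remainders_shard = "gpu"
        self.exp_avg_shard = "gpu"
        self.exp_avg_sq_shard = "gpu"

ATTRS = ["params_shard", "param_remainders_shard", "exp_avg_shard", "exp_avg_sq_shard"]

def move_bucket(bucket, device):
    for attr in ATTRS:
        setattr(bucket, attr, device)

@contextmanager
def offload_states(buckets):
    # Push every shard to the CPU for the duration of the block ...
    for b in buckets:
        move_bucket(b, "cpu")
    try:
        yield
    finally:
        # ... and restore them to the GPU even if generation raises.
        for b in buckets:
            move_bucket(b, "gpu")

buckets = [StateBucket() for _ in range(2)]
with offload_states(buckets):
    inside = buckets[0].exp_avg_shard  # "cpu" while generating
after = buckets[0].exp_avg_shard       # "gpu" once training resumes
```

The `try`/`finally` shape matters: the states are restored to the GPU even if the rollout raises, so the subsequent training step never sees CPU-resident optimizer state.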
Usage
Use this heuristic when running PPO or REINFORCE training and encountering OOM errors during the rollout/generation phase. Enable via `model.ppo.offload_adam_states=True` (or equivalent REINFORCE config). This is especially important for large models (7B+) where optimizer states can consume 2-4x the model parameter memory.
The Insight (Rule of Thumb)
- Action: Set `offload_adam_states: True` in the PPO/REINFORCE actor config.
- Value: Boolean flag.
- Trade-off: Adds CPU-GPU transfer overhead at the boundary between generation and training phases. The transfers are issued as non-blocking copies but are fenced with a `torch.cuda.synchronize()` barrier at each boundary.
- Memory Savings: Frees at least roughly 2x the model's parameter memory (Adam's two running averages per parameter), and more in practice, since the master parameter shards and parameter remainders are offloaded as well.
Reasoning
The distributed Adam optimizer maintains four state tensors per parameter bucket: the parameter shard itself, the parameter remainders (for mixed precision), the first moment (`exp_avg`), and the second moment (`exp_avg_sq`). For a 7B-parameter model with BF16 training, these states can occupy 28GB+ of VRAM. Since they are only needed during the backward pass and optimizer step, offloading them during the forward-only generation phase is a safe and effective memory optimization.
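Back-of-envelope arithmetic behind these figures (assumptions: FP32 Adam moments at 4 bytes per element, BF16 weights at 2 bytes, and an illustrative 8-way data-parallel sharding; master params and remainders add further savings on top):

```python
# Rough arithmetic behind the "optimizer states ~2-4x model memory" claim.
# Assumption: FP32 Adam moments (4 bytes/element), BF16 weights (2 bytes).
n_params = 7_000_000_000
weights_gb = n_params * 2 / 1e9          # BF16 model copy: 14 GB
moments_gb = n_params * 4 * 2 / 1e9      # exp_avg + exp_avg_sq: 56 GB total
ratio = moments_gb / weights_gb          # 4.0x the weight memory
# Distributed Adam shards these across data-parallel ranks, so the amount
# freed per GPU is moments_gb / dp_size (plus master params and remainders).
per_gpu_freed_gb = moments_gb / 8        # e.g. 7 GB per GPU at DP=8
```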
Code Evidence
Offloading context manager from `nemo_aligner/utils/utils.py:266-299`:
```python
from contextlib import contextmanager

import torch


def dist_adam_load_state_bucket_into_device(state_bucket, device):
    """put the state bucket onto a device"""
    attrs_to_offload = ["params_shard", "param_remainders_shard", "exp_avg_shard", "exp_avg_sq_shard"]

    for attr in attrs_to_offload:
        tensor = getattr(state_bucket, attr)
        if tensor is not None:
            # Non-blocking copy; callers must synchronize before relying on it.
            setattr(state_bucket, attr, tensor.to(device=device, non_blocking=True))


@contextmanager
def offload_distributed_adam(state_dict, force_clear_memory=False):
    """context manager to offload distributed adam states"""
    # Move every state bucket to the CPU before the generation phase.
    for state_bucket in state_dict["state"]["buckets"]:
        dist_adam_load_state_bucket_into_device(state_bucket, device="cpu")
    torch.cuda.synchronize()

    if force_clear_memory:
        clear_memory()  # module-level helper that empties the CUDA cache

    try:
        yield
    finally:
        # Restore the buckets to the GPU before training resumes.
        for state_bucket in state_dict["state"]["buckets"]:
            dist_adam_load_state_bucket_into_device(state_bucket, device=torch.cuda.current_device())
        torch.cuda.synchronize()
```
Config default from `examples/nlp/gpt/conf/gpt_ppo_actor.yaml:152`:

```yaml
offload_adam_states: True
```