Heuristic:OpenRLHF Adam Offload Memory Tip

From Leeroopedia




Knowledge Sources
Domains Optimization, LLMs, Distributed_Training
Last Updated 2026-02-07 10:00 GMT

Overview

Use `--adam_offload` to move optimizer states to CPU, freeing GPU VRAM at the cost of training speed.

Description

The Adam optimizer maintains two state tensors per parameter (the first and second moments). Stored at the same precision as the weights, these states roughly triple the memory footprint compared to the weights alone; in mixed-precision training they are typically kept in fp32 and cost proportionally more relative to bf16 weights. With `--adam_offload`, OpenRLHF switches from `FusedAdam` (GPU-resident, fast) to `DeepSpeedCPUAdam` (CPU-resident, slower), moving the optimizer states to system RAM. This frees substantial GPU VRAM, enabling training of larger models or larger batch sizes. When `--adam_offload` is active, additional state offloading via `offload_deepspeed_states()` is automatically skipped, since the states are already on CPU.
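The two behaviors described above can be sketched as plain Python. This is a hypothetical condensation, not OpenRLHF's actual code: the function names are invented here, and class names stand in for the real DeepSpeed optimizer classes.

```python
# Hypothetical sketch of the two behaviors: optimizer selection and the
# auto-skip of redundant state offloading. Names are illustrative only.
def select_optimizer_name(adam_offload: bool) -> str:
    # CPU-resident Adam when offloading states to system RAM,
    # fused GPU-resident Adam otherwise.
    return "DeepSpeedCPUAdam" if adam_offload else "FusedAdam"

def should_offload_states(adam_offload: bool) -> bool:
    # Separate state offloading is redundant when the optimizer
    # states already live on CPU.
    return not adam_offload
```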

Usage

Use this heuristic when you are VRAM-constrained during training, especially with large models (7B+ parameters). It pairs well with gradient checkpointing for maximum memory savings. Omit the flag (do not pass `--adam_offload`) when you have sufficient GPU memory and want maximum training speed.
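A boolean flag like this is typically wired up as a store-true argument. The snippet below is a minimal sketch of that pattern; OpenRLHF's actual argument parser defines many more options.

```python
import argparse

# Minimal sketch of a --adam_offload style flag; the help text here
# paraphrases this page, not OpenRLHF's own CLI documentation.
parser = argparse.ArgumentParser()
parser.add_argument("--adam_offload", action="store_true",
                    help="Offload Adam optimizer states to CPU RAM")

args = parser.parse_args(["--adam_offload"])
print(args.adam_offload)  # True when the flag is passed
```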

The Insight (Rule of Thumb)

  • Action: Add `--adam_offload` to the training command.
  • Value: Moves the Adam momentum and variance states (8 bytes per parameter in fp32) from GPU to CPU RAM.
  • Trade-off: Slower training due to CPU-GPU data transfer for optimizer steps. FusedAdam is significantly faster than DeepSpeedCPUAdam.
  • Interaction: When adam_offload is active, additional state offloading is redundant and automatically skipped.

Reasoning

For a 7B-parameter model in bf16, the model weights occupy ~14 GB of VRAM. Adam's momentum and variance tensors are kept in fp32, adding another ~56 GB (two fp32 tensors at ~28 GB each). Offloading these states to CPU frees that VRAM for larger batches, gradient accumulation, or activation storage. The trade-off is additional PCIe bandwidth usage for CPU-GPU transfer during each optimizer step.
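The estimate can be checked with quick arithmetic, assuming bf16 weights (2 bytes per parameter) and fp32 Adam states (4 bytes per parameter for each of the two moments):

```python
# Back-of-the-envelope VRAM math for a 7B-parameter model.
params = 7e9
GB = 1e9  # decimal gigabytes

weights_gb = params * 2 / GB          # bf16 weights: 2 bytes/param
adam_states_gb = params * 2 * 4 / GB  # fp32 momentum + variance: 8 bytes/param

print(f"weights: ~{weights_gb:.0f} GB")      # ~14 GB
print(f"Adam states: ~{adam_states_gb:.0f} GB")  # ~56 GB
```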

Code evidence from `openrlhf/utils/deepspeed/deepspeed.py:138`:

AdamOptimizer = DeepSpeedCPUAdam if self.adam_offload else FusedAdam

DeepSpeed config from `openrlhf/utils/deepspeed/deepspeed_utils.py:24-26`:

"offload_optimizer": {
    "device": "cpu" if adam_offload else "none",
    "pin_memory": True,
},

Auto-skip of redundant state offloading from `openrlhf/utils/deepspeed/deepspeed_utils.py:147-149`:

# state offloading not required when using Adam optimizer offloading
if adam_offload:
    return
