Heuristic:OpenRLHF Adam Offload Memory Tip

From Leeroopedia




Knowledge Sources
Domains Optimization, LLMs, Distributed_Training
Last Updated 2026-02-07 10:00 GMT

Overview

Use `--adam_offload` to move optimizer states to CPU, freeing GPU VRAM at the cost of training speed.

Description

The Adam optimizer maintains two state tensors per parameter (the first and second moments). Stored at the same precision as the weights, these states roughly triple the memory footprint compared to the weights alone; in mixed-precision training they are typically kept in fp32 and cost proportionally more relative to bf16 weights. With `--adam_offload`, OpenRLHF switches from `FusedAdam` (GPU-resident, fast) to `DeepSpeedCPUAdam` (CPU-resident, slower), moving the optimizer states to system RAM. This frees substantial GPU VRAM, enabling training of larger models or larger batch sizes. When `--adam_offload` is active, additional state offloading via `offload_deepspeed_states()` is automatically skipped, since the states are already on CPU.
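The two behaviors described above can be sketched as plain Python. This is a hypothetical condensation, not OpenRLHF's actual code: the function names are invented here, and class names stand in for the real DeepSpeed optimizer classes.

```python
# Hypothetical sketch of the two behaviors: optimizer selection and the
# auto-skip of redundant state offloading. Names are illustrative only.
def select_optimizer_name(adam_offload: bool) -> str:
    # CPU-resident Adam when offloading states to system RAM,
    # fused GPU-resident Adam otherwise.
    return "DeepSpeedCPUAdam" if adam_offload else "FusedAdam"

def should_offload_states(adam_offload: bool) -> bool:
    # Separate state offloading is redundant when the optimizer
    # states already live on CPU.
    return not adam_offload
```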

Usage

Use this heuristic when you are VRAM-constrained during training, especially with large models (7B+ parameters). It pairs well with gradient checkpointing for maximum memory savings. Omit the flag (do not pass `--adam_offload`) when you have sufficient GPU memory and want maximum training speed.
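A boolean flag like this is typically wired up as a store-true argument. The snippet below is a minimal sketch of that pattern; OpenRLHF's actual argument parser defines many more options.

```python
import argparse

# Minimal sketch of a --adam_offload style flag; the help text here
# paraphrases this page, not OpenRLHF's own CLI documentation.
parser = argparse.ArgumentParser()
parser.add_argument("--adam_offload", action="store_true",
                    help="Offload Adam optimizer states to CPU RAM")

args = parser.parse_args(["--adam_offload"])
print(args.adam_offload)  # True when the flag is passed
```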

The Insight (Rule of Thumb)

  • Action: Add `--adam_offload` to the training command.
  • Value: Moves the Adam momentum and variance states (8 bytes per parameter in fp32) from GPU to CPU RAM.
  • Trade-off: Slower training due to CPU-GPU data transfer for optimizer steps. FusedAdam is significantly faster than DeepSpeedCPUAdam.
  • Interaction: When adam_offload is active, additional state offloading is redundant and automatically skipped.

Reasoning

For a 7B-parameter model in bf16, the model weights occupy ~14 GB of VRAM. Adam's momentum and variance tensors are kept in fp32, adding another ~56 GB (two fp32 tensors at ~28 GB each). Offloading these states to CPU frees that VRAM for larger batches, gradient accumulation, or activation storage. The trade-off is additional PCIe bandwidth usage for CPU-GPU transfer during each optimizer step.
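The estimate can be checked with quick arithmetic, assuming bf16 weights (2 bytes per parameter) and fp32 Adam states (4 bytes per parameter for each of the two moments):

```python
# Back-of-the-envelope VRAM math for a 7B-parameter model.
params = 7e9
GB = 1e9  # decimal gigabytes

weights_gb = params * 2 / GB          # bf16 weights: 2 bytes/param
adam_states_gb = params * 2 * 4 / GB  # fp32 momentum + variance: 8 bytes/param

print(f"weights: ~{weights_gb:.0f} GB")      # ~14 GB
print(f"Adam states: ~{adam_states_gb:.0f} GB")  # ~56 GB
```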

Code evidence from `openrlhf/utils/deepspeed/deepspeed.py:138`:

AdamOptimizer = DeepSpeedCPUAdam if self.adam_offload else FusedAdam

DeepSpeed config from `openrlhf/utils/deepspeed/deepspeed_utils.py:24-26`:

"offload_optimizer": {
    "device": "cpu" if adam_offload else "none",
    "pin_memory": True,
},

Auto-skip of redundant state offloading from `openrlhf/utils/deepspeed/deepspeed_utils.py:147-149`:

# state offloading not required when using Adam optimizer offloading
if adam_offload:
    return
