Principle:OpenRLHF Optimizer and Scheduler Setup
| Knowledge Sources | Details |
|---|---|
| Domains | Optimization, Training_Infrastructure |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
A configuration step that creates hardware-optimized Adam optimizers and learning rate schedulers for distributed training.
Description
Optimizer and Scheduler Setup selects between two DeepSpeed Adam implementations based on the CPU offloading configuration: FusedAdam (GPU-only, fastest) or DeepSpeedCPUAdam (optimizer states offloaded to CPU for memory savings). Learning rate scheduling uses HuggingFace's get_scheduler, typically a cosine schedule with warmup, though other schedules are supported.
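A minimal sketch of that branch, assuming DeepSpeed's FusedAdam and DeepSpeedCPUAdam classes; the wrapper function and its offload flag are illustrative, not OpenRLHF's exact internals:

```python
# Sketch of the optimizer branch described above. The wrapper and its
# `offload` flag are illustrative; the two optimizer classes and their
# constructor arguments are standard DeepSpeed API.
from deepspeed.ops.adam import DeepSpeedCPUAdam, FusedAdam


def build_adam(model, offload: bool, lr: float = 5e-7,
               betas=(0.9, 0.95), weight_decay: float = 0.0):
    # DeepSpeedCPUAdam keeps optimizer states in CPU RAM (paired with
    # ZeRO offload); FusedAdam runs fused CUDA kernels entirely on GPU.
    AdamOptimizer = DeepSpeedCPUAdam if offload else FusedAdam
    return AdamOptimizer(model.parameters(), lr=lr, betas=betas,
                         weight_decay=weight_decay)
```

Note that DeepSpeedCPUAdam only pays off when optimizer states are actually offloaded in the DeepSpeed config; without offloading, the CPU round trip is pure overhead.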
Usage
Use after model loading and before training. The optimizer is created via strategy.create_optimizer and the scheduler via HuggingFace's get_scheduler.
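A hedged end-to-end sketch of this step. strategy.create_optimizer comes from the text above; the keyword arguments, the 3% warmup ratio, and the cosine_with_min_lr schedule name (available in recent transformers releases) are illustrative assumptions:

```python
# End-to-end sketch: optimizer via the strategy helper, scheduler via
# HuggingFace. Argument names and defaults here are illustrative.
import math

from transformers import get_scheduler


def setup_optim_and_scheduler(strategy, model, num_training_steps: int,
                              lr: float = 5e-7, min_lr: float = 5e-8,
                              warmup_ratio: float = 0.03):
    # FusedAdam or DeepSpeedCPUAdam, depending on the offload setting.
    optim = strategy.create_optimizer(model, lr=lr, betas=(0.9, 0.95),
                                      weight_decay=0.0)
    scheduler = get_scheduler(
        "cosine_with_min_lr",  # linear warmup, then cosine decay to min_lr
        optim,
        num_warmup_steps=math.ceil(num_training_steps * warmup_ratio),
        num_training_steps=num_training_steps,
        scheduler_specific_kwargs={"min_lr": min_lr},
    )
    return optim, scheduler
```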
Theoretical Basis
Adam optimizer: Updates parameters using bias-corrected adaptive estimates of the gradient's first and second moments (see the update equations after this list).
FusedAdam: Fuses multiple CUDA kernels for efficiency on GPU.
DeepSpeedCPUAdam: Offloads optimizer states to CPU RAM, roughly halving GPU memory use at the cost of CPU-GPU communication overhead.
Cosine scheduler with warmup: Linear warmup followed by cosine decay to a minimum learning rate (see the formula after this list).
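For reference, the Adam update mentioned in the first item, in its standard form, with gradient g_t, decay rates beta_1 and beta_2, step size eta, and stability constant epsilon:

```latex
\begin{aligned}
m_t &= \beta_1 m_{t-1} + (1-\beta_1)\, g_t, &
v_t &= \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2, \\
\hat m_t &= \frac{m_t}{1-\beta_1^{\,t}}, &
\hat v_t &= \frac{v_t}{1-\beta_2^{\,t}}, \\
\theta_t &= \theta_{t-1} - \eta\, \frac{\hat m_t}{\sqrt{\hat v_t} + \epsilon}.
\end{aligned}
```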
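And one common form of the warmup-plus-cosine schedule from the last item, with warmup length T_w, total steps T, peak learning rate eta_max, and floor eta_min; the exact shape depends on the schedule name passed to get_scheduler:

```latex
\eta_t =
\begin{cases}
\eta_{\max} \cdot \dfrac{t}{T_w}, & t < T_w, \\[1ex]
\eta_{\min} + \dfrac{\eta_{\max}-\eta_{\min}}{2}
  \left(1 + \cos\!\left(\pi \cdot \dfrac{t-T_w}{T-T_w}\right)\right), & t \ge T_w.
\end{cases}
```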
Related Pages
Implemented By