Principle:OpenRLHF Optimizer and Scheduler Setup
| Knowledge Sources | Details |
|---|---|
| Domains | Optimization, Training_Infrastructure |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
A configuration step that creates hardware-optimized Adam optimizers and learning rate schedulers for distributed training.
Description
Optimizer and Scheduler Setup selects between two DeepSpeed Adam implementations based on the CPU offloading configuration: FusedAdam (GPU-only, fastest) or DeepSpeedCPUAdam (optimizer states offloaded to CPU for memory savings). Learning rate scheduling uses HuggingFace's get_scheduler, typically a cosine schedule with warmup, though other schedules are supported.
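A minimal sketch of that branch, assuming DeepSpeed's FusedAdam and DeepSpeedCPUAdam classes; the wrapper function and its offload flag are illustrative, not OpenRLHF's exact internals:

```python
# Sketch of the optimizer branch described above. The wrapper and its
# `offload` flag are illustrative; the two optimizer classes and their
# constructor arguments are standard DeepSpeed API.
from deepspeed.ops.adam import DeepSpeedCPUAdam, FusedAdam


def build_adam(model, offload: bool, lr: float = 5e-7,
               betas=(0.9, 0.95), weight_decay: float = 0.0):
    # DeepSpeedCPUAdam keeps optimizer states in CPU RAM (paired with
    # ZeRO offload); FusedAdam runs fused CUDA kernels entirely on GPU.
    AdamOptimizer = DeepSpeedCPUAdam if offload else FusedAdam
    return AdamOptimizer(model.parameters(), lr=lr, betas=betas,
                         weight_decay=weight_decay)
```

Note that DeepSpeedCPUAdam only pays off when optimizer states are actually offloaded in the DeepSpeed config; without offloading, the CPU round trip is pure overhead.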
Usage
Use after model loading and before training. The optimizer is created via strategy.create_optimizer and the scheduler via HuggingFace's get_scheduler.
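A hedged end-to-end sketch of this step. strategy.create_optimizer comes from the text above; the keyword arguments, the 3% warmup ratio, and the cosine_with_min_lr schedule name (available in recent transformers releases) are illustrative assumptions:

```python
# End-to-end sketch: optimizer via the strategy helper, scheduler via
# HuggingFace. Argument names and defaults here are illustrative.
import math

from transformers import get_scheduler


def setup_optim_and_scheduler(strategy, model, num_training_steps: int,
                              lr: float = 5e-7, min_lr: float = 5e-8,
                              warmup_ratio: float = 0.03):
    # FusedAdam or DeepSpeedCPUAdam, depending on the offload setting.
    optim = strategy.create_optimizer(model, lr=lr, betas=(0.9, 0.95),
                                      weight_decay=0.0)
    scheduler = get_scheduler(
        "cosine_with_min_lr",  # linear warmup, then cosine decay to min_lr
        optim,
        num_warmup_steps=math.ceil(num_training_steps * warmup_ratio),
        num_training_steps=num_training_steps,
        scheduler_specific_kwargs={"min_lr": min_lr},
    )
    return optim, scheduler
```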
Theoretical Basis
Adam optimizer: Updates parameters using bias-corrected adaptive estimates of the gradient's first and second moments (see the update equations after this list).
FusedAdam: Fuses multiple CUDA kernels for efficiency on GPU.
DeepSpeedCPUAdam: Offloads optimizer states to CPU RAM, roughly halving GPU memory use at the cost of CPU-GPU communication overhead.
Cosine scheduler with warmup: Linear warmup followed by cosine decay to a minimum learning rate (see the formula after this list).
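For reference, the Adam update mentioned in the first item, in its standard form, with gradient g_t, decay rates beta_1 and beta_2, step size eta, and stability constant epsilon:

```latex
\begin{aligned}
m_t &= \beta_1 m_{t-1} + (1-\beta_1)\, g_t, &
v_t &= \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2, \\
\hat m_t &= \frac{m_t}{1-\beta_1^{\,t}}, &
\hat v_t &= \frac{v_t}{1-\beta_2^{\,t}}, \\
\theta_t &= \theta_{t-1} - \eta\, \frac{\hat m_t}{\sqrt{\hat v_t} + \epsilon}.
\end{aligned}
```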
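And one common form of the warmup-plus-cosine schedule from the last item, with warmup length T_w, total steps T, peak learning rate eta_max, and floor eta_min; the exact shape depends on the schedule name passed to get_scheduler:

```latex
\eta_t =
\begin{cases}
\eta_{\max} \cdot \dfrac{t}{T_w}, & t < T_w, \\[1ex]
\eta_{\min} + \dfrac{\eta_{\max}-\eta_{\min}}{2}
  \left(1 + \cos\!\left(\pi \cdot \dfrac{t-T_w}{T-T_w}\right)\right), & t \ge T_w.
\end{cases}
```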
Related Pages
Implemented By