# Heuristic: RMSprop over Adam in Eric Mitchell's Direct Preference Optimization (DPO) codebase
| Knowledge Sources | |
|---|---|
| Domains | Optimization, Deep_Learning |
| Last Updated | 2026-02-08 02:00 GMT |
## Overview
Use RMSprop instead of Adam as the default optimizer to reduce memory usage while maintaining comparable training performance.
## Description
The DPO codebase uses RMSprop as its default optimizer instead of the more common Adam or AdamW. This is an explicit design choice documented in the config file: "We use RMSprop because it works about as well as Adam and is more memory-efficient." At default settings, RMSprop stores one buffer per parameter (the running average of squared gradients), while Adam stores two (first- and second-moment estimates), so RMSprop needs roughly 50% less optimizer state memory.
## Usage
Use this heuristic when memory efficiency is important, especially for large models (6.9B+ parameters) where optimizer state can consume significant VRAM. The default config sets `optimizer: RMSprop`. To override, pass `optimizer=AdamW` on the command line.
## The Insight (Rule of Thumb)
- Action: Use `optimizer=RMSprop` (the default in config.yaml).
- Value: RMSprop with default PyTorch parameters and `lr=5e-7`.
- Trade-off: ~50% less optimizer state memory than Adam (1 buffer vs. 2 buffers per parameter), with comparable training performance for DPO/SFT. Adam may converge slightly faster in some cases.
- Compatibility: Works with all trainer classes. The optimizer is instantiated dynamically via `getattr(torch.optim, config.optimizer)`.
## Reasoning
For a model with N parameters in FP32:
- Adam/AdamW: Requires 2N * 4 bytes = 8N bytes of optimizer state (first moment + second moment).
- RMSprop: Requires 1N * 4 bytes = 4N bytes of optimizer state (running average of squared gradients).
For a 6.9B parameter model, this saves approximately 27.6GB of memory. The DPO authors explicitly chose RMSprop after finding it performs comparably to Adam for their use case.
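The arithmetic above can be checked with a short sketch (the `optimizer_state_bytes` helper is ours for illustration, not from the codebase):

```python
# Optimizer-state memory for an FP32 model with N parameters:
# Adam/AdamW keeps two FP32 buffers per parameter, while RMSprop
# (default settings: momentum=0, centered=False) keeps one.

def optimizer_state_bytes(n_params: int, n_buffers: int, bytes_per_elem: int = 4) -> int:
    """Bytes of optimizer state: one buffer element per parameter, per buffer."""
    return n_params * n_buffers * bytes_per_elem

N = 6_900_000_000                            # 6.9B-parameter model
adam_bytes = optimizer_state_bytes(N, 2)     # first + second moment
rmsprop_bytes = optimizer_state_bytes(N, 1)  # squared-gradient average
print((adam_bytes - rmsprop_bytes) / 1e9)    # → 27.6 (GB saved)
```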
Code evidence from `config/config.yaml:79-80`:

```yaml
# The optimizer to use; we use RMSprop because it works about as well as Adam and is more memory-efficient
optimizer: RMSprop
```
Optimizer instantiation in `trainers.py:276`:

```python
rank0_print(f'Using {self.config.optimizer} optimizer')
self.optimizer = getattr(torch.optim, self.config.optimizer)(self.policy.parameters(), lr=self.config.lr)
```
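A small sanity check (ours, not part of the codebase) confirms the buffer counts with stock PyTorch: after one step, RMSprop's per-parameter state holds a single parameter-shaped tensor (`square_avg`), while Adam holds two (`exp_avg` and `exp_avg_sq`).

```python
import torch

# Count parameter-shaped state tensors each optimizer allocates.
# PyTorch creates optimizer state lazily, on the first step().
state_buffers = {}
for name in ("RMSprop", "Adam"):
    model = torch.nn.Linear(4, 4)
    p = next(model.parameters())
    # Same dynamic lookup the trainer uses: resolve the class by name.
    opt = getattr(torch.optim, name)(model.parameters(), lr=5e-7)
    p.grad = torch.zeros_like(p)
    opt.step()
    state_buffers[name] = sorted(
        k for k, v in opt.state[p].items()
        if torch.is_tensor(v) and v.shape == p.shape
    )

print(state_buffers)  # RMSprop: ['square_avg']; Adam: ['exp_avg', 'exp_avg_sq']
```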