# Heuristic: RMSprop over Adam in Eric Mitchell's Direct Preference Optimization (DPO) codebase
| Knowledge Sources | |
|---|---|
| Domains | Optimization, Deep_Learning |
| Last Updated | 2026-02-08 02:00 GMT |
## Overview
Use RMSprop instead of Adam as the default optimizer to reduce memory usage while maintaining comparable training performance.
## Description
The DPO codebase uses RMSprop as its default optimizer instead of the more common Adam or AdamW. This is an explicit design choice documented in the config file: "We use RMSprop because it works about as well as Adam and is more memory-efficient." At default settings, RMSprop stores one buffer per parameter (the running average of squared gradients), while Adam stores two (first- and second-moment estimates), so RMSprop needs roughly 50% less optimizer state memory.
## Usage
Use this heuristic when memory efficiency is important, especially for large models (6.9B+ parameters) where optimizer state can consume significant VRAM. The default config sets `optimizer: RMSprop`. To override, pass `optimizer=AdamW` on the command line.
## The Insight (Rule of Thumb)
- Action: Use `optimizer=RMSprop` (the default in config.yaml).
- Value: RMSprop with default PyTorch parameters and `lr=5e-7`.
- Trade-off: ~50% less optimizer state memory than Adam (1 buffer vs. 2 buffers per parameter), with comparable training performance for DPO/SFT. Adam may converge slightly faster in some cases.
- Compatibility: Works with all trainer classes. The optimizer is instantiated dynamically via `getattr(torch.optim, config.optimizer)`.
## Reasoning
For a model with N parameters in FP32:
- Adam/AdamW: Requires 2N * 4 bytes = 8N bytes of optimizer state (first moment + second moment).
- RMSprop: Requires 1N * 4 bytes = 4N bytes of optimizer state (running average of squared gradients).
For a 6.9B parameter model, this saves approximately 27.6GB of memory. The DPO authors explicitly chose RMSprop after finding it performs comparably to Adam for their use case.
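The arithmetic above can be checked with a short sketch (the `optimizer_state_bytes` helper is ours for illustration, not from the codebase):

```python
# Optimizer-state memory for an FP32 model with N parameters:
# Adam/AdamW keeps two FP32 buffers per parameter, while RMSprop
# (default settings: momentum=0, centered=False) keeps one.

def optimizer_state_bytes(n_params: int, n_buffers: int, bytes_per_elem: int = 4) -> int:
    """Bytes of optimizer state: one buffer element per parameter, per buffer."""
    return n_params * n_buffers * bytes_per_elem

N = 6_900_000_000                            # 6.9B-parameter model
adam_bytes = optimizer_state_bytes(N, 2)     # first + second moment
rmsprop_bytes = optimizer_state_bytes(N, 1)  # squared-gradient average
print((adam_bytes - rmsprop_bytes) / 1e9)    # → 27.6 (GB saved)
```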
Code evidence from `config/config.yaml:79-80`:

```yaml
# The optimizer to use; we use RMSprop because it works about as well as Adam and is more memory-efficient
optimizer: RMSprop
```
Optimizer instantiation in `trainers.py:276`:

```python
rank0_print(f'Using {self.config.optimizer} optimizer')
self.optimizer = getattr(torch.optim, self.config.optimizer)(self.policy.parameters(), lr=self.config.lr)
```
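A small sanity check (ours, not part of the codebase) confirms the buffer counts with stock PyTorch: after one step, RMSprop's per-parameter state holds a single parameter-shaped tensor (`square_avg`), while Adam holds two (`exp_avg` and `exp_avg_sq`).

```python
import torch

# Count parameter-shaped state tensors each optimizer allocates.
# PyTorch creates optimizer state lazily, on the first step().
state_buffers = {}
for name in ("RMSprop", "Adam"):
    model = torch.nn.Linear(4, 4)
    p = next(model.parameters())
    # Same dynamic lookup the trainer uses: resolve the class by name.
    opt = getattr(torch.optim, name)(model.parameters(), lr=5e-7)
    p.grad = torch.zeros_like(p)
    opt.step()
    state_buffers[name] = sorted(
        k for k, v in opt.state[p].items()
        if torch.is_tensor(v) and v.shape == p.shape
    )

print(state_buffers)  # RMSprop: ['square_avg']; Adam: ['exp_avg', 'exp_avg_sq']
```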