Principle: Bitsandbytes Paged Optimizer
Metadata
| Field | Value |
|---|---|
| Sources | Paper: 8-bit Optimizers via Block-wise Quantization, Repo: bitsandbytes |
| Domains | Optimization, Memory_Management |
| Last updated | 2026-02-07 14:00 GMT |
Overview
An optimizer memory management strategy that uses CUDA unified memory (managed memory) to automatically page optimizer states between GPU and CPU on memory pressure.
Description
Paged optimizers store their state tensors in CUDA managed memory. When GPU memory is exhausted, CUDA automatically pages data to CPU memory. This avoids hard out-of-memory failures during training by gracefully spilling optimizer states to system RAM, at the cost of slower access while those states reside on the CPU.
Combined with 8-bit quantization, paged optimizers provide two complementary memory benefits:
- A 75% reduction in optimizer-state memory from 8-bit quantization (1 byte per state value instead of 4 at 32-bit precision)
- Overflow protection from paging -- states that exceed GPU VRAM are transparently migrated to CPU RAM
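To make the 75% figure concrete, here is a small self-contained sketch (the 7B parameter count is an illustrative assumption; the small per-block quantization-constant overhead is ignored) estimating Adam optimizer-state memory at 32-bit versus 8-bit precision:

```python
# Adam keeps two state tensors per parameter (exp_avg and exp_avg_sq).
ADAM_STATES_PER_PARAM = 2

def optimizer_state_bytes(num_params: int, bytes_per_value: int) -> int:
    """Total bytes of Adam state for num_params parameters."""
    return num_params * ADAM_STATES_PER_PARAM * bytes_per_value

num_params = 7_000_000_000  # illustrative 7B-parameter model

fp32_bytes = optimizer_state_bytes(num_params, 4)  # 32-bit states
int8_bytes = optimizer_state_bytes(num_params, 1)  # 8-bit quantized states

print(f"32-bit states: {fp32_bytes / 2**30:.1f} GiB")  # ~52.2 GiB
print(f" 8-bit states: {int8_bytes / 2**30:.1f} GiB")  # ~13.0 GiB
print(f"reduction: {1 - int8_bytes / fp32_bytes:.0%}")  # 75%
```

With paging on top, even the remaining ~13 GiB need not fit in VRAM all at once.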
PagedAdamW8bit is commonly used for FSDP QLoRA training, where model parameters are sharded across multiple GPUs and optimizer states can still be large.
Usage
Use in training scenarios where GPU memory is tight, especially distributed training with FSDP, where optimizer states can be large. The `is_paged=True` flag enables paging in the optimizer base class.
```python
import bitsandbytes as bnb

# PagedAdamW8bit sets is_paged=True internally, so its state tensors
# are allocated in CUDA managed (paged) memory.
optimizer = bnb.optim.PagedAdamW8bit(
    model.parameters(),
    lr=2e-4,
    weight_decay=0.01,
)
```
Theoretical Basis
CUDA Managed Memory (`cudaMallocManaged`) creates a unified address space visible to both CPU and GPU. The CUDA driver handles page migration transparently:
- Pages accessed by the GPU are migrated to VRAM
- Pages not recently used can be evicted to system RAM when VRAM is under pressure
- The programmer sees a single pointer that works on both CPU and GPU
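As a conceptual illustration only (a toy model, not the actual CUDA driver), the migrate-on-access and evict-on-pressure behavior above can be sketched as an LRU cache spilling from a fixed-size "VRAM" into unbounded "RAM":

```python
from collections import OrderedDict

class ToyManagedMemory:
    """Toy model of unified-memory paging: a fixed-capacity 'VRAM'
    backed by unbounded 'RAM', with LRU eviction under pressure."""

    def __init__(self, vram_pages: int):
        self.vram_pages = vram_pages
        self.vram = OrderedDict()  # page_id -> data, ordered by recency
        self.ram = {}              # pages evicted to system memory

    def gpu_access(self, page_id, data=None):
        # A GPU access migrates the page into VRAM (like a page fault).
        if page_id in self.vram:
            self.vram.move_to_end(page_id)
        else:
            self.vram[page_id] = self.ram.pop(page_id, data)
            # Under pressure, evict the least recently used page to RAM.
            while len(self.vram) > self.vram_pages:
                old_id, old_data = self.vram.popitem(last=False)
                self.ram[old_id] = old_data
        return self.vram[page_id]

mem = ToyManagedMemory(vram_pages=2)
mem.gpu_access("m", data="exp_avg")     # faults into VRAM
mem.gpu_access("v", data="exp_avg_sq")  # faults into VRAM
mem.gpu_access("w", data="weights")     # evicts "m" to RAM
```

Accessing `"m"` again would fault it back into VRAM and evict the new least-recently-used page, analogous to how a single managed pointer remains usable whether its pages currently reside on the CPU or the GPU.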
The optimizer step is synchronous: all paged tensor operations complete before the step returns. This ensures correctness -- no parameter update is lost due to incomplete page migration.
The paging mechanism is particularly effective for optimizer states because:
- Optimizer states are only accessed during the optimizer step (not during forward/backward)
- States for different parameter groups can be paged independently
- The access pattern is predictable, allowing the CUDA driver to prefetch effectively