Principle: Bitsandbytes Paged Optimizer
Metadata
| Field | Value |
|---|---|
| Sources | Paper: 8-bit Optimizers via Block-wise Quantization, Repo: bitsandbytes |
| Domains | Optimization, Memory_Management |
| Last updated | 2026-02-07 14:00 GMT |
Overview
An optimizer memory management strategy that uses CUDA unified memory (managed memory) to automatically page optimizer states between GPU and CPU on memory pressure.
Description
Paged optimizers store their state tensors in CUDA managed memory. When GPU memory is exhausted, CUDA automatically pages data to CPU memory. This avoids hard out-of-memory failures during training by gracefully spilling optimizer states to system RAM, at the cost of slower access while those states reside on the CPU.
Combined with 8-bit quantization, paged optimizers provide two complementary memory benefits:
- A 75% reduction in optimizer-state memory from 8-bit quantization (1 byte per state value instead of 4 at 32-bit precision)
- Overflow protection from paging -- states that exceed GPU VRAM are transparently migrated to CPU RAM
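To make the 75% figure concrete, here is a small self-contained sketch (the 7B parameter count is an illustrative assumption; the small per-block quantization-constant overhead is ignored) estimating Adam optimizer-state memory at 32-bit versus 8-bit precision:

```python
# Adam keeps two state tensors per parameter (exp_avg and exp_avg_sq).
ADAM_STATES_PER_PARAM = 2

def optimizer_state_bytes(num_params: int, bytes_per_value: int) -> int:
    """Total bytes of Adam state for num_params parameters."""
    return num_params * ADAM_STATES_PER_PARAM * bytes_per_value

num_params = 7_000_000_000  # illustrative 7B-parameter model

fp32_bytes = optimizer_state_bytes(num_params, 4)  # 32-bit states
int8_bytes = optimizer_state_bytes(num_params, 1)  # 8-bit quantized states

print(f"32-bit states: {fp32_bytes / 2**30:.1f} GiB")  # ~52.2 GiB
print(f" 8-bit states: {int8_bytes / 2**30:.1f} GiB")  # ~13.0 GiB
print(f"reduction: {1 - int8_bytes / fp32_bytes:.0%}")  # 75%
```

With paging on top, even the remaining ~13 GiB need not fit in VRAM all at once.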
PagedAdamW8bit is commonly used for FSDP QLoRA training, where model parameters are sharded across multiple GPUs and optimizer states can still be large.
Usage
Use in training scenarios where GPU memory is tight, especially distributed training with FSDP, where optimizer states can be large. The `is_paged=True` flag enables paging in the optimizer base class.
```python
import bitsandbytes as bnb

# PagedAdamW8bit sets is_paged=True internally, so its state tensors
# are allocated in CUDA managed (paged) memory.
optimizer = bnb.optim.PagedAdamW8bit(
    model.parameters(),
    lr=2e-4,
    weight_decay=0.01,
)
```
Theoretical Basis
CUDA Managed Memory (`cudaMallocManaged`) creates a unified address space visible to both CPU and GPU. The CUDA driver handles page migration transparently:
- Pages accessed by the GPU are migrated to VRAM
- Pages not recently used can be evicted to system RAM when VRAM is under pressure
- The programmer sees a single pointer that works on both CPU and GPU
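As a conceptual illustration only (a toy model, not the actual CUDA driver), the migrate-on-access and evict-on-pressure behavior above can be sketched as an LRU cache spilling from a fixed-size "VRAM" into unbounded "RAM":

```python
from collections import OrderedDict

class ToyManagedMemory:
    """Toy model of unified-memory paging: a fixed-capacity 'VRAM'
    backed by unbounded 'RAM', with LRU eviction under pressure."""

    def __init__(self, vram_pages: int):
        self.vram_pages = vram_pages
        self.vram = OrderedDict()  # page_id -> data, ordered by recency
        self.ram = {}              # pages evicted to system memory

    def gpu_access(self, page_id, data=None):
        # A GPU access migrates the page into VRAM (like a page fault).
        if page_id in self.vram:
            self.vram.move_to_end(page_id)
        else:
            self.vram[page_id] = self.ram.pop(page_id, data)
            # Under pressure, evict the least recently used page to RAM.
            while len(self.vram) > self.vram_pages:
                old_id, old_data = self.vram.popitem(last=False)
                self.ram[old_id] = old_data
        return self.vram[page_id]

mem = ToyManagedMemory(vram_pages=2)
mem.gpu_access("m", data="exp_avg")     # faults into VRAM
mem.gpu_access("v", data="exp_avg_sq")  # faults into VRAM
mem.gpu_access("w", data="weights")     # evicts "m" to RAM
```

Accessing `"m"` again would fault it back into VRAM and evict the new least-recently-used page, analogous to how a single managed pointer remains usable whether its pages currently reside on the CPU or the GPU.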
The optimizer step is synchronous: all paged tensor operations complete before the step returns. This ensures correctness -- no parameter update is lost due to incomplete page migration.
The paging mechanism is particularly effective for optimizer states because:
- Optimizer states are only accessed during the optimizer step (not during forward/backward)
- States for different parameter groups can be paged independently
- The access pattern is predictable, allowing the CUDA driver to prefetch effectively