
Implementation:Bitsandbytes foundation Bitsandbytes PagedAdamW8bit

From Leeroopedia


Metadata

Field Value
Sources Repo: bitsandbytes, Paper: 8-bit Optimizers
Domains Optimization, Memory_Management
Last updated 2026-02-07 14:00 GMT

Overview

A concrete tool from the bitsandbytes library for memory-efficient AdamW optimization, using 8-bit optimizer states and CUDA paged memory.

Description

PagedAdamW8bit inherits from Optimizer2State with is_paged=True hardcoded. It uses the "adam" optimizer kernel with decoupled weight decay (AdamW formulation). States are 8-bit quantized and stored in CUDA managed memory for automatic paging.

Key implementation details:

  • Calls super().__init__("adam", ...) with is_paged=True (L327)
  • The optim_bits parameter is hardcoded to 8 internally (L321), regardless of the constructor signature default
  • amsgrad=True is explicitly unsupported and raises ValueError (L306-307); see the snippet after this list
  • Commonly used in FSDP QLoRA training pipelines
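
As a quick illustration of the amsgrad restriction noted above, constructing the optimizer with amsgrad=True fails immediately. A minimal sketch; the toy nn.Linear module is hypothetical and only supplies parameters:

import torch.nn as nn
import bitsandbytes as bnb

layer = nn.Linear(8, 8)  # toy module, just to have parameters to pass

# amsgrad=True is rejected before any optimizer state is allocated:
try:
    bnb.optim.PagedAdamW8bit(layer.parameters(), amsgrad=True)
except ValueError as err:
    print(err)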

Code Reference

Field Value
Source bitsandbytes repo
File bitsandbytes/optim/adamw.py
Lines L261-327 (PagedAdamW8bit)
Also bitsandbytes/optim/optimizer.py (Optimizer2State L384-625)

Signature

class PagedAdamW8bit(Optimizer2State):
    def __init__(
        self,
        params,
        lr=1e-3,
        betas=(0.9, 0.999),
        eps=1e-8,
        weight_decay=1e-2,
        amsgrad=False,
        optim_bits=32,
        args=None,
        min_8bit_size=4096,
        percentile_clipping=100,
        block_wise=True,
    ):
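
The body is omitted above. Based on the line references in the Description, a hedged reconstruction of what the constructor does (verify against the installed bitsandbytes version; the exact error message is an assumption) looks like:

class PagedAdamW8bit(Optimizer2State):
    def __init__(self, params, lr=1e-3, betas=(0.9, 0.999), eps=1e-8,
                 weight_decay=1e-2, amsgrad=False, optim_bits=32, args=None,
                 min_8bit_size=4096, percentile_clipping=100, block_wise=True):
        # amsgrad is checked and rejected up front (L306-307);
        # exact message may differ by version
        if amsgrad:
            raise ValueError("AdamW with amsgrad=True is not supported.")
        super().__init__(
            "adam",               # Adam kernel with decoupled weight decay
            params, lr, betas, eps, weight_decay,
            8,                    # optim_bits is overridden to 8 here (L321)
            args, min_8bit_size, percentile_clipping, block_wise,
            is_paged=True,        # paging is hardcoded (L327)
        )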

Import

import bitsandbytes as bnb

optimizer = bnb.optim.PagedAdamW8bit(params, lr=2e-4)

I/O Contract

Inputs

Parameter Type Required Default Description
params iterable Yes -- Model parameters to optimize
lr float No 1e-3 Learning rate
betas tuple(float, float) No (0.9, 0.999) Decay rates for first and second moment estimates
eps float No 1e-8 Epsilon for numerical stability
weight_decay float No 1e-2 Decoupled weight decay coefficient
amsgrad bool No False Must be False -- raises ValueError if True
optim_bits int No 32 Ignored -- the constructor internally overrides it to 8
min_8bit_size int No 4096 Minimum tensor size for 8-bit optimization
percentile_clipping int No 100 Gradient clipping percentile (100 = no clipping)
block_wise bool No True Enable block-wise quantization for stability

Note: is_paged is always True (hardcoded in the super().__init__ call).
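
To make the knobs concrete, a minimal construction sketch (the nn.Linear module is hypothetical, sized so its weight comfortably crosses the min_8bit_size threshold):

import torch.nn as nn
import bitsandbytes as bnb

model = nn.Linear(4096, 4096)  # weight has 4096*4096 elements, >= min_8bit_size

optimizer = bnb.optim.PagedAdamW8bit(
    model.parameters(),
    lr=1e-3,
    betas=(0.9, 0.999),
    eps=1e-8,
    weight_decay=1e-2,
    min_8bit_size=4096,       # smaller tensors keep 32-bit states
    percentile_clipping=100,  # 100 = no percentile gradient clipping
    block_wise=True,          # block-wise quantization for stability
)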

Outputs

In-place parameter updates with paged 8-bit optimizer states. The optimizer maintains two 8-bit state tensors per parameter (first moment and second moment) in CUDA managed memory.
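
For scale, a back-of-envelope comparison of optimizer-state memory for the two moments per parameter (ignoring the small overhead of block-wise quantization constants):

n_params = 7e9                        # e.g. a 7B-parameter model
fp32_gib = n_params * 8 / 2**30       # two fp32 moments: 8 bytes/param
int8_gib = n_params * 2 / 2**30       # two uint8 moments: 2 bytes/param
print(f"fp32 AdamW states: {fp32_gib:.1f} GiB")  # ~52.2 GiB
print(f"8-bit states:      {int8_gib:.1f} GiB")  # ~13.0 GiB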

Usage Examples

Using PagedAdamW8bit in an FSDP training script:

import bitsandbytes as bnb
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# Wrap model in FSDP (after loading with 4-bit quantization)
model = FSDP(model, ...)

# Use paged 8-bit optimizer for memory efficiency
optimizer = bnb.optim.PagedAdamW8bit(
    model.parameters(),
    lr=2e-4,
    betas=(0.9, 0.999),
    eps=1e-8,
    weight_decay=0.01,
)

# Training loop
for batch in dataloader:
    loss = model(**batch).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
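
To verify that the moment buffers really are stored in 8-bit, the optimizer state can be inspected after one step. A sketch assuming a CUDA device; the "state1"/"state2" key names follow the Optimizer2State convention and may differ across bitsandbytes versions:

import torch
import bitsandbytes as bnb

p = torch.nn.Parameter(torch.randn(8192, device="cuda"))  # numel >= min_8bit_size
opt = bnb.optim.PagedAdamW8bit([p], lr=1e-3)

p.grad = torch.randn_like(p)
opt.step()

state = opt.state[p]
print(state["state1"].dtype)  # expected: torch.uint8 (quantized first moment)
print(state["state2"].dtype)  # expected: torch.uint8 (quantized second moment)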
