Implementation: Bitsandbytes PagedAdamW8bit
Metadata
| Field | Value |
|---|---|
| Sources | Repo: bitsandbytes, Paper: 8-bit Optimizers |
| Domains | Optimization, Memory_Management |
| Last updated | 2026-02-07 14:00 GMT |
Overview
A concrete tool for memory-efficient AdamW optimization, provided by the bitsandbytes library, combining 8-bit optimizer states with CUDA paged memory.
Description
PagedAdamW8bit inherits from Optimizer2State with is_paged=True hardcoded. It uses the "adam" optimizer kernel with decoupled weight decay (AdamW formulation). States are 8-bit quantized and stored in CUDA managed memory for automatic paging.
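The block-wise quantization idea behind the 8-bit states can be sketched in plain NumPy. This is illustrative only: bitsandbytes uses a non-linear dynamic quantization map and fused CUDA kernels, and the block size of 256 here is an assumption, not read from the library source.

```python
import numpy as np

def quantize_blockwise(x, block_size=256):
    """Quantize a 1-D float array to int8 per block (linear absmax scaling).

    Sketch only: real bitsandbytes uses a dynamic (non-linear) quantization
    map rather than plain linear scaling.
    """
    pad = (-len(x)) % block_size
    xp = np.concatenate([x, np.zeros(pad, dtype=x.dtype)])
    blocks = xp.reshape(-1, block_size)
    absmax = np.abs(blocks).max(axis=1, keepdims=True)  # one scale per block
    absmax[absmax == 0] = 1.0                           # avoid divide-by-zero
    q = np.round(blocks / absmax * 127).astype(np.int8)
    return q, absmax, len(x)

def dequantize_blockwise(q, absmax, n):
    """Recover float32 values; error is bounded by absmax/254 per block."""
    return (q.astype(np.float32) / 127 * absmax).reshape(-1)[:n]

rng = np.random.default_rng(0)
x = (rng.standard_normal(1000) * 1e-3).astype(np.float32)  # state-like values
q, absmax, n = quantize_blockwise(x)
x_hat = dequantize_blockwise(q, absmax, n)
```

Because each block is scaled by its own absmax, small-magnitude blocks (common in optimizer states) keep good resolution even at 8 bits, which is why `block_wise=True` matters for stability.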
Key implementation details:
- Calls `super().__init__("adam", ...)` with `is_paged=True` (L327)
- The `optim_bits` parameter is hardcoded to 8 internally (L321), regardless of the constructor signature default
- `amsgrad=True` is explicitly unsupported and raises `ValueError` (L306-307)
- Commonly used in FSDP QLoRA training pipelines
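For reference, the update rule the "adam" kernel applies with decoupled weight decay (the AdamW formulation) can be written out in plain float32 Python. This is a sketch of the math, ignoring quantization and paging, not a transcription of the CUDA kernel:

```python
import math

def adamw_step(p, grad, m, v, step, lr=1e-3, betas=(0.9, 0.999),
               eps=1e-8, weight_decay=1e-2):
    """One scalar AdamW update with decoupled weight decay."""
    m = betas[0] * m + (1 - betas[0]) * grad          # first moment
    v = betas[1] * v + (1 - betas[1]) * grad * grad   # second moment
    m_hat = m / (1 - betas[0] ** step)                # bias correction
    v_hat = v / (1 - betas[1] ** step)
    p = p - lr * weight_decay * p                     # decay decoupled from gradient
    p = p - lr * m_hat / (math.sqrt(v_hat) + eps)     # Adam step
    return p, m, v

p, m, v = 1.0, 0.0, 0.0
p, m, v = adamw_step(p, grad=0.5, m=m, v=v, step=1)
```

The key property is that weight decay multiplies the parameter directly rather than being folded into the gradient, which is what distinguishes AdamW from Adam with L2 regularization.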
Code Reference
| Field | Value |
|---|---|
| Source | bitsandbytes repo |
| File | bitsandbytes/optim/adamw.py |
| Lines | L261-327 (PagedAdamW8bit) |
| Also | bitsandbytes/optim/optimizer.py (Optimizer2State L384-625) |
Signature
```python
class PagedAdamW8bit(Optimizer2State):
    def __init__(
        self,
        params,
        lr=1e-3,
        betas=(0.9, 0.999),
        eps=1e-8,
        weight_decay=1e-2,
        amsgrad=False,
        optim_bits=32,
        args=None,
        min_8bit_size=4096,
        percentile_clipping=100,
        block_wise=True,
    ):
```
Import
```python
import bitsandbytes as bnb

optimizer = bnb.optim.PagedAdamW8bit(params, lr=2e-4)
```
I/O Contract
Inputs
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| params | iterable | Yes | -- | Model parameters to optimize |
| lr | float | No | 1e-3 | Learning rate |
| betas | tuple(float, float) | No | (0.9, 0.999) | Decay rates for first and second moment estimates |
| eps | float | No | 1e-8 | Epsilon for numerical stability |
| weight_decay | float | No | 1e-2 | Decoupled weight decay coefficient |
| amsgrad | bool | No | False | Must be False -- raises ValueError if True |
| optim_bits | int | No | 32 | Ignored -- internally hardcoded to 8 |
| min_8bit_size | int | No | 4096 | Minimum tensor size for 8-bit optimization |
| percentile_clipping | int | No | 100 | Gradient clipping percentile (100 = no clipping) |
| block_wise | bool | No | True | Enable block-wise quantization for stability |
Note: `is_paged` is always True (hardcoded in the `super().__init__` call).
Outputs
In-place parameter updates with paged 8-bit optimizer states. The optimizer maintains two 8-bit state tensors per parameter (first moment and second moment) in CUDA managed memory.
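The memory saving is easy to quantify: standard AdamW keeps two float32 states (8 bytes per parameter), while the 8-bit variant keeps two uint8 states (2 bytes per parameter) plus per-block quantization constants. A rough estimate, assuming one float32 absmax per block of 256 values (the block size is an assumption for the overhead term):

```python
def optimizer_state_bytes(n_params, bits=8, block_size=256):
    """Approximate AdamW state memory: two moment tensors per parameter,
    plus one float32 absmax per block for the 8-bit variant."""
    per_param = 2 * bits / 8                                   # two states
    overhead = 2 * (n_params / block_size) * 4 if bits == 8 else 0
    return n_params * per_param + overhead

n = 7_000_000_000  # e.g. a 7B-parameter model
fp32_gb = optimizer_state_bytes(n, bits=32) / 2**30  # ~52 GiB
int8_gb = optimizer_state_bytes(n, bits=8) / 2**30   # ~13 GiB
```

The roughly 4x reduction in optimizer-state memory, combined with paging of those states through CUDA managed memory, is what makes this optimizer a common choice for FSDP QLoRA pipelines.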
Usage Examples
Using PagedAdamW8bit in an FSDP training script:
```python
import bitsandbytes as bnb
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# Wrap model in FSDP (after loading with 4-bit quantization)
model = FSDP(model, ...)

# Use the paged 8-bit optimizer for memory efficiency
optimizer = bnb.optim.PagedAdamW8bit(
    model.parameters(),
    lr=2e-4,
    betas=(0.9, 0.999),
    eps=1e-8,
    weight_decay=0.01,
)

# Training loop
for batch in dataloader:
    loss = model(**batch).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```