Principle: Bitsandbytes 8-bit Adam Optimizer
| Sources | Paper: 8-bit Optimizers via Block-wise Quantization, Paper: Adam: A Method for Stochastic Optimization, Repo: bitsandbytes |
|---|---|
| Domains | Optimization |
| Last updated | 2026-02-07 14:00 GMT |
Overview
An Adam optimizer variant that maintains first and second moment estimates in 8-bit quantized format for approximately 75% memory savings on optimizer states, while preserving the convergence properties of standard Adam.
Description
Adam8bit implements the standard Adam algorithm but stores the momentum (m_t) and variance (v_t) states in 8-bit rather than 32-bit precision. The optimization procedure at each step is:
- Dequantize: Convert 8-bit states back to FP32 using their per-block scaling factors and quantization codebooks.
- Adam update: Apply the standard Adam update rule in FP32 precision:
m_t = beta1 * m_{t-1} + (1 - beta1) * g_t      (first moment / momentum)
v_t = beta2 * v_{t-1} + (1 - beta2) * g_t^2    (second moment / variance)
- Re-quantize: Convert the updated FP32 states back to 8-bit using block-wise quantization.
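The dequantize → update → re-quantize cycle can be sketched in plain Python. This is an illustration only, not the bitsandbytes kernel: the real library uses a non-uniform dynamic codebook in a fused GPU kernel, while this sketch uses a uniform signed codebook and Python lists to show the per-block absmax mechanics.

```python
# Illustrative sketch of block-wise 8-bit quantization (not the real kernel).
BLOCK = 256  # bitsandbytes quantizes optimizer states in blocks of 256 elements

def quantize_blockwise(values, block=BLOCK):
    """Return (codes, scales): one 8-bit code per value, one FP32 scale per block."""
    codes, scales = [], []
    for start in range(0, len(values), block):
        chunk = values[start:start + block]
        scale = max(abs(v) for v in chunk) or 1.0  # per-block absmax scaling factor
        scales.append(scale)
        # Map each value to the nearest of 256 uniform levels in [-1, 1].
        codes.extend(round(v / scale * 127) for v in chunk)
    return codes, scales

def dequantize_blockwise(codes, scales, block=BLOCK):
    """Invert the mapping above using each block's stored scale."""
    return [c / 127 * scales[i // block] for i, c in enumerate(codes)]

# An outlier in the last block does not hurt the precision of earlier blocks.
state = [0.001 * i for i in range(512)] + [5.0]
codes, scales = quantize_blockwise(state)
restored = dequantize_blockwise(codes, scales)
max_err = max(abs(a - b) for a, b in zip(state, restored))
```

Because each block carries its own absmax scale, the round-trip error stays proportional to the block's own magnitude rather than the tensor-wide maximum.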
Key behavioral details:
- The `min_8bit_size` parameter (default: 4096) controls the minimum tensor size for 8-bit optimization. Parameters with fewer elements than this threshold are optimized in full 32-bit precision. This is important because very small tensors (e.g., bias vectors, layer norm parameters) do not benefit from block-wise quantization and may suffer from quantization error.
- Percentile clipping (`percentile_clipping`, default: 100, meaning disabled) provides gradient clipping based on tracking the last 100 gradient norms. When enabled (e.g., set to 5), it clips gradients at the 5th percentile of recent gradient norms, improving training stability.
- Block-wise quantization (`block_wise=True` by default) independently quantizes blocks of 256 elements for optimizer states, ensuring outlier values in one block do not degrade precision across the entire tensor.
- Also available as `PagedAdam8bit`, which adds paged memory support for GPU-to-CPU offload under OOM conditions.
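The `min_8bit_size` fallback can be made concrete with a simple element-count check. The parameter names and shapes below are hypothetical, chosen only to show which side of the default 4096-element threshold typical tensors fall on:

```python
# Illustrative check of the min_8bit_size fallback (shapes are made up).
MIN_8BIT_SIZE = 4096  # bitsandbytes default

def numel(shape):
    """Total number of elements in a tensor of the given shape."""
    n = 1
    for dim in shape:
        n *= dim
    return n

params = {
    "linear.weight": (1024, 1024),   # 1,048,576 elements -> 8-bit states
    "linear.bias": (1024,),          # 1,024 elements -> kept in 32-bit
    "layernorm.weight": (768,),      # 768 elements -> kept in 32-bit
}

uses_8bit = {name: numel(shape) >= MIN_8BIT_SIZE for name, shape in params.items()}
```

Small tensors like biases contribute almost nothing to total optimizer memory anyway, so keeping them in 32-bit costs little while avoiding unnecessary quantization error.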
Usage
Adam8bit is a drop-in replacement for torch.optim.Adam when training memory is constrained:
```python
import bitsandbytes as bnb

# Replace: optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# With:
optimizer = bnb.optim.Adam8bit(model.parameters(), lr=1e-3)
```
The optimizer works with any PyTorch model. No changes to the training loop, loss computation, or gradient computation are required. The memory savings are most significant for large models where optimizer states dominate GPU memory usage.
Common use cases:
- Training large language models on limited GPU memory
- Fine-tuning pretrained models where optimizer state memory is the bottleneck
- Any scenario using Adam where ~75% reduction in optimizer state memory is beneficial
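A back-of-envelope estimate shows where the ~75% figure comes from: standard Adam keeps two FP32 states (8 bytes per parameter), while the 8-bit variant keeps two 8-bit states plus one FP32 absmax scale per 256-element block for each state. The function below is an illustrative approximation that ignores the small fixed-size codebooks:

```python
# Rough optimizer-state memory estimate (illustrative approximation only).
def adam_state_bytes(n_params, bits=32, block=256):
    """Bytes for Adam's two states (m_t, v_t), plus per-block scales if quantized."""
    per_state = n_params * bits // 8
    # One FP32 (4-byte) absmax scale per block, for each of the two states.
    overhead = 0 if bits == 32 else 2 * 4 * (n_params // block)
    return 2 * per_state + overhead

n = 7_000_000_000                     # e.g. a 7B-parameter model
fp32 = adam_state_bytes(n)            # standard Adam: ~56 GB of state
int8 = adam_state_bytes(n, bits=8)    # Adam8bit: ~14.2 GB of state
savings = 1 - int8 / fp32             # ~0.75
```

The per-block scale overhead is only 4/256 of a byte per parameter per state, which is why the savings land just under the ideal 75%.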
Theoretical Basis
The standard Adam update rule with bias correction:
```
# Moment estimates
m_t = beta1 * m_{t-1} + (1 - beta1) * g_t
v_t = beta2 * v_{t-1} + (1 - beta2) * g_t^2

# Bias correction
m_hat = m_t / (1 - beta1^t)
v_hat = v_t / (1 - beta2^t)

# Parameter update
theta_t = theta_{t-1} - lr * m_hat / (sqrt(v_hat) + epsilon)
```
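The update above can be written out as a single reference step in plain Python (scalar case, no quantization) to make the bias correction concrete:

```python
# Reference Adam step for a single scalar parameter (no quantization).
def adam_step(theta, m, v, g, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * g        # first moment (momentum)
    v = beta2 * v + (1 - beta2) * g * g    # second moment (variance)
    m_hat = m / (1 - beta1 ** t)           # bias correction (t is 1-indexed)
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (v_hat ** 0.5 + eps)
    return theta, m, v

theta, m, v = 1.0, 0.0, 0.0
theta, m, v = adam_step(theta, m, v, g=2.0, t=1)
# On the first step, m_hat = g and v_hat = g^2, so the update magnitude is ~lr.
```

The bias correction matters most early in training: at t=1 the raw moments are scaled down by (1 - beta), and dividing by (1 - beta^t) restores them to the gradient's scale.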
The 8-bit quantization of m_t and v_t introduces small rounding errors that are negligible in practice because:
- States change slowly between steps: The exponential moving averages with typical beta values (0.9 and 0.999) mean that states evolve gradually. The quantization error at any step is small relative to the state magnitude.
- Block-wise quantization limits error propagation: Each block of 256 elements is independently quantized, so an outlier in one block does not affect the precision available for other blocks.
- Dynamic quantization maps match state distributions: The non-uniform 256-level codebook is designed to have higher resolution where state values are most densely distributed, minimizing the average quantization error.
The signed dynamic map is used for momentum (which can be positive or negative), while the unsigned dynamic map is used for variance (which is always non-negative), further optimizing the quantization precision for each state type.
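A toy comparison illustrates why separate signed and unsigned maps help. With uniform codebooks standing in for the real dynamic maps, an unsigned map spends all 256 levels on [0, 1], so it resolves a nonnegative value about twice as finely as a signed map that covers [-1, 1] with the same 256 levels:

```python
# Illustrative only: uniform codebooks stand in for the non-uniform dynamic maps.
def make_uniform_codebook(n_levels, signed):
    """256 evenly spaced levels over [-1, 1] (signed) or [0, 1] (unsigned)."""
    lo = -1.0 if signed else 0.0
    return [lo + i * (1.0 - lo) / (n_levels - 1) for i in range(n_levels)]

def quantize(x, codebook):
    """Round x to the nearest codebook level."""
    return min(codebook, key=lambda c: abs(c - x))

signed_cb = make_uniform_codebook(256, signed=True)     # for momentum (can be < 0)
unsigned_cb = make_uniform_codebook(256, signed=False)  # for variance (always >= 0)

x = 0.34567  # a nonnegative value, as all variance entries are
err_signed = abs(quantize(x, signed_cb) - x)
err_unsigned = abs(quantize(x, unsigned_cb) - x)
```

The actual dynamic maps are non-uniform, concentrating levels near the values where optimizer states cluster, but the signed/unsigned split provides the same kind of benefit: no levels are wasted on a sign the state can never take.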