
Principle:Bitsandbytes foundation Bitsandbytes 8bit Adam Optimizer

From Leeroopedia


Sources Paper: 8-bit Optimizers via Block-wise Quantization, Paper: Adam: A Method for Stochastic Optimization, Repo: bitsandbytes
Domains Optimization
Last updated 2026-02-07 14:00 GMT

Overview

An Adam optimizer variant that maintains first and second moment estimates in 8-bit quantized format for approximately 75% memory savings on optimizer states, while preserving the convergence properties of standard Adam.

Description

Adam8bit implements the standard Adam algorithm but stores the momentum (m_t) and variance (v_t) states in 8-bit rather than 32-bit precision. Each optimization step proceeds as follows:

  1. Dequantize: Convert 8-bit states back to FP32 using their per-block scaling factors and quantization codebooks.
  2. Adam update: Apply the standard Adam update rule in FP32 precision:
    • m_t = beta1 * m_{t-1} + (1 - beta1) * g_t (first moment / momentum)
    • v_t = beta2 * v_{t-1} + (1 - beta2) * g_t^2 (second moment / variance)
  3. Re-quantize: Convert the updated FP32 states back to 8-bit using block-wise quantization.
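The three-step cycle above can be sketched in NumPy. This is a simplified illustration, not the bitsandbytes implementation: it uses linear per-block absmax quantization, whereas bitsandbytes uses a non-uniform dynamic codebook, and the function names are illustrative. It assumes the tensor size is a multiple of the block size.

```python
import numpy as np

def quantize_blockwise(x, block=256):
    """Linear 8-bit block-wise quantization (simplified: bitsandbytes
    uses a non-uniform dynamic codebook, not a linear scale)."""
    x = x.reshape(-1, block)
    absmax = np.abs(x).max(axis=1, keepdims=True) + 1e-12  # avoid div by 0
    q = np.round(x / absmax * 127).astype(np.int8)
    return q, absmax

def dequantize_blockwise(q, absmax):
    return q.astype(np.float32) / 127 * absmax

def adam8bit_step(theta, g, qm, m_absmax, qv, v_absmax, t,
                  lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    # 1. Dequantize 8-bit states back to FP32
    m = dequantize_blockwise(qm, m_absmax).reshape(theta.shape)
    v = dequantize_blockwise(qv, v_absmax).reshape(theta.shape)
    # 2. Standard Adam update in FP32
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    # 3. Re-quantize the updated states back to 8-bit
    qm, m_absmax = quantize_blockwise(m)
    qv, v_absmax = quantize_blockwise(v)
    return theta, qm, m_absmax, qv, v_absmax
```

Only the two state tensors live in 8-bit between steps; the update arithmetic itself runs in FP32, which is why convergence behavior tracks standard Adam.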

Key behavioral details:

  • The min_8bit_size parameter (default: 4096) controls the minimum tensor size for 8-bit optimization. Parameters with fewer elements than this threshold are optimized in full 32-bit precision. This is important because very small tensors (e.g., bias vectors, layer norm parameters) do not benefit from block-wise quantization and may suffer from quantization error.
  • Percentile clipping (percentile_clipping, default: 100, meaning disabled) provides adaptive gradient clipping based on a history of the last 100 gradient norms. When enabled (e.g., set to 5), gradients are clipped at the 5th percentile of that recent gradient-norm history, improving training stability.
  • Block-wise quantization (block_wise=True by default) independently quantizes blocks of 256 elements for optimizer states, ensuring outlier values in one block do not degrade precision across the entire tensor.
  • Also available as PagedAdam8bit, a paged variant that can offload optimizer state pages from GPU to CPU memory when the GPU runs out of memory.

Usage

Adam8bit is a drop-in replacement for torch.optim.Adam when training memory is constrained:

import bitsandbytes as bnb

# Replace: optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# With:
optimizer = bnb.optim.Adam8bit(model.parameters(), lr=1e-3)

The optimizer works with any PyTorch model. No changes to the training loop, loss computation, or gradient computation are required. The memory savings are most significant for large models where optimizer states dominate GPU memory usage.
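As a back-of-the-envelope check of the ~75% figure (ignoring the small per-block scaling-factor overhead), consider a hypothetical 7B-parameter model:

```python
def adam_state_bytes(n_params, bits_per_state, n_states=2):
    # Adam keeps two state tensors per parameter: momentum m and variance v
    return n_params * n_states * bits_per_state // 8

n = 7_000_000_000                      # hypothetical 7B-parameter model
fp32_states = adam_state_bytes(n, 32)  # standard Adam: 56 GB of states
int8_states = adam_state_bytes(n, 8)   # Adam8bit:       14 GB of states
savings = 1 - int8_states / fp32_states
print(fp32_states / 1e9, int8_states / 1e9, savings)  # → 56.0 14.0 0.75
```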

Common use cases:

  • Training large language models on limited GPU memory
  • Fine-tuning pretrained models where optimizer state memory is the bottleneck
  • Any scenario using Adam where ~75% reduction in optimizer state memory is beneficial

Theoretical Basis

The standard Adam update rule with bias correction:

# Moment estimates
m_t = beta1 * m_{t-1} + (1 - beta1) * g_t
v_t = beta2 * v_{t-1} + (1 - beta2) * g_t^2

# Bias correction
m_hat = m_t / (1 - beta1^t)
v_hat = v_t / (1 - beta2^t)

# Parameter update
theta_t = theta_{t-1} - lr * m_hat / (sqrt(v_hat) + epsilon)
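One worked consequence of the bias-correction terms above: with zero-initialized states, the very first update is approximately lr * sign(g), independent of the gradient's scale. A small NumPy check:

```python
import numpy as np

# With m_0 = v_0 = 0 and t = 1, bias correction gives
# m_hat = g and v_hat = g^2 exactly, so the update is
# lr * g / (|g| + eps) -- roughly a signed step of size lr.
g = np.array([0.5, -3.0, 1e-4])
lr, beta1, beta2, eps = 1e-3, 0.9, 0.999, 1e-8
m_hat = (1 - beta1) * g / (1 - beta1)       # = g
v_hat = (1 - beta2) * g**2 / (1 - beta2)    # = g^2
step = lr * m_hat / (np.sqrt(v_hat) + eps)
assert np.allclose(step, lr * np.sign(g), atol=1e-6)
```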

The 8-bit quantization of m_t and v_t introduces small rounding errors that are negligible in practice because:

  1. States change slowly between steps: The exponential moving averages with typical beta values (0.9 and 0.999) mean that states evolve gradually. The quantization error at any step is small relative to the state magnitude.
  2. Block-wise quantization limits error propagation: Each block of 256 elements is independently quantized, so an outlier in one block does not affect the precision available for other blocks.
  3. Dynamic quantization maps match state distributions: The non-uniform 256-level codebook is designed to have higher resolution where state values are most densely distributed, minimizing the average quantization error.
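Point 2 can be checked numerically: quantizing a tensor that contains a single outlier with one scale for the whole tensor wipes out most of the small values, while independent 256-element blocks confine the damage to the outlier's own block. Linear absmax quantization is used here as a simplification of the dynamic codebook:

```python
import numpy as np

def quant_error(x, block):
    """Mean abs error of linear 8-bit absmax quantization at a given
    block size (simplified stand-in for the dynamic codebook)."""
    xb = x.reshape(-1, block)
    absmax = np.abs(xb).max(axis=1, keepdims=True)
    q = np.round(xb / absmax * 127)
    return np.abs(q / 127 * absmax - xb).mean()

rng = np.random.default_rng(0)
x = rng.normal(scale=0.01, size=4096).astype(np.float32)
x[0] = 10.0  # a single outlier

whole = quant_error(x, 4096)   # one scale for the entire tensor
blocks = quant_error(x, 256)   # independent 256-element blocks
assert blocks < whole          # outlier degrades only its own block
```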

The signed dynamic map is used for momentum (which can be positive or negative), while the unsigned dynamic map is used for variance (which is always non-negative), further optimizing the quantization precision for each state type.
