
Principle: Bitsandbytes 8-bit Optimizer State Quantization

From Leeroopedia


Sources Paper: 8-bit Optimizers via Block-wise Quantization, Repo: bitsandbytes
Domains Optimization, Memory_Management
Last updated 2026-02-07 14:00 GMT

Overview

A memory optimization technique that quantizes optimizer state tensors (momentum, variance) from 32-bit to 8-bit using dynamic blockwise quantization. This enables training large models with significantly reduced memory overhead from optimizer states.

Description

In standard Adam optimization, two state tensors are maintained per parameter:

  • First moment (momentum): exponential moving average of gradients (m_t)
  • Second moment (variance): exponential moving average of squared gradients (v_t)

Each state tensor is stored in FP32, meaning optimizer states consume 2x the memory of the model parameters themselves. For a model with N parameters in FP32 (4 bytes each), the optimizer states require 8N bytes, compared to 4N bytes for the model weights.
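The arithmetic above can be checked with a quick sketch (the 7B parameter count is a hypothetical example, not taken from the source):

```python
# Memory footprint of FP32 Adam states vs. model weights.
N = 7_000_000_000              # hypothetical 7B-parameter model
weight_bytes = 4 * N           # FP32 weights: 4 bytes per parameter
adam_state_bytes = 2 * 4 * N   # two FP32 state tensors (m_t, v_t): 8N bytes
adam8bit_bytes = 2 * 1 * N     # two 8-bit state tensors: 2N bytes
                               # (per-block scaling factors add a small extra overhead)

print(f"weights:           {weight_bytes / 1e9:.0f} GB")
print(f"FP32 Adam states:  {adam_state_bytes / 1e9:.0f} GB")
print(f"8-bit Adam states: ~{adam8bit_bytes / 1e9:.0f} GB")
```

For the 7B example this works out to roughly 28 GB of weights, 56 GB of FP32 optimizer state, and about 14 GB after quantization, which is the ~75% reduction described below.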

8-bit optimizer state quantization reduces this overhead by approximately 75%: states are stored in quantized 8-bit format with per-block scaling factors. The quantization-dequantization cycle operates as follows:

  1. Dequantize: Before each optimizer step, 8-bit states are dequantized back to FP32 for computation.
  2. Update: The standard optimizer update rule (e.g., Adam) is applied in FP32 precision.
  3. Re-quantize: Updated states are quantized back to 8-bit for storage.

The key insight enabling this approach is that optimizer states are smooth -- they change slowly between consecutive training steps. This temporal smoothness means the quantization error introduced at each step is small relative to the state values, and does not accumulate destructively over training.

Block-wise quantization (typically with blocks of 4096 elements for optimizer states) ensures that local outlier values do not degrade quantization quality across the entire tensor. Each block is independently scaled by its own absolute maximum value, confining the impact of outliers to their local block.

The technique also employs dynamic quantization maps -- non-uniform mappings of 256 quantization levels that are distributed to better represent the actual distribution of optimizer state values, rather than using uniform spacing.
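The effect of a non-uniform map can be illustrated with a toy codebook whose levels are denser near zero, where most optimizer state values concentrate. Note this is only an illustration: the actual bitsandbytes dynamic map is constructed differently (via a dynamic exponent/fraction split), and the codebook below is invented for this sketch.

```python
import numpy as np

# Toy non-uniform signed codebook: squaring a uniform grid concentrates
# levels near zero (NOT the real bitsandbytes dynamic map).
grid = np.linspace(-1, 1, 256)
levels = np.sign(grid) * grid ** 2

def quantize_to_codebook(x, codebook):
    """Map each value to the index of its nearest codebook level."""
    idx = np.abs(x[:, None] - codebook[None, :]).argmin(axis=1)
    return idx.astype(np.uint8)

x = np.array([-0.9, -0.05, 0.0, 0.02, 0.7])
idx = quantize_to_codebook(x, levels)   # 8-bit indices, one byte per value
dequant = levels[idx]                   # decode back to approximate values
```

Because the levels cluster near zero, small values such as 0.02 are recovered with far less error than a uniform 256-level grid would allow.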

Usage

This principle applies when:

  • Training large models where optimizer state memory is the bottleneck (common for billion-parameter models)
  • Using Adam-family optimizers that maintain two state tensors per parameter
  • A drop-in replacement for standard optimizers is desired, with no changes to the training loop
  • The min_8bit_size parameter controls which tensors use 8-bit quantization; tensors smaller than this threshold (default 4096 elements) remain in 32-bit precision
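As a drop-in replacement, the bitsandbytes optimizer is swapped in where torch.optim.Adam would normally be constructed. A usage sketch (requires a CUDA-capable setup; the layer sizes and learning rate are illustrative):

```python
import torch
import bitsandbytes as bnb

model = torch.nn.Linear(4096, 4096).cuda()

# Drop-in replacement for torch.optim.Adam: states for tensors with at
# least `min_8bit_size` elements are stored in 8-bit blockwise-quantized form.
optimizer = bnb.optim.Adam8bit(
    model.parameters(),
    lr=1e-3,
    min_8bit_size=4096,   # tensors smaller than this keep FP32 states
)

loss = model(torch.randn(8, 4096, device="cuda")).pow(2).mean()
loss.backward()
optimizer.step()          # dequantize -> FP32 update -> re-quantize
optimizer.zero_grad()
```

The rest of the training loop is unchanged; only the optimizer construction differs.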

Theoretical Basis

The quantization system consists of several components:

Dynamic 8-bit quantization map: A codebook of 256 levels distributed non-uniformly to better represent the distribution of optimizer states. The map is created via create_dynamic_map(signed=True) for momentum (which can be negative) and create_dynamic_map(signed=False) for variance (which is always non-negative).

Block-wise quantization: The tensor is divided into blocks of B elements (default B=4096 for optimizer states). Each block B_i is quantized independently:

absmax_i = max(|B_i|)
normalized_i = B_i / absmax_i       # values in [-1, 1]
quantized_i = nearest_in_codebook(normalized_i)  # map to 256 levels
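The block-wise scheme above can be sketched in NumPy. For brevity this sketch uses 256 uniform signed levels rather than the dynamic map, so the codebook step collapses to a round; the per-block absmax scaling is the part being illustrated.

```python
import numpy as np

BLOCK = 4096  # default block size for optimizer states

def blockwise_quantize(x, block=BLOCK):
    """Quantize a flat FP32 array block by block with per-block absmax scaling.

    Uses uniform signed 8-bit levels for illustration; bitsandbytes uses a
    non-uniform dynamic map instead.
    """
    pad = (-x.size) % block
    padded = np.pad(x, (0, pad)).reshape(-1, block)
    absmax = np.abs(padded).max(axis=1, keepdims=True)  # per-block scale
    absmax[absmax == 0] = 1.0                           # avoid division by zero
    normalized = padded / absmax                        # values in [-1, 1]
    q = np.round(normalized * 127).astype(np.int8)      # nearest 8-bit level
    return q, absmax

def blockwise_dequantize(q, absmax, size):
    return (q.astype(np.float32) / 127 * absmax).reshape(-1)[:size]

x = np.random.default_rng(0).standard_normal(10_000).astype(np.float32)
q, absmax = blockwise_quantize(x)
x_hat = blockwise_dequantize(q, absmax, x.size)
```

Because each block carries its own absmax, an outlier in one block leaves the quantization resolution of every other block untouched.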

Quantization-dequantization cycle per step:

# Before update
state_fp32 = dequantize(state_8bit, absmax, codebook)

# Standard optimizer update in FP32
optimizer_update(state_fp32, grad)

# After update
state_8bit, absmax = quantize(state_fp32, codebook)

The per-block scaling ensures that a single large value in one region of the tensor does not reduce the precision available for representing the many smaller values elsewhere.
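The temporal-smoothness argument can be checked with a toy experiment: run the dequantize-update-requantize cycle on a momentum EMA for many steps and compare against an FP32 baseline. The single-block uniform quantizer here is a stand-in for the real scheme; all names are illustrative.

```python
import numpy as np

def quantize(x):
    """Round-trip x through signed 8-bit absmax quantization (one block)."""
    absmax = max(np.abs(x).max(), 1e-12)
    return np.round(x / absmax * 127).astype(np.int8), absmax

def dequantize(q, absmax):
    return q.astype(np.float32) / 127 * absmax

rng = np.random.default_rng(0)
beta1 = 0.9
m_fp32 = np.zeros(1024, dtype=np.float32)   # FP32 reference momentum
q, absmax = quantize(m_fp32)                # 8-bit stored momentum

for _ in range(200):
    grad = rng.standard_normal(1024).astype(np.float32)
    # FP32 reference update
    m_fp32 = beta1 * m_fp32 + (1 - beta1) * grad
    # 8-bit cycle: dequantize -> update in FP32 -> re-quantize
    m = dequantize(q, absmax)
    m = beta1 * m + (1 - beta1) * grad
    q, absmax = quantize(m)

drift = np.abs(dequantize(q, absmax) - m_fp32).max()
```

After 200 steps the quantized momentum stays close to the FP32 baseline: the per-step quantization error is damped by the EMA decay each step rather than compounding.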
