Principle: Bitsandbytes Blockwise Quantization
| Sources | Paper: 8-bit Optimizers via Block-wise Quantization, Paper: 8-Bit Approximations for Parallelism in Deep Learning, Repo: bitsandbytes |
|---|---|
| Domains | Quantization |
| Last updated | 2026-02-07 14:00 GMT |
Overview
A quantization strategy that divides tensors into fixed-size blocks, each independently quantized with its own scaling factor, to reduce quantization error caused by outlier values. This is the fundamental quantization building block used across all bitsandbytes quantization features.
Description
The outlier problem with global quantization: In naive quantization, a single scaling factor is computed for the entire tensor based on its global maximum absolute value. If the tensor contains outlier values (common in neural network weights and optimizer states), the scaling factor is dominated by these outliers, causing the vast majority of values to be represented with very few quantization levels. This leads to high quantization error for the typical (non-outlier) values.
Block-wise quantization solves this by dividing the tensor into contiguous blocks of B elements. Each block is independently quantized with its own absmax scaling factor:
- Partition the tensor into blocks of B elements.
- For each block, compute the absolute maximum value as the scaling factor.
- Normalize each block by its scaling factor, mapping values to the range [-1, 1].
- Map each normalized value to the nearest point in a quantization codebook.
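The four steps above can be sketched in a few lines of numpy. This is an illustrative toy implementation, not the fused CUDA kernel bitsandbytes actually runs; the function names are invented for this sketch:

```python
import numpy as np

def quantize_blockwise_sketch(x, codebook, blocksize=4096):
    """Toy block-wise quantizer following the four steps above."""
    codebook = np.asarray(codebook, dtype=np.float64)
    pad = (-x.size) % blocksize
    blocks = np.pad(x.astype(np.float64), (0, pad)).reshape(-1, blocksize)
    absmax = np.abs(blocks).max(axis=1, keepdims=True)   # step 2: per-block scale
    absmax[absmax == 0] = 1.0                            # guard all-zero blocks
    normalized = blocks / absmax                         # step 3: values in [-1, 1]
    idx = np.abs(normalized[..., None] - codebook).argmin(-1)  # step 4: nearest level
    return idx.astype(np.uint8), absmax.ravel()

def dequantize_blockwise_sketch(idx, absmax, codebook, n):
    """Inverse: look up codebook levels and rescale by each block's absmax."""
    codebook = np.asarray(codebook, dtype=np.float64)
    return (codebook[idx] * absmax[:, None]).ravel()[:n]
```

A round trip with a uniform 8-bit codebook recovers each value to within half a codebook step times its block's absmax.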
This dramatically reduces quantization error because local outliers only affect their block, not the entire tensor. A single large value in block k has no impact on the quantization precision available for values in other blocks.
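This effect is easy to demonstrate numerically. The toy comparison below uses a uniform 8-bit codebook (an assumption for illustration, not the bitsandbytes dynamic map) on a tensor of small values plus one outlier:

```python
import numpy as np

cb = np.linspace(-1.0, 1.0, 256)  # assumed uniform 8-bit codebook

def mean_abs_error(x, absmax):
    # Quantize x with the given scaling factor, then measure the error.
    q = cb[np.abs(x[:, None] / absmax - cb).argmin(axis=1)] * absmax
    return float(np.abs(q - x).mean())

rng = np.random.default_rng(0)
x = rng.normal(scale=0.01, size=4096)
x[0] = 10.0                                   # single outlier

# Global: one scale dominated by the outlier.
global_err = mean_abs_error(x, np.abs(x).max())
# Block-wise: the outlier only inflates the scale of its own block.
block_err = float(np.mean([mean_abs_error(b, np.abs(b).max())
                           for b in x.reshape(-1, 256)]))
```

With these parameters the block-wise mean error comes out far below the global one, because only 1 of 16 blocks pays for the outlier.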
Quantization codebooks are non-uniform mappings designed to match the distribution of the data being quantized:
- 8-bit dynamic map (256 levels): Used for optimizer states. Created by create_dynamic_map(). Signed variant for momentum, unsigned for variance.
- 4-bit NF4 (16 levels): Used for weight quantization in QLoRA. Levels are placed at quantiles of a standard normal distribution.
- 4-bit FP4 (16 levels): An alternative 4-bit floating-point format for weight quantization.
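The NF4 idea of placing levels at normal quantiles can be sketched with the Python standard library. This is an illustrative approximation, not the exact NF4 table shipped by bitsandbytes, and normal_quantile_codebook is an invented name:

```python
from statistics import NormalDist

def normal_quantile_codebook(n_levels=16):
    # Place levels at evenly spaced quantiles of N(0, 1), then rescale so
    # the extreme levels sit at -1 and 1. The real NF4 construction is
    # slightly more elaborate; this only illustrates the principle.
    nd = NormalDist()
    qs = [(i + 0.5) / n_levels for i in range(n_levels)]  # avoid the 0 and 1 quantiles
    levels = [nd.inv_cdf(q) for q in qs]
    m = max(abs(v) for v in levels)
    return [v / m for v in levels]

codebook = normal_quantile_codebook()
# Levels are dense near 0 (where normally distributed weights concentrate)
# and sparse in the tails.
```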
Block sizes vary by use case:
- 4096: Default for optimizer state quantization via quantize_blockwise.
- 256: Used internally by the optimizer CUDA kernels (optimizer_update_8bit_blockwise).
- 64 or 128: Used for 4-bit weight quantization in quantize_4bit.
Nested quantization provides additional compression: the absmax scaling factors themselves (one float32 per block) can be further quantized using a second level of block-wise quantization. This is particularly useful for 4-bit weight quantization where the number of blocks (and thus absmax values) can be large. In nested mode, the absmax values are mean-centered, then blockwise-quantized with blocksize=256.
Usage
Block-wise quantization is the core building block used by:
- 8-bit optimizers: Optimizer states (momentum, variance) are quantized with blocksize=4096 via quantize_blockwise. The optimizer update kernels use an internal blocksize of 256.
- 4-bit weight quantization (QLoRA / Linear4bit): Model weights are quantized with blocksize=64 or blocksize=128 using NF4 or FP4 codebooks.
- Direct usage: quantize_blockwise and dequantize_blockwise can be called directly for custom quantization needs.
Valid block sizes are: 64, 128, 256, 512, 1024, 2048, and 4096.
Theoretical Basis
For a tensor T divided into blocks B_1, B_2, ..., B_k each of size B:
Per-block quantization:
absmax_i = max(|B_i|) # scaling factor for block i
normalized_i = B_i / absmax_i # values in [-1, 1]
quantized_i = nearest_in_codebook(normalized_i) # map to discrete levels
Per-block dequantization:
dequantized_i = codebook[quantized_i] * absmax_i
Error bound: The maximum quantization error for any value in block i is bounded by:
max_error_i <= absmax_i * (1/2 * max_codebook_spacing)
where max_codebook_spacing is the largest gap between adjacent codebook levels. Since absmax_i is a local maximum (not the global maximum), this bound is tighter than for global quantization whenever outliers are not uniformly distributed.
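A quick numeric check of this bound, under the assumption of a uniform 8-bit codebook (so max_codebook_spacing = 2/255):

```python
import numpy as np

# Verify max_error_i <= absmax_i * (1/2 * max_codebook_spacing) on one
# random block, using a uniform 8-bit codebook as an illustrative assumption.
cb = np.linspace(-1.0, 1.0, 256)
max_spacing = 2.0 / 255.0

rng = np.random.default_rng(1)
block = rng.normal(size=512)
absmax = np.abs(block).max()
deq = cb[np.abs(block[:, None] / absmax - cb).argmin(axis=1)] * absmax
max_error = np.abs(deq - block).max()
assert max_error <= absmax * (max_spacing / 2) + 1e-12
```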
Nested quantization: The absmax values (one float32 per block) add overhead. With nested quantization:
offset = mean(absmax_values)
centered = absmax_values - offset
quant_absmax, nested_state = quantize_blockwise(centered, blocksize=256)
This reduces the overhead from 32 bits per block to approximately 8 bits per block plus a small additional nested state, at the cost of slightly increased dequantization time.
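The three lines above expand into a self-contained round trip. This sketch assumes a uniform second-level 8-bit codebook and invented helper names; bitsandbytes performs the equivalent internally when nested quantization is enabled:

```python
import numpy as np

CB2 = np.linspace(-1.0, 1.0, 256)  # assumed uniform second-level 8-bit codebook

def nest_absmax(absmax, blocksize=256):
    # Mean-center the absmax values, then block-wise quantize the residuals.
    offset = absmax.mean()
    centered = absmax - offset
    pad = (-centered.size) % blocksize
    blocks = np.pad(centered, (0, pad)).reshape(-1, blocksize)
    scale = np.abs(blocks).max(axis=1, keepdims=True)
    scale[scale == 0] = 1.0
    idx = np.abs((blocks / scale)[..., None] - CB2).argmin(-1)
    return idx.astype(np.uint8), scale.ravel(), offset

def unnest_absmax(idx, scale, offset, n):
    # Dequantize the residuals and add the offset back.
    return (CB2[idx] * scale[:, None] + offset).ravel()[:n]
```

Each first-level absmax is now stored as one uint8 index; only the much smaller second-level scales and the single offset remain in float32.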
Memory overhead comparison (for a tensor of N elements):
| Method | Overhead per element | Absmax storage |
|---|---|---|
| Global quantization | None | 1 float32 total |
| Block-wise (B=4096) | 32/4096 = 0.0078 bits | N/4096 float32 values |
| Block-wise + nested | ~8/4096 = 0.002 bits | N/4096 uint8 values + nested state |
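The per-element figures in the table follow from amortizing one scaling value over each block of B elements:

```python
# One absmax per block of B elements, amortized over the block.
B = 4096
plain_bits = 32 / B    # float32 absmax: 0.0078125 bits per element
nested_bits = 8 / B    # uint8 absmax: ~0.00195 bits per element
                       # (plus a small second-level state, ignored here)
```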