Heuristic: Bitsandbytes Compressed Statistics (Double Quantization)
| Knowledge Sources | |
|---|---|
| Domains | Optimization, Quantization |
| Last Updated | 2026-02-07 13:00 GMT |
Overview
Double quantization (compress_statistics=True) reduces memory overhead by quantizing the absmax scaling factors themselves, cutting the per-parameter statistics overhead from ~0.5 bits to ~0.13 bits (a saving of ~0.37 bits per parameter) with minimal accuracy loss.
Description
Standard 4-bit quantization stores one float32 absmax value per block of 64 or 128 values, adding a non-trivial memory overhead (0.5 bits per parameter for blocksize=64). Double quantization (also called nested quantization or compressed statistics) addresses this by applying a second level of blockwise quantization to the absmax values themselves. The process: (1) compute the mean of absmax values as an offset, (2) subtract the offset, (3) quantize the centered absmax values using blockwise quantization with a fixed blocksize of 256. The result is a `state2` QuantState object stored alongside the primary quantization state.
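The three steps above can be sketched with a toy blockwise quantizer (a NumPy illustration of the idea only; the real bitsandbytes kernels use a dynamic quantization code, not this naive int8 scheme):

```python
import numpy as np

def quantize_blockwise_sim(x, blocksize):
    """Toy 8-bit blockwise quantizer: one absmax scale per block.
    Illustration only; not the bitsandbytes dynamic-code implementation."""
    x = x.reshape(-1, blocksize)
    absmax = np.abs(x).max(axis=1, keepdims=True)
    q = np.round(x / absmax * 127).astype(np.int8)
    return q, absmax

rng = np.random.default_rng(0)
weights = rng.standard_normal(64 * 256).astype(np.float32)

# Outer pass: one float32 absmax per block of 64 weights -> 256 scales.
_, absmax = quantize_blockwise_sim(weights, blocksize=64)

# Double quantization of the statistics:
# (1) offset = mean of the (all-positive) absmax values,
# (2) subtract the offset to center them,
# (3) blockwise-quantize the centered scales with the fixed nested blocksize 256.
offset = absmax.mean()
qabsmax, absmax2 = quantize_blockwise_sim(absmax.ravel() - offset, blocksize=256)
```

After this, only the int8 `qabsmax`, the scalar `offset`, and the second-level `absmax2` need to be stored, instead of 256 float32 scales.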
Usage
Apply this heuristic when memory is a constraint and you are using 4-bit quantization. Enable with `compress_statistics=True` in `BitsAndBytesConfig` or `quantize_4bit()`. This is the default in QLoRA training configurations. The memory savings become more significant with smaller blocksizes (more absmax values to compress).
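In a QLoRA-style setup with Hugging Face transformers, this is enabled through `BitsAndBytesConfig` (a sketch; the model id is a placeholder and actually loading it requires a CUDA-capable environment with the model available):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# NF4 weights with double quantization of the absmax statistics.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,  # maps to compress_statistics=True downstream
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Placeholder model id; any causal LM loads the same way.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
)
```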
The Insight (Rule of Thumb)
- Action: Set `compress_statistics=True` in `BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_use_double_quant=True)`.
- Value: Fixed nested blocksize of 256 for absmax quantization, independent of the outer blocksize.
- Trade-off: Saves ~0.37 bits per parameter (for blocksize=64) at the cost of slightly slower dequantization (one extra quantize/dequantize step for the statistics).
- Offset centering: The absmax mean is subtracted before nested quantization to improve the quantization distribution.
- No cascading: The nested quantization is always non-nested (the second-level absmax is stored in float32).
Reasoning
Without compression, a blocksize=64 model stores one FP32 absmax per 64 params, adding 32/(64*4) = 12.5% overhead relative to the 4-bit weights themselves. With double quantization, the absmax values are themselves quantized to 8-bit, reducing this to roughly 8/(64*4) plus a small `state2` overhead, or about 3%. The mean-offset centering ensures the absmax values (which are all positive) are better distributed for the quantization codebook.
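The arithmetic above can be checked directly (bits of statistics overhead per parameter, using the blocksizes stated in this document):

```python
# Quantization-state overhead in bits per parameter, blocksize=64 weights.
outer_blocksize = 64
nested_blocksize = 256

# Without double quantization: one float32 absmax per 64 weights.
plain = 32 / outer_blocksize  # 0.5 bits/param

# With double quantization: 8-bit absmax per 64 weights, plus one
# float32 second-level absmax per 256 first-level absmax values.
# (The single scalar offset is negligible and omitted here.)
nested = 8 / outer_blocksize + 32 / (outer_blocksize * nested_blocksize)

saved = plain - nested  # ~0.373 bits/param, matching the ~0.37 figure above
relative = nested / 4   # ~3.2% overhead relative to the 4-bit weights
```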
Double quantization in 4-bit from `bitsandbytes/functional.py:873-886`:
```python
if compress_statistics:
    offset = _absmax.mean()
    qabsmax, state2 = quantize_blockwise(_absmax - offset, blocksize=256)
    del _absmax
    state = QuantState(
        absmax=qabsmax,
        shape=input_shape,
        dtype=A.dtype,
        blocksize=blocksize,
        code=code,
        quant_type=quant_type,
        offset=offset,
        state2=state2,
    )
```
Nested quantization in blockwise from `bitsandbytes/functional.py:616-627`:
```python
if nested:
    offset = _absmax.mean()
    _absmax -= offset
    qabsmax, state2 = quantize_blockwise(_absmax, blocksize=blocksize, nested=False)
    quant_state = QuantState(
        absmax=qabsmax,
        code=code.to(A.device, copy=True),
        blocksize=blocksize,
        dtype=A.dtype,
        offset=offset,
        state2=state2,
    )
```