

Heuristic: Bitsandbytes Compressed Statistics (Double Quantization)

From Leeroopedia



Domains: Optimization, Quantization
Last Updated: 2026-02-07 13:00 GMT

Overview

Double quantization (`compress_statistics=True`) reduces memory overhead by quantizing the absmax scaling factors themselves, saving roughly 0.37 bits per parameter (at blocksize=64) with minimal accuracy loss.

Description

Standard 4-bit quantization stores one float32 absmax value per block of 64 or 128 values, adding a non-trivial memory overhead (0.5 bits per parameter for blocksize=64). Double quantization (also called nested quantization or compressed statistics) addresses this by applying a second level of blockwise quantization to the absmax values themselves. The process: (1) compute the mean of absmax values as an offset, (2) subtract the offset, (3) quantize the centered absmax values using blockwise quantization with a fixed blocksize of 256. The result is a `state2` QuantState object stored alongside the primary quantization state.
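
The same structure can be seen in a minimal pure-PyTorch sketch. This is illustrative only: the uniform symmetric int8 codebook below stands in for bitsandbytes' actual NF4/dynamic codebooks, and the function name is hypothetical:

import torch

def double_quant_absmax(weights, blocksize=64, nested_blocksize=256):
    # Level 1: one float32 absmax per block of `blocksize` weights.
    blocks = weights.flatten().reshape(-1, blocksize)
    absmax = blocks.abs().amax(dim=1)                   # one value per block

    # Level 2: center the absmax values, then quantize them blockwise
    # (simulated here with symmetric int8).
    offset = absmax.mean()                              # kept in float32
    centered = (absmax - offset).reshape(-1, nested_blocksize)
    absmax2 = centered.abs().amax(dim=1, keepdim=True)  # one fp32 per 256 absmax
    q_absmax = torch.round(centered / absmax2 * 127).to(torch.int8)

    # Dequantizing the statistics reverses both steps.
    restored = (q_absmax.float() / 127 * absmax2).flatten() + offset
    return q_absmax, absmax2, offset, restored

w = torch.randn(64 * 256 * 4)  # length divisible by both blocksizes
*_, restored = double_quant_absmax(w)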

Usage

Apply this heuristic when memory is a constraint and you are using 4-bit quantization. Enable it with `bnb_4bit_use_double_quant=True` in `BitsAndBytesConfig`, or with `compress_statistics=True` when calling `quantize_4bit()` directly. This is the default in QLoRA training configurations. The memory savings grow with smaller blocksizes, since there are more absmax values to compress.
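
For example, a typical QLoRA-style load via transformers (the model id is a placeholder):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,        # double quantization on
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "org/model-7b",                        # placeholder model id
    quantization_config=bnb_config,
)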

The Insight (Rule of Thumb)

  • Action: Set `bnb_4bit_use_double_quant=True` in `BitsAndBytesConfig(load_in_4bit=True, ...)`, or pass `compress_statistics=True` to the functional API (see the sketch after this list).
  • Value: Fixed nested blocksize of 256 for absmax quantization, independent of the outer blocksize.
  • Trade-off: Saves ~0.37 bits per parameter (for blocksize=64) at the cost of slightly slower dequantization (one extra quantize/dequantize step for the statistics).
  • Offset centering: The absmax mean is subtracted before nested quantization to improve the quantization distribution.
  • No cascading: The nested quantization is always non-nested (the second-level absmax is stored in float32).
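
A short sketch of the functional API, assuming a CUDA device is available (`quantize_4bit` and `dequantize_4bit` are real bitsandbytes entry points; the tensor shape is illustrative):

import torch
import bitsandbytes.functional as F

A = torch.randn(4096, 4096, dtype=torch.float16, device="cuda")

# compress_statistics=True triggers the nested quantization of absmax.
packed, state = F.quantize_4bit(
    A, blocksize=64, compress_statistics=True, quant_type="nf4"
)

print(state.blocksize)         # 64: outer blocksize
print(state.offset)            # mean of the original absmax values
print(state.state2.blocksize)  # 256: fixed nested blocksize

# Round trip: dequantize_4bit unwinds both quantization levels.
restored = F.dequantize_4bit(packed, state)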

Reasoning

Without compression, a blocksize=64 model stores one FP32 absmax per 64 parameters: 32/64 = 0.5 extra bits per 4-bit parameter, a 0.5/4 = 12.5% overhead. With double quantization, the absmax values are themselves quantized to 8-bit, reducing this to roughly (8 + 32/256)/64 ≈ 0.127 bits per parameter, about 3% overhead. The mean-offset centering ensures that the absmax values (which are all positive) are better distributed for the quantization codebook.
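
Written out as arithmetic, consistent with the ~0.37 bits per parameter figure above:

blocksize = 64

# No compression: one fp32 absmax per block of 64 four-bit weights.
plain = 32 / blocksize                 # 0.5 bits/param, 0.5/4 = 12.5% overhead

# Double quantization: 8-bit absmax per block, plus one fp32 absmax
# per 256 quantized absmax values at the nested level.
nested = (8 + 32 / 256) / blocksize    # ~0.127 bits/param, ~3.2% overhead

print(plain - nested)                  # ~0.373 bits per parameter saved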

Double quantization in 4-bit from `bitsandbytes/functional.py:873-886`:

if compress_statistics:
    # Center the absmax values around zero before nested quantization.
    offset = _absmax.mean()
    # Second-level blockwise quantization of the centered statistics;
    # the nested blocksize is fixed at 256 regardless of the outer blocksize.
    qabsmax, state2 = quantize_blockwise(_absmax - offset, blocksize=256)
    del _absmax
    state = QuantState(
        absmax=qabsmax,          # quantized absmax values
        shape=input_shape,
        dtype=A.dtype,
        blocksize=blocksize,     # outer blocksize (e.g. 64)
        code=code,
        quant_type=quant_type,
        offset=offset,           # mean offset, re-added on dequantization
        state2=state2,           # QuantState for the nested level
    )

Nested quantization in blockwise from `bitsandbytes/functional.py:616-627`:

if nested:
    # Same pattern as the 4-bit path: center the absmax values, then
    # quantize them blockwise with nesting explicitly disabled, so the
    # second-level absmax stays in float32 (no cascading).
    offset = _absmax.mean()
    _absmax -= offset
    qabsmax, state2 = quantize_blockwise(_absmax, blocksize=blocksize, nested=False)
    quant_state = QuantState(
        absmax=qabsmax,
        code=code.to(A.device, copy=True),
        blocksize=blocksize,
        dtype=A.dtype,
        offset=offset,
        state2=state2,
    )
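
On the dequantization side, the nesting unwinds in reverse. A hedged sketch of the statistics recovery (mirroring what `dequantize_4bit` does internally; `recover_absmax` is a hypothetical helper name):

import torch
import bitsandbytes.functional as F
from bitsandbytes.functional import QuantState

def recover_absmax(state: QuantState) -> torch.Tensor:
    # Dequantize the nested statistics, then re-add the mean offset
    # that was subtracted before the second-level quantization.
    absmax = F.dequantize_blockwise(state.absmax, state.state2)
    return absmax + state.offset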
