
Principle: Bitsandbytes Blockwise Quantization

From Leeroopedia


Sources: Paper: 8-bit Optimizers via Block-wise Quantization; Paper: 8-Bit Approximations for Parallelism in Deep Learning; Repo: bitsandbytes
Domains: Quantization
Last updated: 2026-02-07 14:00 GMT

Overview

A quantization strategy that divides tensors into fixed-size blocks, each independently quantized with its own scaling factor, to reduce quantization error caused by outlier values. This is the fundamental quantization building block used across all bitsandbytes quantization features.

Description

The outlier problem with global quantization: In naive quantization, a single scaling factor is computed for the entire tensor based on its global maximum absolute value. If the tensor contains outlier values (common in neural network weights and optimizer states), the scaling factor is dominated by these outliers, causing the vast majority of values to be represented with very few quantization levels. This leads to high quantization error for the typical (non-outlier) values.
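The outlier effect can be demonstrated with a small NumPy sketch (illustrative, not the bitsandbytes implementation; a uniform 256-level codebook stands in for the dynamic map). A single planted outlier inflates the global scale and degrades every other value, while per-block scales contain the damage:

```python
import numpy as np

rng = np.random.default_rng(1)
codebook = np.linspace(-1.0, 1.0, 256)  # uniform 8-bit stand-in codebook

x = rng.standard_normal(4096)
x[0] = 100.0  # a single outlier dominates the global absmax

def roundtrip(values, scale):
    """Quantize values with the given absmax scale, then dequantize."""
    idx = np.abs((values / scale)[:, None] - codebook).argmin(axis=1)
    return codebook[idx] * scale

# global quantization: one scale for the whole tensor
x_global = roundtrip(x, np.abs(x).max())

# block-wise quantization: an independent scale per 64-element block
blocks = x.reshape(-1, 64)
x_block = np.concatenate([roundtrip(b, np.abs(b).max()) for b in blocks])

# compare reconstruction error on typical values (outside the outlier's block)
err_global = np.abs(x[64:] - x_global[64:]).max()
err_block = np.abs(x[64:] - x_block[64:]).max()
```

Because the outlier only inflates the scale of its own block, `err_block` comes out far smaller than `err_global`.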

Block-wise quantization solves this by dividing the tensor into contiguous blocks of B elements. Each block is independently quantized with its own absmax scaling factor:

  1. Partition the tensor into blocks of B elements.
  2. For each block, compute the absolute maximum value as the scaling factor.
  3. Normalize each block by its scaling factor, mapping values to the range [-1, 1].
  4. Map each normalized value to the nearest point in a quantization codebook.

This dramatically reduces quantization error because local outliers only affect their block, not the entire tensor. A single large value in block k has no impact on the quantization precision available for values in other blocks.
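The four steps above can be sketched as a NumPy reference implementation (function names are illustrative, not the bitsandbytes API; a uniform 8-bit codebook again stands in for the dynamic map):

```python
import numpy as np

def quantize_blockwise_ref(x, codebook, block_size=4096):
    """Reference block-wise quantization following steps 1-4.

    Returns uint8 codebook indices plus one absmax per block.
    """
    x = x.ravel()
    pad = (-len(x)) % block_size
    blocks = np.pad(x, (0, pad)).reshape(-1, block_size)  # step 1: partition
    absmax = np.abs(blocks).max(axis=1, keepdims=True)    # step 2: per-block scale
    absmax[absmax == 0] = 1.0                             # avoid divide-by-zero
    normalized = blocks / absmax                          # step 3: map to [-1, 1]
    # step 4: nearest codebook level for each normalized value
    idx = np.abs(normalized[..., None] - codebook).argmin(axis=-1)
    return idx.astype(np.uint8), absmax.ravel()

def dequantize_blockwise_ref(idx, absmax, codebook):
    """Invert the mapping: look up codebook levels, rescale per block."""
    return codebook[idx] * absmax[:, None]

codebook = np.linspace(-1.0, 1.0, 256)  # uniform stand-in codebook
x = np.random.default_rng(0).standard_normal(8192).astype(np.float32)
idx, absmax = quantize_blockwise_ref(x, codebook)
x_hat = dequantize_blockwise_ref(idx, absmax, codebook).ravel()[:len(x)]
```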

Quantization codebooks are non-uniform mappings designed to match the distribution of the data being quantized:

  • 8-bit dynamic map (256 levels): Used for optimizer states. Created by create_dynamic_map(). Signed variant for momentum, unsigned for variance.
  • 4-bit NF4 (16 levels): Used for weight quantization in QLoRA. Levels are placed at quantiles of a standard normal distribution.
  • 4-bit FP4 (16 levels): An alternative 4-bit floating-point format for weight quantization.
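The normal-quantile idea behind NF4 can be illustrated with a simplified symmetric sketch (the real NF4 construction is asymmetric and includes an exact zero level; the clipping offset mirrors the value used in the QLoRA paper, and the function name is illustrative):

```python
from statistics import NormalDist

import numpy as np

def normal_quantile_levels(k=16, offset=0.9677083):
    """Place k codebook levels at evenly spaced quantiles of N(0, 1),
    clipped by `offset` so the extreme quantiles stay finite, then
    rescale so the largest level is exactly 1."""
    nd = NormalDist()
    probs = np.linspace(1.0 - offset, offset, k)
    levels = np.array([nd.inv_cdf(p) for p in probs])
    return levels / np.abs(levels).max()

levels = normal_quantile_levels()
```

The resulting levels are dense near zero and sparse in the tails, matching the roughly normal distribution of neural network weights.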

Block sizes vary by use case:

  • 4096: Default for optimizer state quantization via quantize_blockwise.
  • 256: Used internally by the optimizer CUDA kernels (optimizer_update_8bit_blockwise).
  • 64 or 128: Used for 4-bit weight quantization in quantize_4bit.

Nested quantization provides additional compression: the absmax scaling factors themselves (one float32 per block) can be further quantized using a second level of block-wise quantization. This is particularly useful for 4-bit weight quantization where the number of blocks (and thus absmax values) can be large. In nested mode, the absmax values are mean-centered, then blockwise-quantized with blocksize=256.

Usage

Block-wise quantization is the core building block used by:

  • 8-bit optimizers: Optimizer states (momentum, variance) are quantized with blocksize=4096 via quantize_blockwise. The optimizer update kernels use an internal blocksize of 256.
  • 4-bit weight quantization (QLoRA / Linear4bit): Model weights are quantized with blocksize=64 or blocksize=128 using NF4 or FP4 codebooks.
  • Direct usage: quantize_blockwise and dequantize_blockwise can be called directly for custom quantization needs.

Valid block sizes are: 64, 128, 256, 512, 1024, 2048, and 4096.

Theoretical Basis

For a tensor T divided into k blocks B_1, B_2, ..., B_k, each containing B elements:

Per-block quantization:

absmax_i = max(|B_i|)                           # scaling factor for block i
normalized_i = B_i / absmax_i                   # values in [-1, 1]
quantized_i = nearest_in_codebook(normalized_i) # map to discrete levels

Per-block dequantization:

dequantized_i = codebook[quantized_i] * absmax_i

Error bound: The maximum quantization error for any value in block i is bounded by:

max_error_i <= absmax_i * (1/2 * max_codebook_spacing)

where max_codebook_spacing is the largest gap between adjacent codebook levels. Since absmax_i is a local maximum (not the global maximum), this bound is tighter than for global quantization whenever outliers are not uniformly distributed.
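The bound can be checked numerically with a uniform 256-level codebook as a stand-in (illustrative; not the bitsandbytes dynamic map):

```python
import numpy as np

rng = np.random.default_rng(0)
codebook = np.linspace(-1.0, 1.0, 256)   # uniform stand-in codebook
max_spacing = np.diff(codebook).max()    # largest gap between adjacent levels

block = rng.standard_normal(4096)
absmax = np.abs(block).max()
# quantize to nearest level, then dequantize
idx = np.abs((block / absmax)[:, None] - codebook).argmin(axis=1)
reconstructed = codebook[idx] * absmax

max_error = np.abs(block - reconstructed).max()
bound = absmax * max_spacing / 2         # the bound from the text
```

Nearest-level rounding guarantees `max_error <= bound` for every block, and a block with a smaller local absmax gets a proportionally tighter bound.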

Nested quantization: The absmax values (one float32 per block) add overhead. With nested quantization:

offset = mean(absmax_values)
centered = absmax_values - offset
quant_absmax, nested_state = quantize_blockwise(centered, blocksize=256)

This reduces the overhead from 32 bits per block to approximately 8 bits per block plus a small additional nested state, at the cost of slightly increased dequantization time.
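The three lines above can be fleshed out into a runnable sketch, using a NumPy stand-in for quantize_blockwise (all names here are illustrative, not the bitsandbytes API):

```python
import numpy as np

codebook = np.linspace(-1.0, 1.0, 256)   # uniform stand-in codebook

def quantize_nested_absmax(absmax_values, blocksize=256):
    """Mean-center the per-block absmax values, then block-wise
    quantize the centered values a second time: uint8 indices plus
    one outer float32 scale per group of `blocksize` values."""
    offset = absmax_values.mean()
    centered = absmax_values - offset
    groups = centered.reshape(-1, blocksize)  # assumes len divisible by blocksize
    outer = np.abs(groups).max(axis=1, keepdims=True)
    idx = np.abs((groups / outer)[..., None] - codebook).argmin(axis=-1)
    return idx.astype(np.uint8), outer.ravel(), offset

def dequantize_nested_absmax(idx, outer, offset):
    """Invert: look up levels, rescale per group, add the offset back."""
    return codebook[idx] * outer[:, None] + offset

absmax_values = np.random.default_rng(0).uniform(0.5, 4.0, 1024)
idx, outer, offset = quantize_nested_absmax(absmax_values)
recovered = dequantize_nested_absmax(idx, outer, offset).ravel()
```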

Memory overhead comparison (for a tensor of N elements):

Method               | Overhead per element      | Absmax storage
Global quantization  | None                      | 1 float32 total
Block-wise (B=4096)  | 32/4096 = 0.0078 bits     | N/4096 float32 values
Block-wise + nested  | ~8/4096 = 0.002 bits      | N/4096 uint8 values + nested state
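The per-element overhead figures follow from simple storage counts (illustrative arithmetic; the small nested state is ignored):

```python
N = 1 << 22          # example: a 4M-element tensor
B = 4096             # block size
n_blocks = N // B

# block-wise: one float32 (32-bit) absmax per block
blockwise_bits = 32 * n_blocks / N   # = 32/4096 = 0.0078125 bits per element
# nested: one uint8 (8-bit) quantized absmax per block
nested_bits = 8 * n_blocks / N       # = 8/4096, roughly 0.002 bits per element
```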
