Implementation: Bitsandbytes Quantize Blockwise
| Sources | Repo: bitsandbytes, Paper: 8-bit Optimizers via Block-wise Quantization |
|---|---|
| Domains | Quantization |
Overview
A concrete tool for blockwise 8-bit tensor quantization provided by the bitsandbytes library. `quantize_blockwise` divides an input tensor into fixed-size blocks, computes a per-block scaling factor, and maps each value to the nearest point in a 256-level dynamic quantization codebook.
Description
quantize_blockwise implements the core block-wise quantization algorithm:
- Partition the input tensor into contiguous blocks of `blocksize` elements (default 4096).
- Compute per-block absmax: for each block, compute `absmax_i = max(|block_i|)` as the scaling factor.
- Quantize: normalize each block by its absmax and map each value to the nearest point in a dynamic quantization map (codebook) of 256 levels for 8-bit output.
- Return the quantized `uint8` tensor and a `QuantState` object containing the absmax values, codebook, blocksize, and original dtype.
Dynamic quantization map: If no codebook is provided, the function creates and caches a signed dynamic map via create_dynamic_map(signed=True). This map has 256 non-uniformly spaced levels optimized for the typical distribution of neural network values.
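The steps above can be sketched in pure NumPy. This is a minimal illustration, not the library's implementation: the uniform codebook stands in for the non-uniform dynamic map, and the `_sketch` function names are hypothetical, not part of the bitsandbytes API.

```python
import numpy as np

def quantize_blockwise_sketch(x, code, blocksize=4096):
    """Toy re-implementation of blockwise 8-bit quantization (illustration only)."""
    x = x.reshape(-1)
    n_blocks = -(-x.size // blocksize)  # ceil division
    # Pad the flattened input so it splits evenly into blocks.
    blocks = np.pad(x, (0, n_blocks * blocksize - x.size)).reshape(n_blocks, blocksize)
    # Per-block scaling factor: absmax_i = max(|block_i|).
    absmax = np.abs(blocks).max(axis=1)
    safe = np.where(absmax == 0, 1.0, absmax)      # avoid division by zero
    normed = blocks / safe[:, None]                # values now lie in [-1, 1]
    # Nearest codebook entry for every value -> uint8 index.
    idx = np.abs(normed[..., None] - code).argmin(axis=-1).astype(np.uint8)
    return idx, absmax

def dequantize_blockwise_sketch(idx, absmax, code, n):
    # Look up each index in the codebook, rescale by the block's absmax,
    # and drop the padding introduced during quantization.
    safe = np.where(absmax == 0, 1.0, absmax)
    return (code[idx] * safe[:, None]).reshape(-1)[:n]

# Toy uniform 256-level codebook; the library's dynamic map is non-uniform.
code = np.linspace(-1.0, 1.0, 256, dtype=np.float32)
x = np.random.default_rng(0).standard_normal(6000).astype(np.float32)
q, absmax = quantize_blockwise_sketch(x, code)
x_hat = dequantize_blockwise_sketch(q, absmax, code, x.size)
```

With a uniform codebook the round-trip error per value is bounded by half a codebook step scaled by the block's absmax; the real dynamic map trades a tighter bound near zero for a looser one at the extremes.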
Nested quantization: When nested=True, the absmax scaling factors are further quantized to reduce overhead:
- Compute the mean of all absmax values as an offset.
- Subtract the offset from the absmax values.
- Recursively call `quantize_blockwise` on the centered absmax values (with `nested=False`).
- Store the offset and nested quantization state in the returned `QuantState`.
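The offset/center/quantize round trip for the absmax values can be illustrated as follows. A single signed uniform 8-bit scale stands in here for the recursive `quantize_blockwise` call the library actually makes; the data is synthetic.

```python
import numpy as np

# Illustrative absmax vector, as might come from many blocks of a large tensor.
rng = np.random.default_rng(0)
absmax = rng.uniform(0.5, 2.0, 1024).astype(np.float32)

offset = np.float32(absmax.mean())   # step 1: mean of all absmax values
centered = absmax - offset           # step 2: subtract the offset
# Step 3: 8-bit quantize the centered values (toy uniform signed scheme
# standing in for the recursive blockwise call).
scale = np.abs(centered).max() / 127
q = np.round(centered / scale).astype(np.int8)
# Dequantization reverses the process: rescale, then add the offset back.
restored = q.astype(np.float32) * scale + offset
```

Centering around the mean keeps the values to be quantized small and roughly symmetric, which is what makes the second 8-bit pass cheap in accuracy.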
The corresponding dequantize_blockwise function reverses the process: it looks up each quantized index in the codebook, multiplies by the per-block absmax, and returns an FP32/FP16/BF16 tensor. When nested quantization was used, it first dequantizes the absmax values before proceeding.
The underlying computation is dispatched to optimized CUDA/ROCm/CPU kernels via torch.ops.bitsandbytes.quantize_blockwise.
Code Reference
| Source | File | Lines |
|---|---|---|
| bitsandbytes repo | bitsandbytes/functional.py | `quantize_blockwise` L570-638 |
| bitsandbytes repo | bitsandbytes/functional.py | `dequantize_blockwise` L641-715 |
Function signature:
```python
def quantize_blockwise(
    A: torch.Tensor,
    code: Optional[torch.Tensor] = None,
    absmax: Optional[torch.Tensor] = None,
    out: Optional[torch.Tensor] = None,
    blocksize: int = 4096,
    nested: bool = False,
) -> tuple[torch.Tensor, QuantState]:
```
Dequantize signature:
```python
def dequantize_blockwise(
    A: torch.Tensor,
    quant_state: Optional[QuantState] = None,
    absmax: Optional[torch.Tensor] = None,
    code: Optional[torch.Tensor] = None,
    out: Optional[torch.Tensor] = None,
    blocksize: int = 4096,
    nested: bool = False,
) -> torch.Tensor:
```
Import:
```python
from bitsandbytes.functional import quantize_blockwise, dequantize_blockwise
```
I/O Contract
quantize_blockwise Inputs
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| `A` | `torch.Tensor` | Yes | -- | Input tensor to quantize. Supports float16, bfloat16, or float32 dtypes. |
| `code` | `torch.Tensor` | No | signed dynamic map | Quantization codebook (256 levels). Defaults to `create_dynamic_map(signed=True)`. |
| `absmax` | `torch.Tensor` | No | None | Pre-allocated tensor for absmax values (deprecated). |
| `out` | `torch.Tensor` | No | None | Pre-allocated output tensor (deprecated). |
| `blocksize` | `int` | No | 4096 | Block size for quantization. Valid values: 64, 128, 256, 512, 1024, 2048, 4096. |
| `nested` | `bool` | No | False | Whether to additionally quantize the absmax scaling factors for extra compression. |
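The `blocksize` choice trades accuracy against storage overhead: smaller blocks track local value ranges more tightly but require more scaling factors. Assuming each absmax is stored as one fp32 per block (and ignoring the shared codebook), the effective cost is 8 + 32/`blocksize` bits per value, which this back-of-the-envelope loop illustrates:

```python
# Effective bits per value when storing one fp32 absmax per block.
# (Assumes fp32 absmax storage; ignores the shared 256-entry codebook.)
for blocksize in (64, 256, 1024, 4096):
    bits_per_value = 8 + 32 / blocksize
    ratio_vs_fp32 = 32 / bits_per_value
    print(f"blocksize={blocksize:5d}  bits/value={bits_per_value:.3f}  "
          f"compression vs fp32={ratio_vs_fp32:.2f}x")
```

Even at the smallest valid blocksize of 64, the overhead only raises the cost from 8 to 8.5 bits per value.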
quantize_blockwise Outputs
| Output | Type | Description |
|---|---|---|
| quantized tensor | torch.Tensor (uint8) | The quantized tensor with values in [0, 255] representing codebook indices |
| quant_state | QuantState | State object containing: `absmax` (per-block scaling factors), `code` (quantization codebook), `blocksize`, `dtype` (original dtype), and optionally `offset` and `state2` (for nested quantization) |
dequantize_blockwise Inputs
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| `A` | `torch.Tensor` (uint8) | Yes | -- | The quantized input tensor |
| `quant_state` | `QuantState` | No* | None | The quantization state returned by `quantize_blockwise`. *Required if `absmax` is not provided. |
| `absmax` | `torch.Tensor` | No* | None | Per-block scaling factors. *Required if `quant_state` is not provided. |
| `code` | `torch.Tensor` | No | signed dynamic map | Quantization codebook. Ignored when `quant_state` is provided. |
| `out` | `torch.Tensor` | No | None | Pre-allocated output tensor. |
| `blocksize` | `int` | No | 4096 | Block size. Ignored when `quant_state` is provided. |
dequantize_blockwise Output
| Output | Type | Description |
|---|---|---|
| dequantized tensor | torch.Tensor | The dequantized tensor. Dtype is determined by `quant_state.dtype` (defaults to float32). |
Usage Examples
Quantize and dequantize a tensor:
```python
import torch
from bitsandbytes.functional import quantize_blockwise, dequantize_blockwise

# Create a sample tensor
tensor = torch.randn(8192, dtype=torch.float32, device="cuda")

# Quantize to 8-bit with blocksize 4096
quantized, quant_state = quantize_blockwise(tensor, blocksize=4096)

# quantized is uint8, quant_state contains absmax and codebook
print(quantized.dtype)           # torch.uint8
print(quant_state.blocksize)     # 4096
print(quant_state.absmax.shape)  # torch.Size([2]) -- 8192/4096 = 2 blocks

# Dequantize back to float32
recovered = dequantize_blockwise(quantized, quant_state)
print(recovered.dtype)           # torch.float32

# Check reconstruction error
max_error = (tensor - recovered).abs().max()
print(f"Max reconstruction error: {max_error:.6f}")
```
With nested quantization for additional compression:
```python
quantized, quant_state = quantize_blockwise(
    tensor, blocksize=4096, nested=True
)

# quant_state now contains nested state for the absmax values
print(quant_state.offset)  # mean of absmax values
print(quant_state.state2)  # nested QuantState for absmax

recovered = dequantize_blockwise(quantized, quant_state)
```
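A rough byte count shows why nesting the absmax values pays off for large tensors. This is a sketch under stated assumptions, not the library's exact bookkeeping: it assumes the nested absmax values are stored as uint8, the inner quantization keeps one fp32 absmax per 256 values (the inner blocksize is an assumption here), plus one fp32 offset.

```python
n = 1_000_000                         # elements in the original tensor
blocksize = 4096
n_blocks = -(-n // blocksize)         # ceil division: 245 blocks

# Plain blockwise: one fp32 absmax (4 bytes) per block.
plain_bytes = n_blocks * 4

# Nested (sketch): uint8 absmax per block, plus fp32 absmax for the inner
# quantization (inner blocksize of 256 assumed for illustration) and one
# fp32 offset.
inner_blocks = -(-n_blocks // 256)
nested_bytes = n_blocks * 1 + inner_blocks * 4 + 4

print(plain_bytes, nested_bytes)
```

Under these assumptions the scaling-factor overhead shrinks from 980 bytes to 253 bytes for a million-element tensor, at the cost of a second (much smaller) quantization error on the scales themselves.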