Implementation: Bitsandbytes Quantize Blockwise
| Sources | Repo: bitsandbytes, Paper: 8-bit Optimizers via Block-wise Quantization |
|---|---|
| Domains | Quantization |
Overview
A concrete tool for blockwise 8-bit tensor quantization provided by the bitsandbytes library. `quantize_blockwise` divides an input tensor into fixed-size blocks, computes a per-block scaling factor, and maps each value to the nearest point in a 256-level dynamic quantization codebook.
Description
quantize_blockwise implements the core block-wise quantization algorithm:
- Partition the input tensor into contiguous blocks of `blocksize` elements (default 4096).
- Compute per-block absmax: for each block, compute `absmax_i = max(|block_i|)` as the scaling factor.
- Quantize: normalize each block by its absmax and map each value to the nearest point in a dynamic quantization map (codebook) of 256 levels for 8-bit output.
- Return the quantized `uint8` tensor and a `QuantState` object containing the absmax values, codebook, blocksize, and original dtype.
Dynamic quantization map: If no codebook is provided, the function creates and caches a signed dynamic map via create_dynamic_map(signed=True). This map has 256 non-uniformly spaced levels optimized for the typical distribution of neural network values.
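The steps above can be sketched in pure NumPy. This is a minimal illustration, not the library's implementation: the uniform codebook stands in for the non-uniform dynamic map, and the `_sketch` function names are hypothetical, not part of the bitsandbytes API.

```python
import numpy as np

def quantize_blockwise_sketch(x, code, blocksize=4096):
    """Toy re-implementation of blockwise 8-bit quantization (illustration only)."""
    x = x.reshape(-1)
    n_blocks = -(-x.size // blocksize)  # ceil division
    # Pad the flattened input so it splits evenly into blocks.
    blocks = np.pad(x, (0, n_blocks * blocksize - x.size)).reshape(n_blocks, blocksize)
    # Per-block scaling factor: absmax_i = max(|block_i|).
    absmax = np.abs(blocks).max(axis=1)
    safe = np.where(absmax == 0, 1.0, absmax)      # avoid division by zero
    normed = blocks / safe[:, None]                # values now lie in [-1, 1]
    # Nearest codebook entry for every value -> uint8 index.
    idx = np.abs(normed[..., None] - code).argmin(axis=-1).astype(np.uint8)
    return idx, absmax

def dequantize_blockwise_sketch(idx, absmax, code, n):
    # Look up each index in the codebook, rescale by the block's absmax,
    # and drop the padding introduced during quantization.
    safe = np.where(absmax == 0, 1.0, absmax)
    return (code[idx] * safe[:, None]).reshape(-1)[:n]

# Toy uniform 256-level codebook; the library's dynamic map is non-uniform.
code = np.linspace(-1.0, 1.0, 256, dtype=np.float32)
x = np.random.default_rng(0).standard_normal(6000).astype(np.float32)
q, absmax = quantize_blockwise_sketch(x, code)
x_hat = dequantize_blockwise_sketch(q, absmax, code, x.size)
```

With a uniform codebook the round-trip error per value is bounded by half a codebook step scaled by the block's absmax; the real dynamic map trades a tighter bound near zero for a looser one at the extremes.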
Nested quantization: When nested=True, the absmax scaling factors are further quantized to reduce overhead:
- Compute the mean of all absmax values as an offset.
- Subtract the offset from the absmax values.
- Recursively call `quantize_blockwise` on the centered absmax values (with `nested=False`).
- Store the offset and nested quantization state in the returned `QuantState`.
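The offset/center/quantize round trip for the absmax values can be illustrated as follows. A single signed uniform 8-bit scale stands in here for the recursive `quantize_blockwise` call the library actually makes; the data is synthetic.

```python
import numpy as np

# Illustrative absmax vector, as might come from many blocks of a large tensor.
rng = np.random.default_rng(0)
absmax = rng.uniform(0.5, 2.0, 1024).astype(np.float32)

offset = np.float32(absmax.mean())   # step 1: mean of all absmax values
centered = absmax - offset           # step 2: subtract the offset
# Step 3: 8-bit quantize the centered values (toy uniform signed scheme
# standing in for the recursive blockwise call).
scale = np.abs(centered).max() / 127
q = np.round(centered / scale).astype(np.int8)
# Dequantization reverses the process: rescale, then add the offset back.
restored = q.astype(np.float32) * scale + offset
```

Centering around the mean keeps the values to be quantized small and roughly symmetric, which is what makes the second 8-bit pass cheap in accuracy.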
The corresponding dequantize_blockwise function reverses the process: it looks up each quantized index in the codebook, multiplies by the per-block absmax, and returns an FP32/FP16/BF16 tensor. When nested quantization was used, it first dequantizes the absmax values before proceeding.
The underlying computation is dispatched to optimized CUDA/ROCm/CPU kernels via torch.ops.bitsandbytes.quantize_blockwise.
Code Reference
| Source | File | Lines |
|---|---|---|
| bitsandbytes repo | bitsandbytes/functional.py | `quantize_blockwise` L570-638 |
| bitsandbytes repo | bitsandbytes/functional.py | `dequantize_blockwise` L641-715 |
Function signature:
```python
def quantize_blockwise(
    A: torch.Tensor,
    code: Optional[torch.Tensor] = None,
    absmax: Optional[torch.Tensor] = None,
    out: Optional[torch.Tensor] = None,
    blocksize: int = 4096,
    nested: bool = False,
) -> tuple[torch.Tensor, QuantState]:
```
Dequantize signature:
```python
def dequantize_blockwise(
    A: torch.Tensor,
    quant_state: Optional[QuantState] = None,
    absmax: Optional[torch.Tensor] = None,
    code: Optional[torch.Tensor] = None,
    out: Optional[torch.Tensor] = None,
    blocksize: int = 4096,
    nested: bool = False,
) -> torch.Tensor:
```
Import:
```python
from bitsandbytes.functional import quantize_blockwise, dequantize_blockwise
```
I/O Contract
quantize_blockwise Inputs
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| `A` | `torch.Tensor` | Yes | -- | Input tensor to quantize. Supports float16, bfloat16, or float32 dtypes. |
| `code` | `torch.Tensor` | No | signed dynamic map | Quantization codebook (256 levels). Defaults to `create_dynamic_map(signed=True)`. |
| `absmax` | `torch.Tensor` | No | None | Pre-allocated tensor for absmax values (deprecated). |
| `out` | `torch.Tensor` | No | None | Pre-allocated output tensor (deprecated). |
| `blocksize` | `int` | No | 4096 | Block size for quantization. Valid values: 64, 128, 256, 512, 1024, 2048, 4096. |
| `nested` | `bool` | No | False | Whether to additionally quantize the absmax scaling factors for extra compression. |
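The `blocksize` choice trades accuracy against storage overhead: smaller blocks track local value ranges more tightly but require more scaling factors. Assuming each absmax is stored as one fp32 per block (and ignoring the shared codebook), the effective cost is 8 + 32/`blocksize` bits per value, which this back-of-the-envelope loop illustrates:

```python
# Effective bits per value when storing one fp32 absmax per block.
# (Assumes fp32 absmax storage; ignores the shared 256-entry codebook.)
for blocksize in (64, 256, 1024, 4096):
    bits_per_value = 8 + 32 / blocksize
    ratio_vs_fp32 = 32 / bits_per_value
    print(f"blocksize={blocksize:5d}  bits/value={bits_per_value:.3f}  "
          f"compression vs fp32={ratio_vs_fp32:.2f}x")
```

Even at the smallest valid blocksize of 64, the overhead only raises the cost from 8 to 8.5 bits per value.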
quantize_blockwise Outputs
| Output | Type | Description |
|---|---|---|
| quantized tensor | torch.Tensor (uint8) | The quantized tensor with values in [0, 255] representing codebook indices |
| quant_state | QuantState | State object containing: `absmax` (per-block scaling factors), `code` (quantization codebook), `blocksize`, `dtype` (original dtype), and optionally `offset` and `state2` (for nested quantization) |
dequantize_blockwise Inputs
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| `A` | `torch.Tensor` (uint8) | Yes | -- | The quantized input tensor |
| `quant_state` | `QuantState` | No* | None | The quantization state returned by `quantize_blockwise`. *Required if `absmax` is not provided. |
| `absmax` | `torch.Tensor` | No* | None | Per-block scaling factors. *Required if `quant_state` is not provided. |
| `code` | `torch.Tensor` | No | signed dynamic map | Quantization codebook. Ignored when `quant_state` is provided. |
| `out` | `torch.Tensor` | No | None | Pre-allocated output tensor. |
| `blocksize` | `int` | No | 4096 | Block size. Ignored when `quant_state` is provided. |
dequantize_blockwise Output
| Output | Type | Description |
|---|---|---|
| dequantized tensor | torch.Tensor | The dequantized tensor. Dtype is determined by `quant_state.dtype` (defaults to float32). |
Usage Examples
Quantize and dequantize a tensor:
```python
import torch
from bitsandbytes.functional import quantize_blockwise, dequantize_blockwise

# Create a sample tensor
tensor = torch.randn(8192, dtype=torch.float32, device="cuda")

# Quantize to 8-bit with blocksize 4096
quantized, quant_state = quantize_blockwise(tensor, blocksize=4096)

# quantized is uint8, quant_state contains absmax and codebook
print(quantized.dtype)           # torch.uint8
print(quant_state.blocksize)     # 4096
print(quant_state.absmax.shape)  # torch.Size([2]) -- 8192/4096 = 2 blocks

# Dequantize back to float32
recovered = dequantize_blockwise(quantized, quant_state)
print(recovered.dtype)           # torch.float32

# Check reconstruction error
max_error = (tensor - recovered).abs().max()
print(f"Max reconstruction error: {max_error:.6f}")
```
With nested quantization for additional compression:
```python
quantized, quant_state = quantize_blockwise(
    tensor, blocksize=4096, nested=True
)

# quant_state now contains nested state for the absmax values
print(quant_state.offset)  # mean of absmax values
print(quant_state.state2)  # nested QuantState for absmax

recovered = dequantize_blockwise(quantized, quant_state)
```
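A rough byte count shows why nesting the absmax values pays off for large tensors. This is a sketch under stated assumptions, not the library's exact bookkeeping: it assumes the nested absmax values are stored as uint8, the inner quantization keeps one fp32 absmax per 256 values (the inner blocksize is an assumption here), plus one fp32 offset.

```python
n = 1_000_000                         # elements in the original tensor
blocksize = 4096
n_blocks = -(-n // blocksize)         # ceil division: 245 blocks

# Plain blockwise: one fp32 absmax (4 bytes) per block.
plain_bytes = n_blocks * 4

# Nested (sketch): uint8 absmax per block, plus fp32 absmax for the inner
# quantization (inner blocksize of 256 assumed for illustration) and one
# fp32 offset.
inner_blocks = -(-n_blocks // 256)
nested_bytes = n_blocks * 1 + inner_blocks * 4 + 4

print(plain_bytes, nested_bytes)
```

Under these assumptions the scaling-factor overhead shrinks from 980 bytes to 253 bytes for a million-element tensor, at the cost of a second (much smaller) quantization error on the scales themselves.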