
Implementation:Bitsandbytes foundation Bitsandbytes Quantize Blockwise

From Leeroopedia


Sources: Repo: bitsandbytes; Paper: 8-bit Optimizers via Block-wise Quantization
Domains: Quantization

Overview

quantize_blockwise is the concrete tool for blockwise 8-bit tensor quantization provided by the bitsandbytes library. It divides an input tensor into fixed-size blocks, computes a per-block scaling factor, and maps each value to the nearest point in a dynamic quantization codebook of 256 levels.

Description

quantize_blockwise implements the core block-wise quantization algorithm:

  1. Partition the input tensor into contiguous blocks of blocksize elements (default 4096).
  2. Compute per-block absmax: For each block, compute absmax_i = max(|block_i|) as the scaling factor.
  3. Quantize: Normalize each block by its absmax and map each value to the nearest point in a dynamic quantization map (codebook) of 256 levels for 8-bit output.
  4. Return the quantized uint8 tensor and a QuantState object containing the absmax values, codebook, blocksize, and original dtype.
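The four steps above can be sketched in pure Python. This is an illustrative reference, not the bitsandbytes implementation (which runs in fused kernels and returns tensors); the function name and the tiny 5-level codebook are stand-ins for illustration only.

```python
# Illustrative pure-Python sketch of the blockwise algorithm above
# (NOT the bitsandbytes kernel; names and codebook are hypothetical).

def quantize_blockwise_ref(values, codebook, blocksize=4):
    """Quantize a flat list of floats to codebook indices, block by block."""
    indices, absmaxes = [], []
    for start in range(0, len(values), blocksize):
        block = values[start:start + blocksize]
        # Step 2: per-block absmax as the scaling factor
        absmax = max(abs(v) for v in block) or 1.0  # guard all-zero blocks
        absmaxes.append(absmax)
        for v in block:
            normed = v / absmax  # Step 3: normalize into [-1, 1]
            # nearest codebook entry (real kernels use a faster lookup)
            idx = min(range(len(codebook)),
                      key=lambda i: abs(codebook[i] - normed))
            indices.append(idx)
    return indices, absmaxes  # Step 4: indices + scaling state

# Tiny uniform codebook stands in for the 256-level dynamic map.
codebook = [-1.0, -0.5, 0.0, 0.5, 1.0]
indices, absmaxes = quantize_blockwise_ref([0.1, -0.2, 0.4, -0.4], codebook)
print(absmaxes)  # [0.4] -- one block of 4 elements
print(indices)   # [2, 1, 4, 0] -- one codebook index per input value
```

With 8-bit output, each index fits in a uint8 and the only per-block overhead is the single absmax value.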

Dynamic quantization map: If no codebook is provided, the function creates and caches a signed dynamic map via create_dynamic_map(signed=True). This map has 256 non-uniformly spaced levels optimized for the typical distribution of neural network values.
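To illustrate what "non-uniformly spaced" means, here is a toy signed map with levels denser near zero, where most neural-network values concentrate. This is not the actual create_dynamic_map construction; the cubic spacing is a hypothetical stand-in chosen only to show the shape of such a codebook.

```python
# Toy non-uniform signed map (NOT the exact bitsandbytes dynamic map):
# cubic spacing packs most levels near zero.

def toy_signed_map(n_levels=256):
    half = n_levels // 2
    # |x|**3 spacing: most of the levels fall inside |x| < 0.5
    pos = [(i / (half - 1)) ** 3 for i in range(half)]
    return sorted([-p for p in pos] + pos)

code = toy_signed_map(8)
print(code)  # 8 levels, symmetric around 0, denser near 0
```

A uniform 8-bit map would waste many levels on the rare large-magnitude values; a non-uniform map spends its 256 levels where the data actually lives.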

Nested quantization: When nested=True, the absmax scaling factors are further quantized to reduce overhead:

  1. Compute the mean of all absmax values as an offset.
  2. Subtract the offset from the absmax values.
  3. Recursively call quantize_blockwise on the centered absmax values (with nested=False).
  4. Store the offset and nested quantization state in the returned QuantState.
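The nested steps can be sketched the same way, reusing the reference-quantizer idea: the per-block absmax values are themselves centered and quantized a second time. Function names here are hypothetical illustrations, not bitsandbytes API.

```python
# Sketch of the nested-quantization steps above (hypothetical helper,
# not the bitsandbytes API).

def nested_quantize_absmax(absmaxes, codebook, blocksize=4):
    offset = sum(absmaxes) / len(absmaxes)     # step 1: mean as offset
    centered = [a - offset for a in absmaxes]  # step 2: subtract offset
    # step 3: blockwise-quantize the centered absmax values (nested=False)
    indices, inner_absmaxes = [], []
    for start in range(0, len(centered), blocksize):
        block = centered[start:start + blocksize]
        inner = max(abs(v) for v in block) or 1.0
        inner_absmaxes.append(inner)
        for v in block:
            n = v / inner
            indices.append(min(range(len(codebook)),
                               key=lambda i: abs(codebook[i] - n)))
    # step 4: offset + inner state must be stored for dequantization
    return indices, inner_absmaxes, offset

idx, inner, offset = nested_quantize_absmax(
    [1.0, 3.0], [-1.0, -0.5, 0.0, 0.5, 1.0]
)
print(offset)  # 2.0 -- mean of the absmax values
print(idx)     # [0, 4] -- centered values -1.0 and 1.0 hit the map's ends
```

Centering first matters because absmax values are all positive; subtracting the mean lets a signed codebook cover them symmetrically.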

The corresponding dequantize_blockwise function reverses the process: it looks up each quantized index in the codebook, multiplies by the per-block absmax, and returns an FP32/FP16/BF16 tensor. When nested quantization was used, it first dequantizes the absmax values before proceeding.
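The lookup-and-rescale reversal can be sketched at the same reference level (again a hypothetical helper, not the library function, and omitting the nested-absmax case):

```python
# Reference dequantization matching the sketch-level algorithm:
# look up each index in the codebook, then rescale by the block absmax.

def dequantize_blockwise_ref(indices, absmaxes, codebook, blocksize=4):
    out = []
    for pos, idx in enumerate(indices):
        absmax = absmaxes[pos // blocksize]  # per-block scaling factor
        out.append(codebook[idx] * absmax)
    return out

codebook = [-1.0, -0.5, 0.0, 0.5, 1.0]
recovered = dequantize_blockwise_ref([2, 1, 4, 0], [0.4], codebook)
print(recovered)  # [0.0, -0.2, 0.4, -0.4]
```

Note the first value recovers as 0.0 rather than the original 0.1: rounding to the nearest codebook level is lossy, and the error bound per value is half the local codebook spacing times the block's absmax.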

The underlying computation is dispatched to optimized CUDA/ROCm/CPU kernels via torch.ops.bitsandbytes.quantize_blockwise.

Code Reference

| Source | File | Function | Lines |
|---|---|---|---|
| bitsandbytes repo | bitsandbytes/functional.py | quantize_blockwise | L570-638 |
| bitsandbytes repo | bitsandbytes/functional.py | dequantize_blockwise | L641-715 |

Function signature:

def quantize_blockwise(
    A: torch.Tensor,
    code: Optional[torch.Tensor] = None,
    absmax: Optional[torch.Tensor] = None,
    out: Optional[torch.Tensor] = None,
    blocksize: int = 4096,
    nested: bool = False,
) -> tuple[torch.Tensor, QuantState]:

Dequantize signature:

def dequantize_blockwise(
    A: torch.Tensor,
    quant_state: Optional[QuantState] = None,
    absmax: Optional[torch.Tensor] = None,
    code: Optional[torch.Tensor] = None,
    out: Optional[torch.Tensor] = None,
    blocksize: int = 4096,
    nested: bool = False,
) -> torch.Tensor:

Import:

from bitsandbytes.functional import quantize_blockwise, dequantize_blockwise

I/O Contract

quantize_blockwise Inputs

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| A | torch.Tensor | Yes | -- | Input tensor to quantize. Supports float16, bfloat16, or float32 dtypes. |
| code | torch.Tensor | No | signed dynamic map | Quantization codebook (256 levels). Defaults to create_dynamic_map(signed=True). |
| absmax | torch.Tensor | No | None | Pre-allocated tensor for absmax values (deprecated). |
| out | torch.Tensor | No | None | Pre-allocated output tensor (deprecated). |
| blocksize | int | No | 4096 | Block size for quantization. Valid values: 64, 128, 256, 512, 1024, 2048, 4096. |
| nested | bool | No | False | Whether to additionally quantize the absmax scaling factors for extra compression. |

quantize_blockwise Outputs

| Output | Type | Description |
|---|---|---|
| quantized tensor | torch.Tensor (uint8) | The quantized tensor with values in [0, 255] representing codebook indices. |
| quant_state | QuantState | State object containing: absmax (per-block scaling factors), code (quantization codebook), blocksize, dtype (original dtype), and optionally offset and state2 (for nested quantization). |

dequantize_blockwise Inputs

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| A | torch.Tensor (uint8) | Yes | -- | The quantized input tensor. |
| quant_state | QuantState | No* | None | The quantization state returned by quantize_blockwise. *Required if absmax is not provided. |
| absmax | torch.Tensor | No* | None | Per-block scaling factors. *Required if quant_state is not provided. |
| code | torch.Tensor | No | signed dynamic map | Quantization codebook. Ignored when quant_state is provided. |
| out | torch.Tensor | No | None | Pre-allocated output tensor. |
| blocksize | int | No | 4096 | Block size. Ignored when quant_state is provided. |

dequantize_blockwise Output

| Output | Type | Description |
|---|---|---|
| dequantized tensor | torch.Tensor | The dequantized tensor. Dtype is determined by quant_state.dtype (defaults to float32). |

Usage Examples

Quantize and dequantize a tensor:

import torch
from bitsandbytes.functional import quantize_blockwise, dequantize_blockwise

# Create a sample tensor
tensor = torch.randn(8192, dtype=torch.float32, device="cuda")

# Quantize to 8-bit with blocksize 4096
quantized, quant_state = quantize_blockwise(tensor, blocksize=4096)

# quantized is uint8, quant_state contains absmax and codebook
print(quantized.dtype)       # torch.uint8
print(quant_state.blocksize) # 4096
print(quant_state.absmax.shape)  # torch.Size([2]) -- 8192/4096 = 2 blocks

# Dequantize back to float32
recovered = dequantize_blockwise(quantized, quant_state)
print(recovered.dtype)  # torch.float32

# Check reconstruction error
max_error = (tensor - recovered).abs().max()
print(f"Max reconstruction error: {max_error:.6f}")

With nested quantization for additional compression:

quantized, quant_state = quantize_blockwise(
    tensor, blocksize=4096, nested=True
)

# quant_state now contains nested state for the absmax values
print(quant_state.offset)  # mean of absmax values
print(quant_state.state2)  # nested QuantState for absmax

recovered = dequantize_blockwise(quantized, quant_state)
