
Implementation:Bitsandbytes foundation Bitsandbytes Quantize 4bit

From Leeroopedia


Metadata

Page Type: Implementation (API Doc)
Knowledge Sources: Repo (bitsandbytes), Paper (QLoRA)
Domains: Quantization
Last Updated: 2026-02-07 14:00 GMT

Overview

A concrete tool provided by the bitsandbytes library for quantizing tensors to 4-bit precision.

Description

quantize_4bit is the core low-level function that performs blockwise 4-bit quantization of a floating-point tensor. It divides the input tensor into contiguous blocks, computes a per-block absolute maximum scaling factor, quantizes each element to a 4-bit codebook entry (NF4 or FP4), and packs two 4-bit values per byte into the output tensor.

The function operates as follows:

  1. Determine block size: If not specified, defaults to 64 on CUDA or 128 on ROCm (based on warp size).
  2. Dispatch to native kernel: Calls torch.ops.bitsandbytes.quantize_4bit.default, which executes the GPU kernel that performs per-block absmax computation, normalization, codebook lookup, and packing in a single fused operation.
  3. Build codebook: Retrieves the 4-bit type codebook via get_4bit_type(quant_type) for inclusion in the quantization state.
  4. Optional double quantization: If compress_statistics=True, the absmax values are further compressed:
    • Compute the mean of all absmax values (stored as a float32 offset).
    • Subtract the mean.
    • Quantize the centered absmax values to 8-bit using quantize_blockwise with a block size of 256.
    • Store the 8-bit quantized absmax, the second-level state, and the offset in a nested QuantState.
  5. Construct QuantState: Package all metadata (absmax, original shape, dtype, blocksize, codebook, quant_type, and optional nested state) into a QuantState object.

The function returns a tuple of the packed 4-bit tensor and the QuantState needed for dequantization.
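The blockwise scheme in steps 1–3 can be sketched in pure Python. This is an illustration only: the real implementation is a single fused GPU kernel, the 16 NF4 codebook values below are taken from the QLoRA paper, and the high-nibble-first packing order is an assumption made for this sketch.

```python
# Pure-Python sketch of blockwise 4-bit NF4 quantization (illustrative,
# not the fused CUDA kernel). NF4 codebook values are those published in
# the QLoRA paper, normalized to [-1, 1].
NF4_CODE = [
    -1.0, -0.6961928009986877, -0.5250730514526367, -0.39491748809814453,
    -0.28444138169288635, -0.18477343022823334, -0.09105003625154495, 0.0,
    0.07958029955625534, 0.16093020141124725, 0.2461123913526535,
    0.33791524171829224, 0.44070982933044434, 0.5626170039176941,
    0.7229568362236023, 1.0,
]

def quantize_block(values, code=NF4_CODE):
    """Quantize one block: per-block absmax scaling + nearest codebook index."""
    absmax = max(abs(v) for v in values) or 1.0
    indices = [
        min(range(len(code)), key=lambda i: abs(code[i] - v / absmax))
        for v in values
    ]
    return absmax, indices

def pack_nibbles(indices):
    """Pack two 4-bit codebook indices per byte (nibble order assumed here)."""
    return [
        (indices[i] << 4) | (indices[i + 1] if i + 1 < len(indices) else 0)
        for i in range(0, len(indices), 2)
    ]

absmax, idx = quantize_block([1.0, -1.0, 0.0, 0.5])
packed = pack_nibbles(idx)  # idx == [15, 0, 7, 12] -> packed == [240, 124]
```

Dequantization reverses this: unpack each nibble, look it up in the codebook, and multiply by the block's absmax.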

Code Reference

Source Location

bitsandbytes repository, file bitsandbytes/functional.py, lines 826–904.

Signature

def quantize_4bit(
    A: torch.Tensor,
    absmax: Optional[torch.Tensor] = None,
    out: Optional[torch.Tensor] = None,
    blocksize=None,
    compress_statistics=False,
    quant_type="fp4",
    quant_storage=torch.uint8,
) -> tuple[torch.Tensor, QuantState]:

Import

from bitsandbytes.functional import quantize_4bit

I/O Contract

Inputs

  • A (torch.Tensor, required): The input tensor to quantize. Supports float16, bfloat16, and float32 dtypes.
  • absmax (torch.Tensor, optional): Pre-allocated tensor for storing per-block absolute maximum values. If provided, the computed absmax is copied into this tensor.
  • out (torch.Tensor, optional): Pre-allocated output tensor for the packed 4-bit result. If provided, the result is copied into this tensor.
  • blocksize (int, optional): The number of elements per quantization block. Defaults to 64 on CUDA, 128 on ROCm. Valid values: 64, 128, 256, 512, 1024, 2048, 4096.
  • compress_statistics (bool, optional): Whether to apply double quantization to the absmax scaling factors. Defaults to False.
  • quant_type (str, optional): The quantization data type, "fp4" or "nf4". Defaults to "fp4".
  • quant_storage (torch.dtype, optional): The dtype used to store the packed 4-bit output. Defaults to torch.uint8.
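The interaction of blocksize and compress_statistics determines the storage overhead of the scaling factors. A back-of-envelope sketch, assuming one float32 absmax per block and the second-level block size of 256 described above:

```python
def bits_per_param(blocksize=64, compress_statistics=False):
    """Approximate storage bits per tensor element (weights plus scaling factors)."""
    weight_bits = 4.0  # two 4-bit elements packed per byte
    if compress_statistics:
        # 8-bit absmax per block, plus a float32 second-level scale per
        # 256 blocks (the single float32 offset is negligible).
        absmax_bits = 8.0 / blocksize + 32.0 / (blocksize * 256)
    else:
        absmax_bits = 32.0 / blocksize  # one float32 absmax per block
    return weight_bits + absmax_bits

bits_per_param(64)        # 4.5
bits_per_param(64, True)  # ~4.127
```

With the default blocksize of 64, double quantization cuts the absmax overhead from 0.5 to about 0.127 bits per parameter, matching the figure reported in the QLoRA paper.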

Outputs

  • Packed tensor (torch.Tensor): The quantized tensor with packed 4-bit values stored in the quant_storage dtype. The shape is resized so that two 4-bit values share one byte.
  • Quant state (QuantState): Metadata required for dequantization, containing absmax (per-block scaling factors), shape (original tensor shape), code (4-bit codebook), blocksize, quant_type, and dtype (original dtype). When compress_statistics=True, it also contains offset (float32 mean of the absmax values) and state2 (nested 8-bit quantization state for the absmax values).
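The offset/state2 relationship can be mirrored with a scalar sketch. Here quantize_u8 is a hypothetical stand-in for the real quantize_blockwise call (which uses a block size of 256); only the arithmetic is illustrated.

```python
def quantize_u8(values):
    """Symmetric 8-bit absmax quantization of one block (hypothetical helper)."""
    scale = max(abs(v) for v in values) or 1.0
    return [round(v / scale * 127) for v in values], scale

def double_quantize_absmax(absmax):
    offset = sum(absmax) / len(absmax)       # float32 mean, kept in QuantState
    centered = [a - offset for a in absmax]  # subtract the mean
    q, scale = quantize_u8(centered)         # second-level 8-bit quantization
    return q, scale, offset                  # q + scale play the role of state2

def recover_absmax(q, scale, offset):
    """Reverse the double quantization during dequantization."""
    return [v / 127 * scale + offset for v in q]

q, scale, offset = double_quantize_absmax([1.0, 2.0, 3.0])
recover_absmax(q, scale, offset)  # -> [1.0, 2.0, 3.0] (exact for this input)
```

Centering on the mean before quantizing keeps the 8-bit range symmetric around typical absmax magnitudes, which is why the float32 offset must travel with the nested state.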

Usage Examples

Quantize and Dequantize a Tensor

import torch
from bitsandbytes.functional import quantize_4bit, dequantize_4bit

# Create a float16 tensor
original = torch.randn(4096, 4096, dtype=torch.float16, device="cuda")

# Quantize to NF4 with double quantization
packed, quant_state = quantize_4bit(
    original,
    blocksize=64,
    compress_statistics=True,
    quant_type="nf4",
)

print(f"Original size: {original.nelement() * 2} bytes")      # 33554432 bytes
print(f"Packed size:   {packed.nelement()} bytes")             # 8388608 bytes
print(f"Compression:   ~{original.nelement() * 2 / packed.nelement():.1f}x")

# Dequantize back to float16
reconstructed = dequantize_4bit(packed, quant_state)
print(f"Reconstruction error (MSE): {(original - reconstructed).pow(2).mean().item():.6f}")

Quantize with FP4 (No Double Quantization)

import torch
from bitsandbytes.functional import quantize_4bit

tensor = torch.randn(1024, dtype=torch.bfloat16, device="cuda")

packed, quant_state = quantize_4bit(
    tensor,
    compress_statistics=False,
    quant_type="fp4",
)

# Inspect the quantization state
print(f"Block size: {quant_state.blocksize}")
print(f"Quant type: {quant_state.quant_type}")
print(f"Original dtype: {quant_state.dtype}")
print(f"Number of blocks: {quant_state.absmax.numel()}")
print(f"Codebook entries: {quant_state.code.numel()}")  # 16 entries
