Implementation: bitsandbytes quantize_4bit
Metadata
| Field | Value |
|---|---|
| Page Type | Implementation (API Doc) |
| Knowledge Sources | Repo (bitsandbytes), Paper (QLoRA) |
| Domains | Quantization |
| Last Updated | 2026-02-07 14:00 GMT |
Overview
Concrete tool for quantizing tensors to 4-bit precision provided by the bitsandbytes library.
Description
quantize_4bit is the core low-level function that performs blockwise 4-bit quantization of a floating-point tensor. It divides the input tensor into contiguous blocks, computes a per-block absolute maximum scaling factor, quantizes each element to a 4-bit codebook entry (NF4 or FP4), and packs two 4-bit values per byte into the output tensor.
The function operates as follows:
- Determine block size: if not specified, defaults to 64 on CUDA or 128 on ROCm (based on warp size).
- Dispatch to native kernel: calls `torch.ops.bitsandbytes.quantize_4bit.default`, which executes the GPU kernel performing per-block absmax computation, normalization, codebook lookup, and packing in a single fused operation.
- Build codebook: retrieves the 4-bit type codebook via `get_4bit_type(quant_type)` for inclusion in the quantization state.
- Optional double quantization: if `compress_statistics=True`, the absmax values are further compressed:
  - Compute the mean of all absmax values (stored as a float32 `offset`).
  - Subtract the mean from every absmax value.
  - Quantize the centered absmax values to 8-bit using `quantize_blockwise` with a block size of 256.
  - Store the 8-bit quantized absmax, the second-level state, and the offset in a nested `QuantState`.
- Construct `QuantState`: package all metadata (absmax, original shape, dtype, blocksize, codebook, quant_type, and optional nested state) into a `QuantState` object.
The function returns a tuple of the packed 4-bit tensor and the QuantState needed for dequantization.
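The blockwise scheme described above can be sketched in plain NumPy. This is a toy illustration of the math, not the fused CUDA kernel, and the uniform 16-entry codebook stands in for the real NF4/FP4 tables:

```python
import numpy as np

def quantize_4bit_sketch(x, code, blocksize=64):
    """Blockwise 4-bit quantization sketch: x is a 1-D float array whose
    length is a multiple of blocksize; code is a 16-entry codebook."""
    blocks = x.reshape(-1, blocksize)
    absmax = np.abs(blocks).max(axis=1)            # per-block scale
    absmax = np.where(absmax == 0, 1.0, absmax)    # avoid division by zero
    normed = blocks / absmax[:, None]              # values in [-1, 1]
    # Nearest codebook entry for every element.
    idx = np.abs(normed[..., None] - code).argmin(axis=-1).astype(np.uint8)
    flat = idx.reshape(-1)
    packed = (flat[0::2] << 4) | flat[1::2]        # two 4-bit indices per byte
    return packed, absmax

def dequantize_4bit_sketch(packed, absmax, code, blocksize=64):
    """Inverse of the sketch: unpack nibbles, look up codes, rescale."""
    hi, lo = packed >> 4, packed & 0x0F
    idx = np.stack([hi, lo], axis=1).reshape(-1)   # restore element order
    blocks = code[idx].reshape(-1, blocksize)
    return (blocks * absmax[:, None]).reshape(-1)

# Toy symmetric codebook (uniform; the real NF4/FP4 tables are non-uniform).
code = np.linspace(-1.0, 1.0, 16)
x = np.random.randn(256).astype(np.float32)
packed, absmax = quantize_4bit_sketch(x, code)
x_hat = dequantize_4bit_sketch(packed, absmax, code)
print(packed.nbytes)   # half as many bytes as x has elements
```

The round trip is lossy: each element is snapped to the nearest of 16 normalized values, so the reconstruction error per element is bounded by the codebook spacing times the block's absmax.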
Code Reference
Source Location
bitsandbytes repo, file: bitsandbytes/functional.py, lines 826-904.
Signature
```python
def quantize_4bit(
    A: torch.Tensor,
    absmax: Optional[torch.Tensor] = None,
    out: Optional[torch.Tensor] = None,
    blocksize=None,
    compress_statistics=False,
    quant_type="fp4",
    quant_storage=torch.uint8,
) -> tuple[torch.Tensor, QuantState]:
```
Import
```python
from bitsandbytes.functional import quantize_4bit
```
I/O Contract
Inputs
| Parameter | Type | Required | Description |
|---|---|---|---|
| `A` | `torch.Tensor` | Yes | The input tensor to quantize. Supports float16, bfloat16, or float32 dtypes. |
| `absmax` | `torch.Tensor` | No | Pre-allocated tensor for storing per-block absolute maximum values. If provided, the computed absmax is copied into this tensor. |
| `out` | `torch.Tensor` | No | Pre-allocated output tensor for the packed 4-bit result. If provided, the result is copied into this tensor. |
| `blocksize` | `int` | No | The number of elements per quantization block. Defaults to 64 on CUDA, 128 on ROCm. Valid values: 64, 128, 256, 512, 1024, 2048, 4096. |
| `compress_statistics` | `bool` | No | Whether to apply double quantization to the absmax scaling factors. Defaults to `False`. |
| `quant_type` | `str` | No | The quantization data type: `"fp4"` or `"nf4"`. Defaults to `"fp4"`. |
| `quant_storage` | `torch.dtype` | No | The dtype used to store the packed 4-bit output. Defaults to `torch.uint8`. |
Outputs
| Output | Type | Description |
|---|---|---|
| Packed tensor | `torch.Tensor` | The quantized tensor with packed 4-bit values stored in the `quant_storage` dtype. Shape is resized so that two 4-bit values share one byte. |
| Quant state | `QuantState` | Metadata required for dequantization: `absmax` (per-block scaling factors), `shape` (original tensor shape), `code` (4-bit codebook), `blocksize`, `quant_type`, and `dtype` (original dtype). When `compress_statistics=True`, also contains `offset` (float32 mean of absmax) and `state2` (nested 8-bit quantization state for the absmax values). |
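The double-quantization bookkeeping stored in the state can be approximated with a short sketch showing why compressing the fp32 absmax values pays off. The helper `double_quantize_absmax` is hypothetical, and a uniform symmetric 8-bit code stands in for the library's `quantize_blockwise` table:

```python
import numpy as np

def double_quantize_absmax(absmax, blocksize2=256):
    """Center absmax by its mean, then 8-bit blockwise-quantize the residual
    (illustrative stand-in for quantize_blockwise with blocksize 256)."""
    offset = np.float32(absmax.mean())          # stored as float32 offset
    centered = absmax - offset
    blocks = centered.reshape(-1, blocksize2)
    scale2 = np.abs(blocks).max(axis=1)         # second-level absmax
    scale2 = np.where(scale2 == 0, 1.0, scale2)
    q = np.round(blocks / scale2[:, None] * 127).astype(np.int8)
    return q, scale2, offset

# A 4096x4096 tensor at blocksize 64 yields 262144 absmax values.
absmax = np.abs(np.random.randn(262144)).astype(np.float32) + 0.5
q, scale2, offset = double_quantize_absmax(absmax)
recon = (q / 127.0 * scale2[:, None]).reshape(-1) + offset

fp32_bytes = absmax.nbytes               # 1 MiB of float32 scales
dq_bytes = q.nbytes + scale2.nbytes + 4  # 8-bit scales + 2nd-level state
print(fp32_bytes, dq_bytes)
```

Storing one signed byte per absmax plus a handful of second-level scales cuts the scaling-factor overhead by roughly 4x while reconstructing the original absmax values to high accuracy.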
Usage Examples
Quantize and Dequantize a Tensor
```python
import torch
from bitsandbytes.functional import quantize_4bit, dequantize_4bit

# Create a float16 tensor
original = torch.randn(4096, 4096, dtype=torch.float16, device="cuda")

# Quantize to NF4 with double quantization
packed, quant_state = quantize_4bit(
    original,
    blocksize=64,
    compress_statistics=True,
    quant_type="nf4",
)

print(f"Original size: {original.nelement() * 2} bytes")  # 33554432 bytes
print(f"Packed size: {packed.nelement()} bytes")          # 8388608 bytes
print(f"Compression: ~{original.nelement() * 2 / packed.nelement():.1f}x")

# Dequantize back to float16
reconstructed = dequantize_4bit(packed, quant_state)
print(f"Reconstruction error (MSE): {(original - reconstructed).pow(2).mean().item():.6f}")
```
Quantize with FP4 (No Double Quantization)
```python
import torch
from bitsandbytes.functional import quantize_4bit

tensor = torch.randn(1024, dtype=torch.bfloat16, device="cuda")
packed, quant_state = quantize_4bit(
    tensor,
    compress_statistics=False,
    quant_type="fp4",
)

# Inspect the quantization state
print(f"Block size: {quant_state.blocksize}")
print(f"Quant type: {quant_state.quant_type}")
print(f"Original dtype: {quant_state.dtype}")
print(f"Number of blocks: {quant_state.absmax.numel()}")  # 16 blocks at the CUDA default blocksize of 64
print(f"Codebook entries: {quant_state.code.numel()}")    # 16 entries
```
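Following the accounting in the QLoRA paper, the per-parameter memory overhead of the scaling factors can be worked out directly. This back-of-envelope helper `bits_per_param` is illustrative, not part of the library:

```python
def bits_per_param(blocksize=64, double_quant=False):
    """Effective storage cost per weight, ignoring the single float offset."""
    base = 4.0                                # packed 4-bit weight
    if not double_quant:
        return base + 32 / blocksize          # one fp32 absmax per block
    # Double quantization: one 8-bit absmax per block, plus a second-level
    # fp32 scale for every 256 absmax values (level-2 blocksize is 256).
    return base + 8 / blocksize + 32 / (blocksize * 256)

print(bits_per_param(64, double_quant=False))  # 4.5 bits per parameter
print(bits_per_param(64, double_quant=True))   # ~4.127 bits per parameter
```

At the default blocksize of 64, double quantization shrinks the scaling-factor overhead from 0.5 to about 0.127 bits per parameter, matching the savings reported in the QLoRA paper.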