
Implementation:Bitsandbytes foundation Bitsandbytes Quantize 4bit

From Leeroopedia


Metadata

Page Type: Implementation (API Doc)
Knowledge Sources: Repo (bitsandbytes), Paper (QLoRA)
Domains: Quantization
Last Updated: 2026-02-07 14:00 GMT

Overview

A concrete tool provided by the bitsandbytes library for quantizing tensors to 4-bit precision.

Description

quantize_4bit is the core low-level function that performs blockwise 4-bit quantization of a floating-point tensor. It divides the input tensor into contiguous blocks, computes a per-block absolute maximum scaling factor, quantizes each element to a 4-bit codebook entry (NF4 or FP4), and packs two 4-bit values per byte into the output tensor.

The function operates as follows:

  1. Determine block size: If not specified, defaults to 64 on CUDA or 128 on ROCm (based on warp size).
  2. Dispatch to native kernel: Calls torch.ops.bitsandbytes.quantize_4bit.default, which executes the GPU kernel that performs per-block absmax computation, normalization, codebook lookup, and packing in a single fused operation.
  3. Build codebook: Retrieves the 4-bit type codebook via get_4bit_type(quant_type) for inclusion in the quantization state.
  4. Optional double quantization: If compress_statistics=True, the absmax values are further compressed:
    • Compute the mean of all absmax values (stored as a float32 offset).
    • Subtract the mean.
    • Quantize the centered absmax values to 8-bit using quantize_blockwise with a block size of 256.
    • Store the 8-bit quantized absmax, the second-level state, and the offset in a nested QuantState.
  5. Construct QuantState: Package all metadata (absmax, original shape, dtype, blocksize, codebook, quant_type, and optional nested state) into a QuantState object.

The function returns a tuple of the packed 4-bit tensor and the QuantState needed for dequantization.
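The blockwise scheme in steps 1–3 can be sketched in pure Python. This is an illustration only: the real implementation is a single fused GPU kernel, the 16 NF4 codebook values below are taken from the QLoRA paper, and the high-nibble-first packing order is an assumption made for this sketch.

```python
# Pure-Python sketch of blockwise 4-bit NF4 quantization (illustrative,
# not the fused CUDA kernel). NF4 codebook values are those published in
# the QLoRA paper, normalized to [-1, 1].
NF4_CODE = [
    -1.0, -0.6961928009986877, -0.5250730514526367, -0.39491748809814453,
    -0.28444138169288635, -0.18477343022823334, -0.09105003625154495, 0.0,
    0.07958029955625534, 0.16093020141124725, 0.2461123913526535,
    0.33791524171829224, 0.44070982933044434, 0.5626170039176941,
    0.7229568362236023, 1.0,
]

def quantize_block(values, code=NF4_CODE):
    """Quantize one block: per-block absmax scaling + nearest codebook index."""
    absmax = max(abs(v) for v in values) or 1.0
    indices = [
        min(range(len(code)), key=lambda i: abs(code[i] - v / absmax))
        for v in values
    ]
    return absmax, indices

def pack_nibbles(indices):
    """Pack two 4-bit codebook indices per byte (nibble order assumed here)."""
    return [
        (indices[i] << 4) | (indices[i + 1] if i + 1 < len(indices) else 0)
        for i in range(0, len(indices), 2)
    ]

absmax, idx = quantize_block([1.0, -1.0, 0.0, 0.5])
packed = pack_nibbles(idx)  # idx == [15, 0, 7, 12] -> packed == [240, 124]
```

Dequantization reverses this: unpack each nibble, look it up in the codebook, and multiply by the block's absmax.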

Code Reference

Source Location

bitsandbytes repository, file bitsandbytes/functional.py, lines 826–904.

Signature

def quantize_4bit(
    A: torch.Tensor,
    absmax: Optional[torch.Tensor] = None,
    out: Optional[torch.Tensor] = None,
    blocksize=None,
    compress_statistics=False,
    quant_type="fp4",
    quant_storage=torch.uint8,
) -> tuple[torch.Tensor, QuantState]:

Import

from bitsandbytes.functional import quantize_4bit

I/O Contract

Inputs

  • A (torch.Tensor, required): The input tensor to quantize. Supports float16, bfloat16, and float32 dtypes.
  • absmax (torch.Tensor, optional): Pre-allocated tensor for storing per-block absolute maximum values. If provided, the computed absmax is copied into this tensor.
  • out (torch.Tensor, optional): Pre-allocated output tensor for the packed 4-bit result. If provided, the result is copied into this tensor.
  • blocksize (int, optional): The number of elements per quantization block. Defaults to 64 on CUDA, 128 on ROCm. Valid values: 64, 128, 256, 512, 1024, 2048, 4096.
  • compress_statistics (bool, optional): Whether to apply double quantization to the absmax scaling factors. Defaults to False.
  • quant_type (str, optional): The quantization data type, "fp4" or "nf4". Defaults to "fp4".
  • quant_storage (torch.dtype, optional): The dtype used to store the packed 4-bit output. Defaults to torch.uint8.
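The interaction of blocksize and compress_statistics determines the storage overhead of the scaling factors. A back-of-envelope sketch, assuming one float32 absmax per block and the second-level block size of 256 described above:

```python
def bits_per_param(blocksize=64, compress_statistics=False):
    """Approximate storage bits per tensor element (weights plus scaling factors)."""
    weight_bits = 4.0  # two 4-bit elements packed per byte
    if compress_statistics:
        # 8-bit absmax per block, plus a float32 second-level scale per
        # 256 blocks (the single float32 offset is negligible).
        absmax_bits = 8.0 / blocksize + 32.0 / (blocksize * 256)
    else:
        absmax_bits = 32.0 / blocksize  # one float32 absmax per block
    return weight_bits + absmax_bits

bits_per_param(64)        # 4.5
bits_per_param(64, True)  # ~4.127
```

With the default blocksize of 64, double quantization cuts the absmax overhead from 0.5 to about 0.127 bits per parameter, matching the figure reported in the QLoRA paper.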

Outputs

  • Packed tensor (torch.Tensor): The quantized tensor with packed 4-bit values stored in the quant_storage dtype. The shape is resized so that two 4-bit values share one byte.
  • Quant state (QuantState): Metadata required for dequantization, containing absmax (per-block scaling factors), shape (original tensor shape), code (4-bit codebook), blocksize, quant_type, and dtype (original dtype). When compress_statistics=True, it also contains offset (float32 mean of the absmax values) and state2 (nested 8-bit quantization state for the absmax values).
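The offset/state2 relationship can be mirrored with a scalar sketch. Here quantize_u8 is a hypothetical stand-in for the real quantize_blockwise call (which uses a block size of 256); only the arithmetic is illustrated.

```python
def quantize_u8(values):
    """Symmetric 8-bit absmax quantization of one block (hypothetical helper)."""
    scale = max(abs(v) for v in values) or 1.0
    return [round(v / scale * 127) for v in values], scale

def double_quantize_absmax(absmax):
    offset = sum(absmax) / len(absmax)       # float32 mean, kept in QuantState
    centered = [a - offset for a in absmax]  # subtract the mean
    q, scale = quantize_u8(centered)         # second-level 8-bit quantization
    return q, scale, offset                  # q + scale play the role of state2

def recover_absmax(q, scale, offset):
    """Reverse the double quantization during dequantization."""
    return [v / 127 * scale + offset for v in q]

q, scale, offset = double_quantize_absmax([1.0, 2.0, 3.0])
recover_absmax(q, scale, offset)  # -> [1.0, 2.0, 3.0] (exact for this input)
```

Centering on the mean before quantizing keeps the 8-bit range symmetric around typical absmax magnitudes, which is why the float32 offset must travel with the nested state.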

Usage Examples

Quantize and Dequantize a Tensor

import torch
from bitsandbytes.functional import quantize_4bit, dequantize_4bit

# Create a float16 tensor
original = torch.randn(4096, 4096, dtype=torch.float16, device="cuda")

# Quantize to NF4 with double quantization
packed, quant_state = quantize_4bit(
    original,
    blocksize=64,
    compress_statistics=True,
    quant_type="nf4",
)

print(f"Original size: {original.nelement() * 2} bytes")      # 33554432 bytes
print(f"Packed size:   {packed.nelement()} bytes")             # 8388608 bytes
print(f"Compression:   ~{original.nelement() * 2 / packed.nelement():.1f}x")

# Dequantize back to float16
reconstructed = dequantize_4bit(packed, quant_state)
print(f"Reconstruction error (MSE): {(original - reconstructed).pow(2).mean().item():.6f}")

Quantize with FP4 (No Double Quantization)

import torch
from bitsandbytes.functional import quantize_4bit

tensor = torch.randn(1024, dtype=torch.bfloat16, device="cuda")

packed, quant_state = quantize_4bit(
    tensor,
    compress_statistics=False,
    quant_type="fp4",
)

# Inspect the quantization state
print(f"Block size: {quant_state.blocksize}")
print(f"Quant type: {quant_state.quant_type}")
print(f"Original dtype: {quant_state.dtype}")
print(f"Number of blocks: {quant_state.absmax.numel()}")
print(f"Codebook entries: {quant_state.code.numel()}")  # 16 entries
