Principle: Bitsandbytes 4-bit Quantization Configuration

From Leeroopedia


Metadata

Field Value
Page Type Principle
Knowledge Sources Paper (QLoRA: Efficient Finetuning of Quantized LLMs), Repo (bitsandbytes)
Domains Quantization, NLP
Last Updated 2026-02-07 14:00 GMT

Overview

Configuring 4-bit quantization parameters, including the choice of quantization data type (NF4 or FP4), compute precision, and double quantization, to achieve memory-efficient inference and fine-tuning with controlled accuracy tradeoffs.

Description

4-bit quantization configuration determines how model weights are compressed from 16-bit or 32-bit floating point down to 4-bit representations. The configuration involves several interdependent choices:

Quantization Type: NF4 vs FP4

Two 4-bit data types are available:

  • NF4 (NormalFloat4): An information-theoretically optimal data type for quantizing values drawn from a normal distribution. NF4 places its 16 quantization levels such that each bin covers an equal probability mass under a standard normal distribution N(0, 1), normalized to the range [-1, 1]. Because pretrained neural network weights are approximately normally distributed, NF4 preserves more information per bit than uniformly spaced quantization.
  • FP4 (4-bit Floating Point): A conventional 4-bit floating point format with sign, exponent, and mantissa bits. FP4 provides a uniform-in-log-space representation and may be preferable when weight distributions deviate significantly from normality.

In practice, NF4 yields better accuracy for most pretrained language models.
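
In the Hugging Face transformers integration of bitsandbytes, this choice is made through `BitsAndBytesConfig`; a minimal sketch (requires the transformers and bitsandbytes packages):

```python
from transformers import BitsAndBytesConfig

# "nf4" places its 16 levels at normal-distribution quantiles;
# "fp4" uses conventional sign/exponent/mantissa spacing.
nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
)
```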

Compute Dtype

The compute dtype specifies the precision used for arithmetic during the forward pass. Weights are stored in 4-bit but must be dequantized to a higher-precision format for matrix multiplication. Common choices include:

  • bfloat16: Recommended for most use cases. Provides good numerical stability with the same exponent range as float32 and fast hardware support on modern GPUs.
  • float16: Offers slightly higher mantissa precision than bfloat16 but with a narrower dynamic range.
  • float32: Maximum precision but significantly slower on GPU; generally not recommended unless required for numerical analysis.
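
The range/precision tradeoff between the two 16-bit formats is visible from their bit layouts alone; a quick pure-Python check (no GPU libraries required):

```python
# Largest finite value of a binary float with e exponent bits and
# m mantissa bits: (2 - 2**-m) * 2**(2**(e-1) - 1)
def max_finite(e, m):
    return (2 - 2.0 ** -m) * 2.0 ** (2 ** (e - 1) - 1)

bf16_max = max_finite(e=8, m=7)   # ~3.4e38: same exponent range as float32
fp16_max = max_finite(e=5, m=10)  # 65504: narrower range, more mantissa bits
```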

Double Quantization (compress_statistics)

Double quantization is a technique that further compresses the quantization constants (the per-block absolute maximum values, or absmax values) themselves. In standard blockwise quantization, each block of weights requires one float32 absmax scaling factor. Double quantization applies a second round of 8-bit blockwise quantization to these absmax values, reducing their memory footprint from 32 bits to approximately 8 bits per block plus a small overhead for the second-level quantization state. This results in additional memory savings of roughly 0.37 bits per parameter at negligible accuracy cost.
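
The ~0.37-bit saving follows directly from the block sizes involved (one float32 absmax per 64-weight block, second-level blocks of 256 absmax values); a quick check:

```python
blocksize = 64        # weights per first-level block (one absmax each)
dq_blocksize = 256    # absmax values per second-level block

plain = 32 / blocksize                                    # 0.5 bits/param
double = 8 / blocksize + 32 / (blocksize * dq_blocksize)  # ~0.127 bits/param
savings = plain - double                                  # ~0.373 bits/param
```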

Memory and Accuracy Tradeoffs

The combination of these parameters determines the overall memory footprint and model quality:

  • NF4 + double quantization stores approximately 4.127 bits per parameter in total (4 bits for weights, ~0.127 bits for quantization constants).
  • Without double quantization, the total is approximately 4.5 bits per parameter (4 bits for weights, 0.5 bits for constants).
  • The compute dtype affects inference speed more than accuracy; using bfloat16 is typically 2-4x faster than float32 with minimal quality difference.
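
These bit counts translate directly into footprint estimates; pure arithmetic, with a 70B parameter count chosen to match the example in the Usage section:

```python
def model_gb(n_params, bits_per_param):
    # Approximate weight-storage footprint in gigabytes.
    return n_params * bits_per_param / 8 / 1e9

params = 70e9
fp16_gb = model_gb(params, 16)       # 140 GB
nf4_dq_gb = model_gb(params, 4.127)  # ~36 GB including quantization constants
```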

Usage

4-bit quantization configuration is applied when:

  • Large models exceed GPU memory: A 70B parameter model at 16-bit requires approximately 140 GB of memory, but at 4-bit requires roughly 35 GB, fitting on a single 48 GB GPU.
  • QLoRA fine-tuning: The base model is frozen in 4-bit while LoRA adapter weights are trained in higher precision. NF4 with double quantization is the standard configuration for QLoRA.
  • Memory-efficient inference: Running inference on consumer-grade GPUs with limited VRAM. 4-bit quantization enables serving models that would otherwise require multi-GPU setups.
  • Exploring model behavior: Quickly loading and testing large models on available hardware before committing to full-precision deployment.
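
Putting these settings together, a typical QLoRA-style load through the transformers integration looks roughly like this (the model id is illustrative; assumes bitsandbytes and a CUDA GPU are available):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NF4 weight storage
    bnb_4bit_compute_dtype=torch.bfloat16,  # dequantize to bf16 for matmuls
    bnb_4bit_use_double_quant=True,         # compress quantization constants
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-hf",            # illustrative model id
    quantization_config=bnb_config,
    device_map="auto",
)
```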

Theoretical Basis

NormalFloat4 (NF4)

The NF4 data type is derived from information theory. Given that pretrained neural network weights follow an approximately normal distribution, an information-theoretically optimal quantization (quantile quantization) assigns equal probability mass to each quantization bin. For a standard normal distribution N(0, 1):

  1. Compute the quantiles of N(0, 1) at 2^k + 1 equally spaced probabilities, where k = 4 (yielding 17 quantile boundaries for 16 bins).
  2. Take the midpoint of each consecutive pair of quantiles as the quantization level.
  3. Normalize the resulting codebook to the range [-1, 1].

This ensures that each of the 16 representable values captures an equal share of the probability density, maximizing information retention for normally distributed data.
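
The recipe above can be sketched with the standard library's `NormalDist`. The `eps` parameter, which keeps the outermost quantiles finite, is an illustrative stand-in; the QLoRA reference code uses a specific offset instead:

```python
from statistics import NormalDist

def nf4_like_codebook(k=4, eps=1e-4):
    # Step 1: quantiles of N(0, 1) at 2**k + 1 equally spaced probabilities,
    # clipped away from 0 and 1 so the quantiles stay finite.
    n = NormalDist()
    num_q = 2 ** k + 1
    probs = [eps + i * (1 - 2 * eps) / (num_q - 1) for i in range(num_q)]
    q = [n.inv_cdf(p) for p in probs]
    # Step 2: midpoints of consecutive quantiles are the 16 levels.
    levels = [(a + b) / 2 for a, b in zip(q, q[1:])]
    # Step 3: normalize the codebook to [-1, 1].
    m = max(abs(v) for v in levels)
    return [v / m for v in levels]

codebook = nf4_like_codebook()  # 16 values spanning [-1, 1]
```

Note that the actual NF4 codebook is built asymmetrically so that zero is exactly representable; this symmetric sketch omits that detail.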

FP4

FP4 uses a conventional floating-point encoding with sign, exponent, and mantissa bits. It provides logarithmically spaced quantization levels, which can be advantageous for distributions with heavy tails or significant deviation from normality.

Blockwise Quantization

Rather than quantizing an entire weight matrix with a single scaling factor, blockwise quantization divides the flattened weight tensor into contiguous blocks (typically 64 elements on CUDA, 128 on ROCm). Each block is independently scaled by its absolute maximum value:

  1. Flatten the weight tensor.
  2. Divide into blocks of blocksize consecutive elements.
  3. For each block, compute absmax = max(|w_i|) for all elements w_i in the block.
  4. Scale elements to the range [-1, 1] by dividing by absmax.
  5. Map each scaled element to the nearest value in the NF4 or FP4 codebook.

This per-block scaling allows finer-grained adaptation to local weight distributions, preserving more information than global quantization.
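
The five steps can be sketched in pure Python; `codebook` here is any sorted list of 16 values spanning [-1, 1] (e.g. the NF4 levels), and real kernels pack two 4-bit codes per byte rather than keeping plain index lists:

```python
def quantize_blockwise(weights, codebook, blocksize=64):
    codes, absmaxes = [], []
    for start in range(0, len(weights), blocksize):
        block = weights[start:start + blocksize]
        absmax = max(abs(w) for w in block) or 1.0  # guard all-zero blocks
        absmaxes.append(absmax)
        for w in block:
            scaled = w / absmax  # scale the block into [-1, 1]
            # index of the nearest codebook value
            codes.append(min(range(len(codebook)),
                             key=lambda i: abs(codebook[i] - scaled)))
    return codes, absmaxes

def dequantize_blockwise(codes, absmaxes, codebook, blocksize=64):
    # Look up each code and rescale by its block's absmax.
    return [codebook[c] * absmaxes[i // blocksize]
            for i, c in enumerate(codes)]
```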

Double Quantization

The per-block absmax values themselves occupy memory (one float32 per block). Double quantization compresses these constants:

  1. Collect all absmax values across the tensor.
  2. Subtract the mean (stored as a float32 offset).
  3. Quantize the centered absmax values using 8-bit blockwise quantization with a block size of 256.

This reduces the per-parameter overhead of quantization constants from approximately 0.5 bits (32 bits / 64 elements per block) to approximately 0.127 bits, yielding a total overhead near 4.127 bits per parameter when combined with 4-bit weights.
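
The three steps can be sketched as follows; the input is the list of absmax values collected in step 1, and plain symmetric 8-bit quantization stands in for the dynamic 8-bit data type bitsandbytes actually uses, so the codes are illustrative:

```python
def double_quantize_absmax(absmaxes, dq_blocksize=256):
    # Step 2: subtract the mean, stored once as a float32 offset.
    mean = sum(absmaxes) / len(absmaxes)
    centered = [a - mean for a in absmaxes]
    codes, scales = [], []
    # Step 3: 8-bit blockwise quantization of the centered values.
    for start in range(0, len(centered), dq_blocksize):
        block = centered[start:start + dq_blocksize]
        scale = max(abs(v) for v in block) or 1.0
        scales.append(scale)
        codes.append([round(v / scale * 127) for v in block])
    return codes, scales, mean
```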
