
Implementation:Bitsandbytes foundation Bitsandbytes BitsAndBytesConfig 4bit

From Leeroopedia


Metadata

Field Value
Page Type Implementation (Wrapper Doc)
Knowledge Sources Repo (bitsandbytes), Doc (HuggingFace Transformers), Paper (QLoRA)
Domains Quantization, NLP
Last Updated 2026-02-07 14:00 GMT

Overview

A configuration class provided by the HuggingFace Transformers library for specifying 4-bit quantization parameters.

Description

BitsAndBytesConfig is a configuration class in the HuggingFace Transformers library that acts as the primary user-facing interface for specifying 4-bit quantization settings. It does not perform quantization itself; instead, it packages the quantization parameters and passes them to the bitsandbytes library during model loading.

When load_in_4bit=True is set, the Transformers model loading pipeline (e.g., AutoModelForCausalLM.from_pretrained) uses this configuration to:

  1. Replace standard nn.Linear layers with bitsandbytes Linear4bit layers.
  2. Configure each Linear4bit layer with the specified quantization type, compute dtype, and double quantization settings.
  3. Trigger lazy quantization when the model weights are transferred to the GPU device.

The configuration object bridges the gap between the high-level Transformers API and the low-level bitsandbytes quantization primitives, allowing users to control quantization behavior through a single, declarative interface.
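As a back-of-envelope illustration of why this matters, the sketch below compares the weight-storage footprint of a hypothetical 7B-parameter model in fp16 versus 4-bit. The figures are approximate and cover weights only; activations, the KV cache, and quantization constants are ignored.

```python
# Rough weight-memory estimate for a hypothetical 7B-parameter model.
# Weights only; activations, KV cache, and quantization constants are ignored.
params = 7e9

fp16_gb = params * 2 / 1e9    # 2 bytes per parameter -> 14.0 GB
nf4_gb = params * 0.5 / 1e9   # 4 bits = 0.5 bytes per parameter -> 3.5 GB

print(f"fp16: {fp16_gb:.1f} GB, 4-bit: {nf4_gb:.1f} GB")
```

This is why a 7B model that would not fit on a 16 GB consumer GPU in fp16 typically loads comfortably in 4-bit.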

Usage

Import BitsAndBytesConfig when you need to load a pretrained model in 4-bit precision for memory-efficient inference or QLoRA fine-tuning. It is typically passed as the quantization_config argument to from_pretrained.

Code Reference

Source Location

External (HuggingFace Transformers library, transformers.utils.quantization_config).

Signature

transformers.BitsAndBytesConfig(
    load_in_4bit: bool = False,
    bnb_4bit_quant_type: str = "fp4",
    bnb_4bit_compute_dtype: Optional[torch.dtype] = None,
    bnb_4bit_use_double_quant: bool = False,
    bnb_4bit_quant_storage: torch.dtype = torch.uint8,
)

Import

from transformers import BitsAndBytesConfig

I/O Contract

Inputs

Parameter Type Required Description
load_in_4bit bool Yes Enables 4-bit quantization when set to True. Must be True for 4-bit inference.
bnb_4bit_quant_type str No The quantization data type. Either "nf4" (NormalFloat4, recommended) or "fp4" (4-bit floating point). Defaults to "fp4".
bnb_4bit_compute_dtype torch.dtype No The dtype used for computation during the forward pass. Common values: torch.bfloat16, torch.float16, torch.float32. Defaults to None, in which case torch.float32 is used.
bnb_4bit_use_double_quant bool No Whether to apply double quantization (quantize the quantization constants). Reduces memory overhead at negligible accuracy cost. Defaults to False.
bnb_4bit_quant_storage torch.dtype No The dtype of the tensor used to physically store the packed 4-bit values. Defaults to torch.uint8.
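To make the double-quantization trade-off concrete, the arithmetic below reproduces the per-parameter overhead figures from the QLoRA paper, which uses a block size of 64 for the 4-bit weights and 256 for the second-level quantization. These block sizes are the paper's defaults, not values exposed by this config.

```python
# Overhead of storing quantization constants, in bits per model parameter.

# Without double quantization: one fp32 (32-bit) absmax per 64-weight block.
plain = 32 / 64                    # 0.5 bits per parameter

# With double quantization: absmax values are themselves quantized to 8 bits,
# with one fp32 second-level constant per block of 256 absmax values.
double = 8 / 64 + 32 / (64 * 256)  # ~0.127 bits per parameter

print(f"saving: {plain - double:.2f} bits per parameter")  # ~0.37
```

Roughly 0.37 bits per parameter saved, which is about 0.3 GB on a 7B model, at negligible accuracy cost per the paper.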

Outputs

Output Type Description
Config object BitsAndBytesConfig A configuration object passed to from_pretrained via the quantization_config parameter. Controls how Linear4bit layers are constructed during model loading.

Usage Examples

Loading a Model with NF4 + Double Quantization + BFloat16 Compute

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Configure 4-bit quantization
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

# Load model with 4-bit quantization
model_name = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quantization_config,
    device_map="auto",
)

# Run inference
inputs = tokenizer("Hello, world!", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Minimal FP4 Configuration

from transformers import BitsAndBytesConfig

# FP4 with default settings
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="fp4",
)
