Implementation: BitsAndBytesConfig (bitsandbytes, 4-bit)
Metadata
| Field | Value |
|---|---|
| Page Type | Implementation (Wrapper Doc) |
| Knowledge Sources | Repo (bitsandbytes), Doc (HuggingFace Transformers), Paper (QLoRA) |
| Domains | Quantization, NLP |
| Last Updated | 2026-02-07 14:00 GMT |
Overview
Concrete tool for configuring 4-bit quantization parameters provided by the HuggingFace Transformers library.
Description
BitsAndBytesConfig is a configuration class in the HuggingFace Transformers library that acts as the primary user-facing interface for specifying 4-bit quantization settings. It does not perform quantization itself; instead, it packages the quantization parameters and passes them to the bitsandbytes library during model loading.
When load_in_4bit=True is set, the Transformers model loading pipeline (e.g., AutoModelForCausalLM.from_pretrained) uses this configuration to:
- Replace standard `nn.Linear` layers with bitsandbytes `Linear4bit` layers.
- Configure each `Linear4bit` layer with the specified quantization type, compute dtype, and double quantization settings.
- Trigger lazy quantization when the model weights are transferred to the GPU device.
The configuration object bridges the gap between the high-level Transformers API and the low-level bitsandbytes quantization primitives, allowing users to control quantization behavior through a single, declarative interface.
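To see why 4-bit loading matters in the first place, a back-of-the-envelope memory estimate (illustrative figures only; real footprints also include activations, the KV cache, and non-quantized modules such as embeddings and the LM head):

```python
def weight_memory_gib(n_params: float, bits_per_param: float) -> float:
    """Memory needed to store the weights alone, in GiB."""
    return n_params * bits_per_param / 8 / 1024**3

n_params = 7e9  # e.g. a 7B-parameter model

fp16_gib = weight_memory_gib(n_params, 16)  # half-precision baseline
int4_gib = weight_memory_gib(n_params, 4)   # packed 4-bit weights

print(f"fp16 weights: {fp16_gib:.1f} GiB")   # ~13.0 GiB
print(f"4-bit weights: {int4_gib:.1f} GiB")  # ~3.3 GiB
```

The 4x reduction in weight storage is what makes it possible to run or fine-tune (via QLoRA) models on a single consumer GPU that would otherwise not fit.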
Usage
Import BitsAndBytesConfig when you need to load a pretrained model in 4-bit precision for memory-efficient inference or QLoRA fine-tuning. It is typically passed as the quantization_config argument to from_pretrained.
Code Reference
Source Location
External (HuggingFace Transformers library, transformers.utils.quantization_config).
Signature
```python
transformers.BitsAndBytesConfig(
    load_in_4bit: bool = False,
    bnb_4bit_quant_type: str = "fp4",
    bnb_4bit_compute_dtype: Optional[torch.dtype] = None,
    bnb_4bit_use_double_quant: bool = False,
    bnb_4bit_quant_storage: torch.dtype = torch.uint8,
)
```
Import
```python
from transformers import BitsAndBytesConfig
```
I/O Contract
Inputs
| Parameter | Type | Required | Description |
|---|---|---|---|
| `load_in_4bit` | `bool` | Yes | Enables 4-bit quantization when set to `True`. Must be `True` for 4-bit inference. |
| `bnb_4bit_quant_type` | `str` | No | The quantization data type: either `"nf4"` (NormalFloat4, recommended) or `"fp4"` (4-bit floating point). Defaults to `"fp4"`. |
| `bnb_4bit_compute_dtype` | `torch.dtype` | No | The dtype used for computation during the forward pass. Common values: `torch.bfloat16`, `torch.float16`, `torch.float32`. Defaults to `None`, which resolves to `torch.float32`. |
| `bnb_4bit_use_double_quant` | `bool` | No | Whether to apply double quantization (quantizing the quantization constants themselves). Reduces memory overhead at negligible accuracy cost. Defaults to `False`. |
| `bnb_4bit_quant_storage` | `torch.dtype` | No | The dtype of the tensor used to physically store the packed 4-bit values. Defaults to `torch.uint8`. |
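The saving from `bnb_4bit_use_double_quant` can be sketched with the constants reported in the QLoRA paper (first-level block size 64, second-level block size 256; these are the paper's figures, not values read from the library):

```python
# Per-parameter overhead of the quantization constants (QLoRA paper figures).
block_size = 64  # parameters sharing one fp32 absmax constant

# Single quantization: one fp32 constant per block of 64 parameters.
single = 32 / block_size  # 0.5 bits/param

# Double quantization: the fp32 constants are themselves quantized to 8-bit,
# with one fp32 second-level constant per group of 256 first-level constants.
double = 8 / block_size + 32 / (block_size * 256)

print(f"{single:.3f} bits/param -> {double:.3f} bits/param")  # 0.500 -> 0.127
```

For a 7B-parameter model this overhead reduction amounts to roughly 0.3 GB of GPU memory, which is why double quantization is commonly enabled for QLoRA fine-tuning.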
Outputs
| Output | Type | Description |
|---|---|---|
| Config object | `BitsAndBytesConfig` | A configuration object passed to `from_pretrained` via the `quantization_config` parameter. Controls how `Linear4bit` layers are constructed during model loading. |
Usage Examples
Loading a Model with NF4 + Double Quantization + BFloat16 Compute
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Configure 4-bit quantization
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

# Load model with 4-bit quantization
model_name = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quantization_config,
    device_map="auto",
)

# Run inference
inputs = tokenizer("Hello, world!", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
Minimal FP4 Configuration
```python
from transformers import BitsAndBytesConfig

# FP4 with default settings
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="fp4",
)
```