
Implementation:Bitsandbytes foundation Bitsandbytes BitsAndBytesConfig 4bit

From Leeroopedia


Metadata

Field Value
Page Type Implementation (Wrapper Doc)
Knowledge Sources Repo (bitsandbytes), Doc (HuggingFace Transformers), Paper (QLoRA)
Domains Quantization, NLP
Last Updated 2026-02-07 14:00 GMT

Overview

A configuration class provided by the HuggingFace Transformers library for specifying 4-bit quantization parameters.

Description

BitsAndBytesConfig is a configuration class in the HuggingFace Transformers library that acts as the primary user-facing interface for specifying 4-bit quantization settings. It does not perform quantization itself; instead, it packages the quantization parameters and passes them to the bitsandbytes library during model loading.

When load_in_4bit=True is set, the Transformers model loading pipeline (e.g., AutoModelForCausalLM.from_pretrained) uses this configuration to:

  1. Replace standard nn.Linear layers with bitsandbytes Linear4bit layers.
  2. Configure each Linear4bit layer with the specified quantization type, compute dtype, and double quantization settings.
  3. Trigger lazy quantization when the model weights are transferred to the GPU device.

The configuration object bridges the gap between the high-level Transformers API and the low-level bitsandbytes quantization primitives, allowing users to control quantization behavior through a single, declarative interface.
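As a back-of-envelope illustration of why this matters, the sketch below compares the weight-storage footprint of a hypothetical 7B-parameter model in fp16 versus 4-bit. The figures are approximate and cover weights only; activations, the KV cache, and quantization constants are ignored.

```python
# Rough weight-memory estimate for a hypothetical 7B-parameter model.
# Weights only; activations, KV cache, and quantization constants are ignored.
params = 7e9

fp16_gb = params * 2 / 1e9    # 2 bytes per parameter -> 14.0 GB
nf4_gb = params * 0.5 / 1e9   # 4 bits = 0.5 bytes per parameter -> 3.5 GB

print(f"fp16: {fp16_gb:.1f} GB, 4-bit: {nf4_gb:.1f} GB")
```

This is why a 7B model that would not fit on a 16 GB consumer GPU in fp16 typically loads comfortably in 4-bit.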

Usage

Import BitsAndBytesConfig when you need to load a pretrained model in 4-bit precision for memory-efficient inference or QLoRA fine-tuning. It is typically passed as the quantization_config argument to from_pretrained.

Code Reference

Source Location

External (HuggingFace Transformers library, transformers.utils.quantization_config).

Signature

transformers.BitsAndBytesConfig(
    load_in_4bit: bool = False,
    bnb_4bit_quant_type: str = "fp4",
    bnb_4bit_compute_dtype: Optional[torch.dtype] = None,
    bnb_4bit_use_double_quant: bool = False,
    bnb_4bit_quant_storage: torch.dtype = torch.uint8,
)

Import

from transformers import BitsAndBytesConfig

I/O Contract

Inputs

Parameter Type Required Description
load_in_4bit bool Yes Enables 4-bit quantization when set to True. Must be True for 4-bit inference.
bnb_4bit_quant_type str No The quantization data type. Either "nf4" (NormalFloat4, recommended) or "fp4" (4-bit floating point). Defaults to "fp4".
bnb_4bit_compute_dtype torch.dtype No The dtype used for computation during the forward pass. Common values: torch.bfloat16, torch.float16, torch.float32. Defaults to None, in which case torch.float32 is used.
bnb_4bit_use_double_quant bool No Whether to apply double quantization (quantize the quantization constants). Reduces memory overhead at negligible accuracy cost. Defaults to False.
bnb_4bit_quant_storage torch.dtype No The dtype of the tensor used to physically store the packed 4-bit values. Defaults to torch.uint8.
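To make the double-quantization trade-off concrete, the arithmetic below reproduces the per-parameter overhead figures from the QLoRA paper, which uses a block size of 64 for the 4-bit weights and 256 for the second-level quantization. These block sizes are the paper's defaults, not values exposed by this config.

```python
# Overhead of storing quantization constants, in bits per model parameter.

# Without double quantization: one fp32 (32-bit) absmax per 64-weight block.
plain = 32 / 64                    # 0.5 bits per parameter

# With double quantization: absmax values are themselves quantized to 8 bits,
# with one fp32 second-level constant per block of 256 absmax values.
double = 8 / 64 + 32 / (64 * 256)  # ~0.127 bits per parameter

print(f"saving: {plain - double:.2f} bits per parameter")  # ~0.37
```

Roughly 0.37 bits per parameter saved, which is about 0.3 GB on a 7B model, at negligible accuracy cost per the paper.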

Outputs

Output Type Description
Config object BitsAndBytesConfig A configuration object passed to from_pretrained via the quantization_config parameter. Controls how Linear4bit layers are constructed during model loading.

Usage Examples

Loading a Model with NF4 + Double Quantization + BFloat16 Compute

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Configure 4-bit quantization
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

# Load model with 4-bit quantization
model_name = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quantization_config,
    device_map="auto",
)

# Run inference
inputs = tokenizer("Hello, world!", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Minimal FP4 Configuration

from transformers import BitsAndBytesConfig

# FP4 with default settings
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="fp4",
)
