Implementation:Huggingface Transformers BitsAndBytesConfig

Knowledge Sources	Transformers BitsAndBytes Integration
Domains	Model_Optimization, Quantization, Configuration
Last Updated	2026-02-13 00:00 GMT

Overview

Concrete API, provided by Hugging Face Transformers, for configuring BitsAndBytes quantization.

Description

BitsAndBytesConfig is a dataclass that wraps the parameters Transformers passes to the bitsandbytes quantization library. It supports two quantization modes: 8-bit (LLM.int8()) and 4-bit (FP4/NF4). The class validates parameter types in its post_init() method, enforces that the two modes are mutually exclusive, supports serialization to and from dictionaries and JSON, and integrates with the AutoHfQuantizer dispatcher to instantiate the correct quantizer backend.

The class is defined at line 384 of quantization_config.py and inherits from QuantizationConfigMixin. The quant_method field is automatically set to QuantizationMethod.BITS_AND_BYTES.
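The constraints described above can also be checked before a config is ever constructed. The following is a minimal sketch of such a pre-flight check in plain Python. `check_bnb_kwargs` is a hypothetical helper, not part of Transformers, and it only mirrors constraints stated on this page: the two modes are mutually exclusive, the quant type is `"fp4"` or `"nf4"`, and a string storage dtype must come from the documented list. The library's own checks may differ in detail and in the exception raised.

```python
ALLOWED_QUANT_TYPES = {"fp4", "nf4"}
ALLOWED_STORAGE_DTYPES = {"float16", "float32", "int8", "uint8", "float64", "bfloat16"}


def check_bnb_kwargs(kwargs: dict) -> list[str]:
    """Return a list of problems found in keyword arguments meant for BitsAndBytesConfig.

    Hypothetical helper: it mirrors constraints stated on this page and
    does not call the library.
    """
    problems = []
    # The 4-bit and 8-bit modes cannot be enabled together.
    if kwargs.get("load_in_4bit", False) and kwargs.get("load_in_8bit", False):
        problems.append("load_in_4bit and load_in_8bit cannot both be True")
    # The quant type defaults to "fp4" when omitted.
    quant_type = kwargs.get("bnb_4bit_quant_type", "fp4")
    if quant_type not in ALLOWED_QUANT_TYPES:
        problems.append(f"unknown bnb_4bit_quant_type: {quant_type!r}")
    # Only string values are checked here; torch.dtype objects are left to the library.
    storage = kwargs.get("bnb_4bit_quant_storage")
    if isinstance(storage, str) and storage not in ALLOWED_STORAGE_DTYPES:
        problems.append(f"unsupported bnb_4bit_quant_storage: {storage!r}")
    return problems
```

An empty list means the arguments pass these particular checks; it does not guarantee that the library will accept them.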

Usage

Use this API whenever you want to load a model with BitsAndBytes quantization, whether for memory-efficient inference or as a prerequisite for QLoRA fine-tuning.
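As a sketch of that workflow, the config is passed to `from_pretrained()` through the `quantization_config` argument. The function below assumes that `transformers`, `torch`, `bitsandbytes`, and `accelerate` are installed and that a supported GPU is available. `model_id` is a placeholder for whichever checkpoint you use, and `load_quantized_model` and `QLORA_STYLE_KWARGS` are illustrative names, not library API. Imports are deferred into the function so the settings can be inspected without the heavy dependencies.

```python
# Illustrative settings. The compute dtype is given as a string, which the
# config accepts in place of a torch.dtype.
QLORA_STYLE_KWARGS = {
    "load_in_4bit": True,
    "bnb_4bit_quant_type": "nf4",
    "bnb_4bit_compute_dtype": "bfloat16",
    "bnb_4bit_use_double_quant": True,
}


def load_quantized_model(model_id: str):
    """Load a causal LM with 4-bit weights (sketch; needs a GPU and the packages named above)."""
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    config = BitsAndBytesConfig(**QLORA_STYLE_KWARGS)
    # device_map="auto" lets accelerate place the quantized weights on available devices.
    return AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=config,
        device_map="auto",
    )
```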

Code Reference

Source Location

Repository: transformers
File: src/transformers/utils/quantization_config.py (lines 384-601)

Signature

@dataclass
class BitsAndBytesConfig(QuantizationConfigMixin):
    def __init__(
        self,
        load_in_8bit: bool = False,
        load_in_4bit: bool = False,
        llm_int8_threshold: float = 6.0,
        llm_int8_skip_modules: list[str] | None = None,
        llm_int8_enable_fp32_cpu_offload: bool = False,
        llm_int8_has_fp16_weight: bool = False,
        bnb_4bit_compute_dtype: torch.dtype | str | None = None,
        bnb_4bit_quant_type: str = "fp4",
        bnb_4bit_use_double_quant: bool = False,
        bnb_4bit_quant_storage: torch.dtype | str | None = None,
        **kwargs,
    ): ...

Import

from transformers import BitsAndBytesConfig

I/O Contract

Inputs

Name	Type	Required	Description
load_in_8bit	`bool`	No (default: False)	Enable 8-bit quantization with LLM.int8().
load_in_4bit	`bool`	No (default: False)	Enable 4-bit quantization by replacing Linear layers with FP4/NF4 layers.
llm_int8_threshold	`float`	No (default: 6.0)	Outlier threshold for LLM.int8() mixed-precision decomposition. Hidden-state values whose magnitude exceeds this threshold are treated as outliers and computed in fp16; the rest are computed in int8.
llm_int8_skip_modules	`list[str]` or `None`	No (default: None)	Explicit list of module names to exclude from 8-bit quantization (e.g., `["lm_head"]`).
llm_int8_enable_fp32_cpu_offload	`bool`	No (default: False)	Enable splitting the model between GPU (int8) and CPU (fp32).
llm_int8_has_fp16_weight	`bool`	No (default: False)	Keep 16-bit main weights for LLM.int8(). Useful for fine-tuning, since the weights need not be converted back and forth for the backward pass.
bnb_4bit_compute_dtype	`torch.dtype`, `str`, or `None`	No (default: None, resolved to torch.float32)	Computation dtype for 4-bit quantized layers. Setting it to `torch.bfloat16` can speed up computation on hardware that supports it.
bnb_4bit_quant_type	`str`	No (default: "fp4")	Quantization data type: `"fp4"` or `"nf4"` (NormalFloat, recommended for QLoRA).
bnb_4bit_use_double_quant	`bool`	No (default: False)	Enable nested quantization of the quantization constants for additional memory savings (~0.4 bits/param).
bnb_4bit_quant_storage	`torch.dtype`, `str`, or `None`	No (default: None, resolved to torch.uint8)	Storage dtype for packing 4-bit parameters. Must be one of float16, float32, int8, uint8, float64, bfloat16.
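
To make the double-quantization saving concrete, here is a back-of-the-envelope calculation. It assumes the block sizes reported in the QLoRA paper, which this page does not state: weights are quantized in blocks of 64 with one 32-bit constant per block, and with double quantization those constants are stored in 8 bits, with one 32-bit second-level constant per 256 first-level constants. The function name is illustrative.

```python
def quant_constant_overhead_bits(
    double_quant: bool,
    block_size: int = 64,
    second_block_size: int = 256,
) -> float:
    """Bits of quantization-constant overhead per model parameter.

    The default block sizes are assumptions taken from the QLoRA paper.
    """
    if not double_quant:
        # One 32-bit constant per block of weights.
        return 32 / block_size
    # One 8-bit constant per block, plus one 32-bit constant per second-level block.
    return 8 / block_size + 32 / (block_size * second_block_size)


plain = quant_constant_overhead_bits(double_quant=False)
nested = quant_constant_overhead_bits(double_quant=True)
saving = plain - nested
print(f"{plain:.3f} -> {nested:.3f} bits/param, saving {saving:.3f}")
```

Under these assumptions the overhead falls from 0.5 to about 0.127 bits per parameter, a saving of roughly 0.37 bits, which is consistent with the approximate figure given for `bnb_4bit_use_double_quant`.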

Outputs

Name	Type	Description
config	`BitsAndBytesConfig`	An instantiated configuration object ready to be passed to `from_pretrained()`.

Usage Examples

Basic 4-bit Quantization

from transformers import BitsAndBytesConfig

# All other parameters keep their defaults: "fp4" quant type, float32 compute dtype.
config = BitsAndBytesConfig(load_in_4bit=True)

QLoRA-optimized Configuration

import torch
from transformers import BitsAndBytesConfig

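config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",  # NormalFloat, recommended for QLoRA
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16 instead of the float32 default
    bnb_4bit_use_double_quant=True,  # also quantize the quantization constants
)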

8-bit Quantization with LLM.int8()

from transformers import BitsAndBytesConfig

config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_threshold=6.0,  # the default, shown explicitly
    llm_int8_skip_modules=["lm_head"],  # leave the output head unquantized
)
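
To illustrate what `llm_int8_threshold` controls, the sketch below splits the feature columns of a small hidden-state matrix into outlier columns (any entry with magnitude above the threshold) and regular columns. In LLM.int8(), the outlier part is multiplied in fp16 and the rest in int8. This is a simplified pure-Python illustration, not the bitsandbytes implementation, and `split_outlier_columns` is a hypothetical name.

```python
def split_outlier_columns(
    hidden: list[list[float]],
    threshold: float = 6.0,
) -> tuple[list[int], list[int]]:
    """Partition column indices into (outlier, regular) by the magnitude threshold."""
    n_cols = len(hidden[0])
    # A column is an outlier column if any of its entries exceeds the threshold in magnitude.
    outlier = [
        j for j in range(n_cols)
        if any(abs(row[j]) > threshold for row in hidden)
    ]
    outlier_set = set(outlier)
    regular = [j for j in range(n_cols) if j not in outlier_set]
    return outlier, regular


# Two tokens, three feature dimensions; only column 1 contains a large value.
hidden_states = [
    [0.5, -7.2, 1.0],
    [2.0, 0.3, -1.5],
]
outlier_cols, regular_cols = split_outlier_columns(hidden_states)
```

Here column 1 would take the fp16 path and columns 0 and 2 the int8 path. Raising the threshold sends fewer columns to fp16.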

Serialization and Deserialization

import torch
from transformers import BitsAndBytesConfig

config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Serialize to dictionary
config_dict = config.to_dict()

# Reconstruct from dictionary
restored_config = BitsAndBytesConfig.from_dict(config_dict)

Related Pages

Implements Principle

Principle:Huggingface_Transformers_Quantization_Configuration

Requires Environment

Environment:Huggingface_Transformers_BitsAndBytes_Quantization_Env
