Implementation:Huggingface Transformers BitsAndBytesConfig
| Knowledge Sources | |
|---|---|
| Domains | Model_Optimization, Quantization, Configuration |
| Last Updated | 2026-02-13 00:00 GMT |
Overview
Concrete API for configuring BitsAndBytes quantization, provided by Hugging Face Transformers.
Description
BitsAndBytesConfig is a dataclass that wraps all parameters for the bitsandbytes quantization library. It supports two quantization modes: 8-bit (LLM.int8()) and 4-bit (FP4/NF4). The class validates parameter types and mutual exclusivity constraints in its post_init() method, supports serialization to and from dictionaries and JSON, and integrates with the AutoHfQuantizer dispatcher to instantiate the correct quantizer backend.
The class is defined at line 384 of quantization_config.py and inherits from QuantizationConfigMixin. The quant_method field is automatically set to QuantizationMethod.BITS_AND_BYTES.
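The validation behavior described above can be illustrated with a simplified stand-in. The class below is hypothetical (`QuantFlagsSketch` and `try_build` are not part of Transformers) and only mimics the two kinds of checks mentioned: type validation and the rule that the 8-bit and 4-bit modes cannot both be enabled. The exact checks and error messages in the library may differ.

```python
class QuantFlagsSketch:
    """Hypothetical stand-in, not the Transformers implementation."""

    def __init__(self, load_in_8bit=False, load_in_4bit=False, llm_int8_threshold=6.0):
        self.load_in_8bit = load_in_8bit
        self.load_in_4bit = load_in_4bit
        self.llm_int8_threshold = llm_int8_threshold
        self.post_init()

    def post_init(self):
        # Type checks: the flags must be real booleans, the threshold a float.
        for name in ("load_in_8bit", "load_in_4bit"):
            if not isinstance(getattr(self, name), bool):
                raise TypeError(f"{name} must be a boolean")
        if not isinstance(self.llm_int8_threshold, float):
            raise TypeError("llm_int8_threshold must be a float")
        # Mutual exclusivity: only one quantization mode at a time.
        if self.load_in_8bit and self.load_in_4bit:
            raise ValueError("load_in_8bit and load_in_4bit cannot both be True")


def try_build(**kwargs):
    """Return the name of the raised exception, or "ok" if validation passes."""
    try:
        QuantFlagsSketch(**kwargs)
    except (TypeError, ValueError) as err:
        return type(err).__name__
    return "ok"


results = {
    "4bit_only": try_build(load_in_4bit=True),
    "both_modes": try_build(load_in_8bit=True, load_in_4bit=True),
    "wrong_type": try_build(load_in_4bit="yes"),
}
print(results)
```

Running the checks at construction time means a bad configuration fails before any model weights are loaded.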
Usage
Use this API whenever you want to load a model with BitsAndBytes quantization, whether for memory-efficient inference or as a prerequisite for QLoRA fine-tuning.
Code Reference
Source Location
- Repository: transformers
- File: src/transformers/utils/quantization_config.py (lines 384-601)
Signature
@dataclass
class BitsAndBytesConfig(QuantizationConfigMixin):
    def __init__(
        self,
        load_in_8bit: bool = False,
        load_in_4bit: bool = False,
        llm_int8_threshold: float = 6.0,
        llm_int8_skip_modules: list[str] | None = None,
        llm_int8_enable_fp32_cpu_offload: bool = False,
        llm_int8_has_fp16_weight: bool = False,
        bnb_4bit_compute_dtype: torch.dtype | str | None = None,
        bnb_4bit_quant_type: str = "fp4",
        bnb_4bit_use_double_quant: bool = False,
        bnb_4bit_quant_storage: torch.dtype | str | None = None,
        **kwargs,
    ): ...
Import
from transformers import BitsAndBytesConfig
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| load_in_8bit | bool | No (default: False) | Enable 8-bit quantization with LLM.int8(). |
| load_in_4bit | bool | No (default: False) | Enable 4-bit quantization by replacing Linear layers with FP4/NF4 layers. |
| llm_int8_threshold | float | No (default: 6.0) | Outlier threshold for LLM.int8() mixed-precision decomposition. Hidden states above this value are computed in fp16. |
| llm_int8_skip_modules | list[str] or None | No | Explicit list of module names to exclude from 8-bit quantization (e.g., ["lm_head"]). |
| llm_int8_enable_fp32_cpu_offload | bool | No (default: False) | Enable splitting the model between GPU (int8) and CPU (fp32). |
| llm_int8_has_fp16_weight | bool | No (default: False) | Use 16-bit main weights for LLM.int8(). Useful for fine-tuning. |
| bnb_4bit_compute_dtype | torch.dtype or str | No (default: torch.float32) | Computation dtype for 4-bit quantized layers. Set to torch.bfloat16 for faster inference. |
| bnb_4bit_quant_type | str | No (default: "fp4") | Quantization data type: "fp4" or "nf4" (NormalFloat, recommended for QLoRA). |
| bnb_4bit_use_double_quant | bool | No (default: False) | Enable nested quantization of the quantization constants for additional memory savings (~0.4 bits/param). |
| bnb_4bit_quant_storage | torch.dtype or str | No (default: torch.uint8) | Storage dtype for packing 4-bit parameters. Must be one of float16, float32, int8, uint8, float64, bfloat16. |
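To make the idea of a storage dtype for packed 4-bit parameters more concrete, here is a minimal pure-Python sketch of packing two 4-bit codes into one 8-bit byte. The helper names `pack_nibbles` and `unpack_nibbles` are hypothetical, and the nibble order (first value in the high bits) is an assumption made for illustration; the actual layout used by the bitsandbytes kernels is not described on this page.

```python
def pack_nibbles(codes):
    """Pack a list of 4-bit codes (0..15) into bytes, two codes per byte."""
    if any(not 0 <= c <= 15 for c in codes):
        raise ValueError("every code must fit in 4 bits (0..15)")
    padded = list(codes)
    if len(padded) % 2:
        padded.append(0)  # pad with a zero code so the length is even
    # Assumption: the first code of each pair goes into the high nibble.
    return bytes((hi << 4) | lo for hi, lo in zip(padded[0::2], padded[1::2]))


def unpack_nibbles(packed, count):
    """Recover the first `count` 4-bit codes from packed bytes."""
    codes = []
    for byte in packed:
        codes.append(byte >> 4)    # high nibble
        codes.append(byte & 0x0F)  # low nibble
    return codes[:count]


codes = [3, 15, 0, 9, 7]
packed = pack_nibbles(codes)
restored = unpack_nibbles(packed, len(codes))
print(len(codes), "codes stored in", len(packed), "bytes")
```

With an 8-bit container, five 4-bit codes need three bytes, which is where the roughly 2x saving over 8-bit storage comes from.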
Outputs
| Name | Type | Description |
|---|---|---|
| config | BitsAndBytesConfig | An instantiated configuration object ready to be passed to from_pretrained(). |
Usage Examples
Basic 4-bit Quantization
from transformers import BitsAndBytesConfig
config = BitsAndBytesConfig(load_in_4bit=True)
QLoRA-optimized Configuration
import torch
from transformers import BitsAndBytesConfig
config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
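The "~0.4 bits/param" saving attributed to bnb_4bit_use_double_quant can be reproduced with back-of-the-envelope arithmetic. The sketch below assumes the block sizes reported in the QLoRA paper (one 32-bit constant per 64 weights; with nesting, 8-bit constants plus one 32-bit constant per 256 first-level constants). These numbers are assumptions brought in for illustration and are not stated on this page; the function names are hypothetical.

```python
def constant_overhead_bits(block_size=64, const_bits=32):
    """Per-parameter overhead of storing one constant per block of weights."""
    return const_bits / block_size


def double_quant_overhead_bits(block_size=64, inner_bits=8,
                               outer_block=256, outer_bits=32):
    """Per-parameter overhead when the constants are themselves quantized."""
    first_level = inner_bits / block_size                    # 8-bit constants
    second_level = outer_bits / (block_size * outer_block)   # constants of constants
    return first_level + second_level


plain = constant_overhead_bits()
nested = double_quant_overhead_bits()
savings = plain - nested
print(f"plain: {plain:.3f}, nested: {nested:.3f}, savings: {savings:.3f} bits/param")
```

Under these assumptions the saving comes out to about 0.37 bits per parameter, consistent with the rounded figure in the Inputs table.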
8-bit Quantization with LLM.int8()
from transformers import BitsAndBytesConfig
config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_threshold=6.0,
    llm_int8_skip_modules=["lm_head"],
)
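As a rough illustration of what llm_int8_threshold controls, the sketch below splits the feature columns of a small hidden-state matrix into outlier columns (which would take the fp16 path) and regular columns (which would be quantized to int8). This is a simplified pure-Python model of the idea, not the bitsandbytes kernel; the function name `split_outlier_columns` and the sample values are hypothetical.

```python
def split_outlier_columns(hidden, threshold=6.0):
    """Return (outlier_cols, regular_cols) for a row-major list of lists.

    A column counts as an outlier if any entry exceeds the threshold in magnitude.
    """
    n_cols = len(hidden[0])
    outlier_cols = [
        j for j in range(n_cols)
        if any(abs(row[j]) > threshold for row in hidden)
    ]
    regular_cols = [j for j in range(n_cols) if j not in outlier_cols]
    return outlier_cols, regular_cols


# Two tokens, three feature dimensions; only column 1 contains a large value.
hidden = [
    [0.5, -7.2, 1.0],
    [0.1, 0.3, -2.0],
]
outlier_cols, regular_cols = split_outlier_columns(hidden, threshold=6.0)
print("fp16 path:", outlier_cols, "int8 path:", regular_cols)
```

In this simplified model, lowering the threshold routes more columns through the higher-precision path and raising it routes fewer.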
Serialization and Deserialization
import torch
from transformers import BitsAndBytesConfig
config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
# Serialize to dictionary
config_dict = config.to_dict()
# Reconstruct from dictionary
restored_config = BitsAndBytesConfig.from_dict(config_dict)