Implementation:Huggingface Transformers BitsAndBytesConfig

From Leeroopedia
Revision as of 13:05, 16 February 2026 by Admin (auto-imported from implementations/Huggingface_Transformers_BitsAndBytesConfig.md)
Knowledge Sources
Domains: Model_Optimization, Quantization, Configuration
Last Updated: 2026-02-13 00:00 GMT

Overview

BitsAndBytesConfig is the concrete API in Hugging Face Transformers for configuring bitsandbytes quantization.

Description

BitsAndBytesConfig is a dataclass that wraps all parameters for the bitsandbytes quantization library. It supports two quantization modes: 8-bit (LLM.int8()) and 4-bit (FP4/NF4). In its post_init() method the class validates parameter types and mutual exclusivity constraints (load_in_8bit and load_in_4bit cannot both be enabled). It also supports serialization to and from dictionaries and JSON, and integrates with the AutoHfQuantizer dispatcher, which instantiates the correct quantizer backend.

The class is defined at line 384 of quantization_config.py and inherits from QuantizationConfigMixin. The quant_method field is automatically set to QuantizationMethod.BITS_AND_BYTES.
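
The validation behavior described above can be illustrated with a minimal sketch. ToyQuantConfig and is_valid are hypothetical names invented for this example; this is not the Transformers implementation, and it uses the standard dataclass __post_init__ hook rather than the library's own post_init() method. It only shows the general pattern of type checks followed by a mutual exclusivity check.

```python
from dataclasses import dataclass


@dataclass
class ToyQuantConfig:
    """Hypothetical stand-in for illustration; not the Transformers class."""

    load_in_8bit: bool = False
    load_in_4bit: bool = False
    bnb_4bit_quant_type: str = "fp4"

    def __post_init__(self) -> None:
        # Type checks: reject non-boolean mode flags early.
        for name in ("load_in_8bit", "load_in_4bit"):
            if not isinstance(getattr(self, name), bool):
                raise TypeError(f"{name} must be a boolean")
        if not isinstance(self.bnb_4bit_quant_type, str):
            raise TypeError("bnb_4bit_quant_type must be a string")
        # Mutual exclusivity: only one quantization mode may be active.
        if self.load_in_8bit and self.load_in_4bit:
            raise ValueError("load_in_4bit and load_in_8bit cannot both be True")


def is_valid(**kwargs) -> bool:
    """Return True if the toy config accepts the given arguments."""
    try:
        ToyQuantConfig(**kwargs)
        return True
    except (TypeError, ValueError):
        return False
```

With this sketch, is_valid(load_in_4bit=True) succeeds, while enabling both modes or passing a non-boolean flag is rejected at construction time.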

Usage

Use this API whenever you want to load a model with BitsAndBytes quantization, whether for memory-efficient inference or as a prerequisite for QLoRA fine-tuning.

Code Reference

Source Location

  • Repository: transformers
  • File: src/transformers/utils/quantization_config.py (lines 384-601)

Signature

@dataclass
class BitsAndBytesConfig(QuantizationConfigMixin):
    def __init__(
        self,
        load_in_8bit: bool = False,
        load_in_4bit: bool = False,
        llm_int8_threshold: float = 6.0,
        llm_int8_skip_modules: list[str] | None = None,
        llm_int8_enable_fp32_cpu_offload: bool = False,
        llm_int8_has_fp16_weight: bool = False,
        bnb_4bit_compute_dtype: torch.dtype | str | None = None,
        bnb_4bit_quant_type: str = "fp4",
        bnb_4bit_use_double_quant: bool = False,
        bnb_4bit_quant_storage: torch.dtype | str | None = None,
        **kwargs,
    ): ...

Import

from transformers import BitsAndBytesConfig

I/O Contract

Inputs

| Name | Type | Required | Description |
|---|---|---|---|
| load_in_8bit | bool | No (default: False) | Enable 8-bit quantization with LLM.int8(). |
| load_in_4bit | bool | No (default: False) | Enable 4-bit quantization by replacing Linear layers with FP4/NF4 layers. |
| llm_int8_threshold | float | No (default: 6.0) | Outlier threshold for LLM.int8() mixed-precision decomposition. Hidden-state values above this threshold are treated as outliers and computed in fp16. |
| llm_int8_skip_modules | list[str] or None | No (default: None) | Explicit list of module names to exclude from 8-bit quantization (e.g., ["lm_head"]). |
| llm_int8_enable_fp32_cpu_offload | bool | No (default: False) | Enable splitting the model between GPU (int8) and CPU (fp32). |
| llm_int8_has_fp16_weight | bool | No (default: False) | Use 16-bit main weights for LLM.int8(). Useful for fine-tuning. |
| bnb_4bit_compute_dtype | torch.dtype or str | No (default: None, resolved to torch.float32) | Computation dtype for 4-bit quantized layers. Setting it to torch.bfloat16 typically speeds up computation on hardware that supports it. |
| bnb_4bit_quant_type | str | No (default: "fp4") | Quantization data type: "fp4" or "nf4" (NormalFloat, recommended for QLoRA). |
| bnb_4bit_use_double_quant | bool | No (default: False) | Enable nested quantization of the quantization constants for additional memory savings (~0.4 bits/param). |
| bnb_4bit_quant_storage | torch.dtype or str | No (default: None, resolved to torch.uint8) | Storage dtype for packing 4-bit parameters. Must be one of float16, float32, int8, uint8, float64, bfloat16. |

Outputs

| Name | Type | Description |
|---|---|---|
| config | BitsAndBytesConfig | An instantiated configuration object ready to be passed to from_pretrained(). |
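
As a sketch of how the resulting object is typically consumed, the helper below passes a config to from_pretrained() through the quantization_config argument. The helper name load_quantized_model is hypothetical, the model identifier is supplied by the caller, and the example assumes that torch, bitsandbytes, and accelerate (for device_map="auto") are installed and that a supported accelerator is available.

```python
def load_quantized_model(model_id: str):
    """Load a causal LM with 4-bit NF4 weights (hypothetical helper)."""
    # Imports are local so that defining this helper does not require
    # the heavy dependencies to be installed.
    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    # quantization_config tells from_pretrained() to quantize the
    # weights while they are being loaded.
    return AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=config,
        device_map="auto",  # assumption: accelerate is installed
    )
```

Calling the helper downloads the checkpoint if it is not cached, so it is best tried first with a small model.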

Usage Examples

Basic 4-bit Quantization

from transformers import BitsAndBytesConfig

config = BitsAndBytesConfig(load_in_4bit=True)

QLoRA-optimized Configuration

import torch
from transformers import BitsAndBytesConfig

config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
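
The memory effect of bnb_4bit_use_double_quant can be estimated with a short calculation. The block sizes and bit widths below are assumptions taken from the QLoRA paper (one 32-bit constant per 64 weights, re-quantized to 8 bits with a second-level 32-bit constant per 256 first-level constants); they are not read from this configuration class, and the function names are illustrative.

```python
def constant_overhead_bits(block_size: int = 64, constant_bits: int = 32) -> float:
    """Bits per parameter spent on one quantization constant per block."""
    return constant_bits / block_size


def double_quant_overhead_bits(
    block_size: int = 64,
    first_level_bits: int = 8,
    second_block_size: int = 256,
    second_level_bits: int = 32,
) -> float:
    """Overhead when the per-block constants are themselves quantized."""
    first = first_level_bits / block_size
    second = second_level_bits / (block_size * second_block_size)
    return first + second


plain = constant_overhead_bits()        # 0.5 bits/param
nested = double_quant_overhead_bits()   # about 0.127 bits/param
savings = plain - nested                # about 0.373 bits/param

# For a model with 7e9 parameters this is roughly 0.33 GB saved.
saved_bytes = 7e9 * savings / 8
```

Under these assumptions the saving is about 0.37 bits per parameter, which is consistent with the approximate figure of 0.4 bits/param quoted for this option.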

8-bit Quantization with LLM.int8()

from transformers import BitsAndBytesConfig

config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_threshold=6.0,
    llm_int8_skip_modules=["lm_head"],
)
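
To make the role of llm_int8_threshold more concrete, the following sketch partitions the feature columns of a small hidden-state matrix into outlier and regular columns. It is a simplified, pure-Python illustration of the idea behind the mixed-precision decomposition, not the bitsandbytes kernel; the function name split_outlier_columns and the sample values are made up for the example.

```python
def split_outlier_columns(hidden, threshold=6.0):
    """Partition feature columns by whether any magnitude exceeds the threshold."""
    n_cols = len(hidden[0])
    # A column is an outlier column if any row holds a value whose
    # absolute value is above the threshold.
    outlier_cols = [
        j for j in range(n_cols)
        if any(abs(row[j]) > threshold for row in hidden)
    ]
    regular_cols = [j for j in range(n_cols) if j not in outlier_cols]
    return outlier_cols, regular_cols


# Two tokens (rows) with four feature dimensions (columns).
hidden_states = [
    [0.5, -7.2, 1.1, 0.3],
    [1.9, 0.4, -0.8, 6.5],
]
outliers, regular = split_outlier_columns(hidden_states)
# outliers == [1, 3]: these columns would stay in 16-bit precision.
# regular == [0, 2]: these columns would go through the int8 path.
```

Lowering the threshold moves more columns to the 16-bit path, trading memory and speed for precision.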

Serialization and Deserialization

import torch
from transformers import BitsAndBytesConfig

config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Serialize to dictionary
config_dict = config.to_dict()

# Reconstruct from dictionary
restored_config = BitsAndBytesConfig.from_dict(config_dict)

Related Pages

Implements Principle

Requires Environment
