Implementation: bitsandbytes BitsAndBytesConfig (8-bit)
Metadata
| Field | Value |
|---|---|
| Sources | Repo: bitsandbytes, Doc: HuggingFace Transformers, Paper: LLM.int8() |
| Domains | Quantization, NLP |
| Type | Wrapper Doc (External Library) |
| Last updated | 2026-02-07 14:00 GMT |
Overview
Configuration class provided by the HuggingFace Transformers library for setting 8-bit LLM.int8() quantization parameters.
Description
BitsAndBytesConfig with load_in_8bit=True configures models to use Linear8bitLt layers with INT8 quantization. When passed to a model loading function such as AutoModelForCausalLM.from_pretrained(), all eligible linear layers in the model are replaced with bitsandbytes.nn.Linear8bitLt layers. These layers store weights in INT8 precision and use the LLM.int8() mixed-precision decomposition during inference.
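As background on the INT8 storage, bitsandbytes quantizes weights with absmax scaling: each weight vector is scaled by its absolute maximum into the [-127, 127] int8 range, with the scale kept for dequantization. A minimal pure-Python sketch of that idea (illustrative only, not the library's vectorized implementation):

```python
def absmax_quantize(row):
    """Quantize a list of floats to int8 codes via absmax scaling.

    Returns (codes, scale) such that original value ~= code * scale.
    """
    absmax = max(abs(x) for x in row)
    scale = absmax / 127.0 if absmax > 0 else 1.0
    codes = [max(-127, min(127, round(x / scale))) for x in row]
    return codes, scale

def dequantize(codes, scale):
    """Recover approximate float values from int8 codes."""
    return [c * scale for c in codes]

row = [0.1, -0.5, 0.25, 0.9]
codes, scale = absmax_quantize(row)
approx = dequantize(codes, scale)  # close to row, within quantization error
```

The largest-magnitude entry always maps to +/-127, and the reconstruction error per value is bounded by half a quantization step (scale / 2).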
The configuration object encapsulates three key parameters:
- load_in_8bit: Enables 8-bit quantization mode.
- llm_int8_threshold: Sets the outlier detection threshold for the mixed-precision decomposition.
- llm_int8_has_fp16_weight: Controls whether FP16 weight copies are retained for fine-tuning support.
Code Reference
- Source: External (transformers library)
- Import:
from transformers import BitsAndBytesConfig
- Signature:
transformers.BitsAndBytesConfig(
load_in_8bit: bool = False,
llm_int8_threshold: float = 6.0,
llm_int8_has_fp16_weight: bool = False,
)
I/O Contract
Inputs
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| load_in_8bit | bool | Yes | False | Set to True to enable 8-bit LLM.int8() quantization. |
| llm_int8_threshold | float | No | 6.0 | Outlier detection threshold. Features with magnitudes exceeding this value are computed in FP16. |
| llm_int8_has_fp16_weight | bool | No | False | If True, retains FP16 weight copies for fine-tuning. If False, only INT8 weights are stored (inference-only mode). |
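The llm_int8_threshold parameter drives the mixed-precision decomposition from the LLM.int8() paper: hidden-state feature columns whose magnitude exceeds the threshold take the FP16 path, while the rest take the INT8 path. A simplified sketch of the column-splitting step (an assumed structure for illustration, not the library's kernels):

```python
def split_outliers(hidden_states, threshold=6.0):
    """Partition feature columns into outlier (FP16 path) and
    regular (INT8 path) index sets by max absolute magnitude.

    hidden_states: list of rows, each a list of floats.
    Returns (outlier_cols, regular_cols).
    """
    n_cols = len(hidden_states[0])
    outlier_cols, regular_cols = [], []
    for j in range(n_cols):
        col_max = max(abs(row[j]) for row in hidden_states)
        (outlier_cols if col_max > threshold else regular_cols).append(j)
    return outlier_cols, regular_cols

X = [[0.2, 8.1, -0.3],
     [0.1, -7.5, 0.4]]
outliers, regular = split_outliers(X)  # column 1 exceeds the 6.0 threshold
```

Lowering the threshold routes more columns through FP16, which improves accuracy at the cost of speed and memory; raising it does the opposite.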
Outputs
| Output | Type | Description |
|---|---|---|
| config | BitsAndBytesConfig | Configuration object to pass to model loading functions. |
Usage Examples
Load a model with LLM.int8() quantization:
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
# Configure 8-bit quantization with default settings
quantization_config = BitsAndBytesConfig(
load_in_8bit=True,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
)
# Load model with 8-bit quantization
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-2-7b-hf",
quantization_config=quantization_config,
device_map="auto",
)
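As a rough illustration of the memory saving (weights only, ignoring activations, outlier FP16 columns, and framework overhead), a 7B-parameter model stores about half the bytes in INT8 versus FP16. A back-of-envelope estimate:

```python
def weight_memory_gib(n_params, bytes_per_param):
    """Approximate weight storage in GiB for a given precision."""
    return n_params * bytes_per_param / 1024**3

n = 7_000_000_000          # parameter count of a 7B model
fp16_gib = weight_memory_gib(n, 2)  # roughly 13 GiB at 2 bytes/param
int8_gib = weight_memory_gib(n, 1)  # roughly 6.5 GiB at 1 byte/param
```

The 2x ratio holds for the quantized linear layers; actual measured footprints are somewhat higher because embeddings, norms, and any FP16 outlier computation are not quantized.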
Load a model for 8-bit fine-tuning:
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
# Keep FP16 weights for fine-tuning
quantization_config = BitsAndBytesConfig(
load_in_8bit=True,
llm_int8_has_fp16_weight=True,
)
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-2-7b-hf",
quantization_config=quantization_config,
device_map="auto",
)