Implementation: bitsandbytes BitsAndBytesConfig (8-bit)
Metadata
| Field | Value |
|---|---|
| Sources | Repo: bitsandbytes, Doc: HuggingFace Transformers, Paper: LLM.int8() |
| Domains | Quantization, NLP |
| Type | Wrapper Doc (External Library) |
| Last updated | 2026-02-07 14:00 GMT |
Overview
Configuration class provided by the HuggingFace Transformers library for setting 8-bit LLM.int8() quantization parameters.
Description
BitsAndBytesConfig with load_in_8bit=True configures models to use Linear8bitLt layers with INT8 quantization. When passed to a model loading function such as AutoModelForCausalLM.from_pretrained(), all eligible linear layers in the model are replaced with bitsandbytes.nn.Linear8bitLt layers. These layers store weights in INT8 precision and use the LLM.int8() mixed-precision decomposition during inference.
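As background on the INT8 storage, bitsandbytes quantizes weights with absmax scaling: each weight vector is scaled by its absolute maximum into the [-127, 127] int8 range, with the scale kept for dequantization. A minimal pure-Python sketch of that idea (illustrative only, not the library's vectorized implementation):

```python
def absmax_quantize(row):
    """Quantize a list of floats to int8 codes via absmax scaling.

    Returns (codes, scale) such that original value ~= code * scale.
    """
    absmax = max(abs(x) for x in row)
    scale = absmax / 127.0 if absmax > 0 else 1.0
    codes = [max(-127, min(127, round(x / scale))) for x in row]
    return codes, scale

def dequantize(codes, scale):
    """Recover approximate float values from int8 codes."""
    return [c * scale for c in codes]

row = [0.1, -0.5, 0.25, 0.9]
codes, scale = absmax_quantize(row)
approx = dequantize(codes, scale)  # close to row, within quantization error
```

The largest-magnitude entry always maps to +/-127, and the reconstruction error per value is bounded by half a quantization step (scale / 2).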
The configuration object encapsulates three key parameters:
- load_in_8bit: Enables 8-bit quantization mode.
- llm_int8_threshold: Sets the outlier detection threshold for the mixed-precision decomposition.
- llm_int8_has_fp16_weight: Controls whether FP16 weight copies are retained for fine-tuning support.
Code Reference
- Source: External (transformers library)
- Import:
from transformers import BitsAndBytesConfig
- Signature:
transformers.BitsAndBytesConfig(
load_in_8bit: bool = False,
llm_int8_threshold: float = 6.0,
llm_int8_has_fp16_weight: bool = False,
)
I/O Contract
Inputs
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| load_in_8bit | bool | Yes | False | Set to True to enable 8-bit LLM.int8() quantization. |
| llm_int8_threshold | float | No | 6.0 | Outlier detection threshold. Features with magnitudes exceeding this value are computed in FP16. |
| llm_int8_has_fp16_weight | bool | No | False | If True, retains FP16 weight copies for fine-tuning. If False, only INT8 weights are stored (inference-only mode). |
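The llm_int8_threshold parameter drives the mixed-precision decomposition from the LLM.int8() paper: hidden-state feature columns whose magnitude exceeds the threshold take the FP16 path, while the rest take the INT8 path. A simplified sketch of the column-splitting step (an assumed structure for illustration, not the library's kernels):

```python
def split_outliers(hidden_states, threshold=6.0):
    """Partition feature columns into outlier (FP16 path) and
    regular (INT8 path) index sets by max absolute magnitude.

    hidden_states: list of rows, each a list of floats.
    Returns (outlier_cols, regular_cols).
    """
    n_cols = len(hidden_states[0])
    outlier_cols, regular_cols = [], []
    for j in range(n_cols):
        col_max = max(abs(row[j]) for row in hidden_states)
        (outlier_cols if col_max > threshold else regular_cols).append(j)
    return outlier_cols, regular_cols

X = [[0.2, 8.1, -0.3],
     [0.1, -7.5, 0.4]]
outliers, regular = split_outliers(X)  # column 1 exceeds the 6.0 threshold
```

Lowering the threshold routes more columns through FP16, which improves accuracy at the cost of speed and memory; raising it does the opposite.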
Outputs
| Output | Type | Description |
|---|---|---|
| config | BitsAndBytesConfig | Configuration object to pass to model loading functions. |
Usage Examples
Load a model with LLM.int8() quantization:
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
# Configure 8-bit quantization with default settings
quantization_config = BitsAndBytesConfig(
load_in_8bit=True,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
)
# Load model with 8-bit quantization
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-2-7b-hf",
quantization_config=quantization_config,
device_map="auto",
)
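As a rough illustration of the memory saving (weights only, ignoring activations, outlier FP16 columns, and framework overhead), a 7B-parameter model stores about half the bytes in INT8 versus FP16. A back-of-envelope estimate:

```python
def weight_memory_gib(n_params, bytes_per_param):
    """Approximate weight storage in GiB for a given precision."""
    return n_params * bytes_per_param / 1024**3

n = 7_000_000_000          # parameter count of a 7B model
fp16_gib = weight_memory_gib(n, 2)  # roughly 13 GiB at 2 bytes/param
int8_gib = weight_memory_gib(n, 1)  # roughly 6.5 GiB at 1 byte/param
```

The 2x ratio holds for the quantized linear layers; actual measured footprints are somewhat higher because embeddings, norms, and any FP16 outlier computation are not quantized.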
Load a model for 8-bit fine-tuning:
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
# Keep FP16 weights for fine-tuning
quantization_config = BitsAndBytesConfig(
load_in_8bit=True,
llm_int8_has_fp16_weight=True,
)
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-2-7b-hf",
quantization_config=quantization_config,
device_map="auto",
)