Principle: Bitsandbytes Quantization (LLMBook-zh.github.io)
| Knowledge Sources | |
|---|---|
| Domains | Deep_Learning, Model_Compression, Inference |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
A library-based quantization technique that loads pre-trained models directly in 8-bit or 4-bit precision using the bitsandbytes library.
Description
Bitsandbytes Quantization provides a seamless way to load large language models with reduced precision through HuggingFace's integration with the bitsandbytes library. The 8-bit mode uses LLM.int8() with mixed-precision decomposition (keeping outlier features in fp16), while the 4-bit mode uses NormalFloat4 (NF4) quantization from the QLoRA paper.
The key advantage is simplicity: quantization happens automatically at load time with a single flag (load_in_8bit or load_in_4bit).
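A minimal loading sketch using the HuggingFace transformers integration. The checkpoint name is illustrative, and running it requires a CUDA GPU with bitsandbytes installed; the `BitsAndBytesConfig` fields shown are the current API that supersedes passing `load_in_8bit`/`load_in_4bit` directly to `from_pretrained`:

```python
# Sketch: load a causal LM in 4-bit NF4 via transformers + bitsandbytes.
# Checkpoint name is illustrative; requires a CUDA GPU.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # or load_in_8bit=True for LLM.int8()
    bnb_4bit_quant_type="nf4",             # NormalFloat4 from the QLoRA paper
    bnb_4bit_use_double_quant=True,        # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16, # dtype used for matmuls at runtime
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",  # illustrative checkpoint
    quantization_config=bnb_config,
    device_map="auto",           # place layers across available devices
)
```

No retraining or calibration pass is needed; the weights are quantized on the fly as each shard is loaded.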
Usage
Use this when loading a large model that does not fit in GPU memory at full precision. Relative to fp16, 8-bit mode roughly halves weight memory, while 4-bit mode cuts it by about 4x, typically with minimal quality loss.
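The memory savings are straightforward to estimate from parameter count alone (weights only; activations, the KV cache, and per-block quantization constants add overhead on top):

```python
# Back-of-envelope memory for model weights at a given precision.
def weight_memory_gb(n_params: float, bits_per_weight: int) -> float:
    return n_params * bits_per_weight / 8 / 1e9  # bits -> bytes -> GB

n = 7e9  # e.g. a 7B-parameter model
fp16 = weight_memory_gb(n, 16)  # 14.0 GB
int8 = weight_memory_gb(n, 8)   #  7.0 GB
nf4 = weight_memory_gb(n, 4)    #  3.5 GB
```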
Theoretical Basis
LLM.int8() (8-bit):
- Decompose the matrix multiplication by feature dimension: the few hidden dimensions with activation outliers are separated from the rest.
- Keep the outlier dimensions in fp16; quantize the remaining dimensions to int8 with vector-wise absmax scaling.
- Multiply separately and sum.
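The three steps above can be simulated in NumPy. This is a simplified sketch, not the bitsandbytes kernel: the outlier criterion (any activation exceeding a magnitude threshold in that column) and the per-row/per-column absmax scaling follow the LLM.int8() idea, but all names here are illustrative:

```python
import numpy as np

def int8_matmul_with_outliers(X, W, threshold=6.0):
    """Approximate X @ W via LLM.int8()-style mixed-precision decomposition."""
    # 1. Split feature dimensions: columns of X containing outlier activations.
    outlier_cols = np.any(np.abs(X) > threshold, axis=0)
    # 2a. Outlier path: keep in full precision (stands in for fp16).
    Y_out = X[:, outlier_cols] @ W[outlier_cols, :]
    # 2b. Non-outlier path: vector-wise absmax quantization to int8.
    Xs, Ws = X[:, ~outlier_cols], W[~outlier_cols, :]
    sx = np.abs(Xs).max(axis=1, keepdims=True) / 127.0 + 1e-12  # per row of X
    sw = np.abs(Ws).max(axis=0, keepdims=True) / 127.0 + 1e-12  # per column of W
    Xq = np.round(Xs / sx).astype(np.int8)
    Wq = np.round(Ws / sw).astype(np.int8)
    # 3. Integer matmul, rescale, and sum the two paths.
    Y_int = (Xq.astype(np.int32) @ Wq.astype(np.int32)) * (sx * sw)
    return Y_out + Y_int
```

Because the handful of outlier dimensions dominate quantization error, removing them from the int8 path is what keeps large-model accuracy close to fp16.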
NF4 (4-bit):
- Assumes weights follow a normal distribution.
- Maps normalized weights to 16 levels placed at quantiles of the standard normal distribution, which is information-theoretically optimal when weights are normally distributed.
- Uses double quantization for additional compression.