
Principle: Bitsandbytes Quantization (source: LLMBook-zh.github.io)

From Leeroopedia


Knowledge Sources
Domains: Deep_Learning, Model_Compression, Inference
Last Updated: 2026-02-08 00:00 GMT

Overview

A library-based quantization technique that loads pre-trained models directly in 8-bit or 4-bit precision using the bitsandbytes library.

Description

Bitsandbytes Quantization provides a seamless way to load large language models with reduced precision through HuggingFace's integration with the bitsandbytes library. The 8-bit mode uses LLM.int8() with mixed-precision decomposition (keeping outlier features in fp16), while the 4-bit mode uses NormalFloat4 (NF4) quantization from the QLoRA paper.

The key advantage is simplicity: quantization happens automatically at load time with a single flag (load_in_8bit or load_in_4bit, set directly or via a BitsAndBytesConfig in recent versions of transformers). No calibration data or separate conversion step is required.
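A minimal sketch of both loading modes, assuming the transformers, accelerate, and bitsandbytes packages and a CUDA GPU; the model name is illustrative only:

```python
# Sketch: load-time quantization via HuggingFace transformers + bitsandbytes.
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 8-bit: LLM.int8() with mixed-precision outlier decomposition.
model_8bit = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-1.3b",  # illustrative model name
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)

# 4-bit: NF4 quantization from the QLoRA paper, with double quantization.
model_4bit = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-1.3b",
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_use_double_quant=True,
    ),
    device_map="auto",
)
```

Everything downstream (generation, pipelines) works unchanged; the quantized linear layers are swapped in transparently.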

Usage

Use this when loading a large model that does not fit in GPU memory at full precision. Relative to fp16, the 8-bit mode roughly halves weight memory, while the 4-bit mode reduces it by about 4x, with minimal quality loss.
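Those ratios can be made concrete with back-of-envelope arithmetic. The 7B parameter count below is a hypothetical example, and activation memory and quantization overhead (fp16 outliers, per-block scales) are ignored:

```python
# Rough weight-only memory for a hypothetical 7B-parameter model.
params = 7_000_000_000

fp16_gb = params * 2 / 1e9    # 2 bytes per weight
int8_gb = params * 1 / 1e9    # 1 byte per weight (plus small fp16 outlier overhead)
nf4_gb = params * 0.5 / 1e9   # 4 bits per weight (plus per-block scales)

print(f"fp16: {fp16_gb:.1f} GB, int8: {int8_gb:.1f} GB, nf4: {nf4_gb:.1f} GB")
```

So a model that overflows a 16 GB GPU in fp16 fits comfortably in 8-bit, and in 4-bit leaves room for activations and a long context.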

Theoretical Basis

LLM.int8() (8-bit):

  1. Decompose weight matrices into outlier and non-outlier parts.
  2. Keep outliers in fp16, quantize non-outliers to int8.
  3. Multiply separately and sum.
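The three steps above can be sketched in numpy. This is an illustrative toy, not the bitsandbytes CUDA kernels; the 6.0 outlier threshold follows the LLM.int8() paper, and fp64 stands in for fp16:

```python
import numpy as np

def int8_mixed_matmul(x, W, threshold=6.0):
    """Toy LLM.int8() forward pass: x @ W with mixed-precision decomposition."""
    # 1. Decompose: a feature (column of x / row of W) is an outlier if any
    #    activation magnitude reaches the threshold.
    outliers = np.any(np.abs(x) >= threshold, axis=0)

    # 2a. Outlier features stay in floating point.
    y_fp = x[:, outliers] @ W[outliers, :]

    # 2b. Non-outlier features: absmax quantization to int8,
    #     row-wise for x and column-wise for W.
    x_n, W_n = x[:, ~outliers], W[~outliers, :]
    sx = np.abs(x_n).max(axis=1, keepdims=True) / 127.0
    sw = np.abs(W_n).max(axis=0, keepdims=True) / 127.0
    xq = np.round(x_n / sx).astype(np.int8)
    wq = np.round(W_n / sw).astype(np.int8)

    # 3. Multiply in int8 (accumulating in int32), dequantize, sum both parts.
    y_int = (xq.astype(np.int32) @ wq.astype(np.int32)).astype(np.float64) * sx * sw
    return y_fp + y_int
```

Because the handful of outlier features is handled exactly, the quantization error is confined to the well-behaved bulk of the features, which is why LLM.int8() preserves accuracy at scale.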

NF4 (4-bit):

  1. Assumes weights follow a normal distribution.
  2. Maps to 4-bit values optimally for normal distributions.
  3. Uses double quantization (quantizing the per-block scales themselves) for additional compression.
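The NF4 scheme above can be sketched in numpy. The 16 code values are the published NF4 constants from the QLoRA paper; double quantization of the per-block scales is noted but omitted for brevity:

```python
import numpy as np

# The 16 NF4 code values: quantiles of a standard normal distribution,
# normalized to [-1, 1], as defined in the QLoRA paper.
NF4_LEVELS = np.array([
    -1.0, -0.6961928, -0.52507305, -0.39491749,
    -0.28444138, -0.18477343, -0.09105004, 0.0,
    0.0795803, 0.1609302, 0.2461123, 0.33791524,
    0.44070983, 0.562617, 0.72295684, 1.0,
])

def nf4_quantize(w, block_size=64):
    """Blockwise NF4: returns 4-bit codes and one fp scale (absmax) per block."""
    blocks = w.reshape(-1, block_size)
    absmax = np.abs(blocks).max(axis=1, keepdims=True)
    normalized = blocks / absmax
    # Map each normalized weight to the nearest NF4 level (a 4-bit index).
    codes = np.argmin(np.abs(normalized[:, :, None] - NF4_LEVELS), axis=2)
    # (Real bitsandbytes additionally quantizes `absmax` to 8 bit: the
    # "double quantization" step, omitted here.)
    return codes, absmax

def nf4_dequantize(codes, absmax):
    return NF4_LEVELS[codes] * absmax
```

Because the levels are spaced according to normal-distribution quantiles, each of the 16 codes is used roughly equally often for Gaussian-like weights, which is what makes 4 bits per weight information-theoretically efficient here.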
