
Principle: Bitsandbytes Quantization (source: LLMBook-zh.github.io)

From Leeroopedia


Knowledge Sources
Domains: Deep_Learning, Model_Compression, Inference
Last Updated: 2026-02-08 00:00 GMT

Overview

A library-based quantization technique that loads pre-trained models directly in 8-bit or 4-bit precision using the bitsandbytes library.

Description

Bitsandbytes Quantization provides a seamless way to load large language models with reduced precision through HuggingFace's integration with the bitsandbytes library. The 8-bit mode uses LLM.int8() with mixed-precision decomposition (keeping outlier features in fp16), while the 4-bit mode uses NormalFloat4 (NF4) quantization from the QLoRA paper.

The key advantage is simplicity: quantization happens automatically at load time with a single flag (load_in_8bit or load_in_4bit, set directly or via a BitsAndBytesConfig in recent versions of transformers). No calibration data or separate conversion step is required.
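A minimal sketch of both loading modes, assuming the transformers, accelerate, and bitsandbytes packages and a CUDA GPU; the model name is illustrative only:

```python
# Sketch: load-time quantization via HuggingFace transformers + bitsandbytes.
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 8-bit: LLM.int8() with mixed-precision outlier decomposition.
model_8bit = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-1.3b",  # illustrative model name
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)

# 4-bit: NF4 quantization from the QLoRA paper, with double quantization.
model_4bit = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-1.3b",
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_use_double_quant=True,
    ),
    device_map="auto",
)
```

Everything downstream (generation, pipelines) works unchanged; the quantized linear layers are swapped in transparently.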

Usage

Use this when loading a large model that does not fit in GPU memory at full precision. Relative to fp16, the 8-bit mode roughly halves weight memory, while the 4-bit mode reduces it by about 4x, with minimal quality loss.
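Those ratios can be made concrete with back-of-envelope arithmetic. The 7B parameter count below is a hypothetical example, and activation memory and quantization overhead (fp16 outliers, per-block scales) are ignored:

```python
# Rough weight-only memory for a hypothetical 7B-parameter model.
params = 7_000_000_000

fp16_gb = params * 2 / 1e9    # 2 bytes per weight
int8_gb = params * 1 / 1e9    # 1 byte per weight (plus small fp16 outlier overhead)
nf4_gb = params * 0.5 / 1e9   # 4 bits per weight (plus per-block scales)

print(f"fp16: {fp16_gb:.1f} GB, int8: {int8_gb:.1f} GB, nf4: {nf4_gb:.1f} GB")
```

So a model that overflows a 16 GB GPU in fp16 fits comfortably in 8-bit, and in 4-bit leaves room for activations and a long context.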

Theoretical Basis

LLM.int8() (8-bit):

  1. Decompose weight matrices into outlier and non-outlier parts.
  2. Keep outliers in fp16, quantize non-outliers to int8.
  3. Multiply separately and sum.
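The three steps above can be sketched in numpy. This is an illustrative toy, not the bitsandbytes CUDA kernels; the 6.0 outlier threshold follows the LLM.int8() paper, and fp64 stands in for fp16:

```python
import numpy as np

def int8_mixed_matmul(x, W, threshold=6.0):
    """Toy LLM.int8() forward pass: x @ W with mixed-precision decomposition."""
    # 1. Decompose: a feature (column of x / row of W) is an outlier if any
    #    activation magnitude reaches the threshold.
    outliers = np.any(np.abs(x) >= threshold, axis=0)

    # 2a. Outlier features stay in floating point.
    y_fp = x[:, outliers] @ W[outliers, :]

    # 2b. Non-outlier features: absmax quantization to int8,
    #     row-wise for x and column-wise for W.
    x_n, W_n = x[:, ~outliers], W[~outliers, :]
    sx = np.abs(x_n).max(axis=1, keepdims=True) / 127.0
    sw = np.abs(W_n).max(axis=0, keepdims=True) / 127.0
    xq = np.round(x_n / sx).astype(np.int8)
    wq = np.round(W_n / sw).astype(np.int8)

    # 3. Multiply in int8 (accumulating in int32), dequantize, sum both parts.
    y_int = (xq.astype(np.int32) @ wq.astype(np.int32)).astype(np.float64) * sx * sw
    return y_fp + y_int
```

Because the handful of outlier features is handled exactly, the quantization error is confined to the well-behaved bulk of the features, which is why LLM.int8() preserves accuracy at scale.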

NF4 (4-bit):

  1. Assumes weights follow a normal distribution.
  2. Maps to 4-bit values optimally for normal distributions.
  3. Uses double quantization (quantizing the per-block scales themselves) for additional compression.
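The NF4 scheme above can be sketched in numpy. The 16 code values are the published NF4 constants from the QLoRA paper; double quantization of the per-block scales is noted but omitted for brevity:

```python
import numpy as np

# The 16 NF4 code values: quantiles of a standard normal distribution,
# normalized to [-1, 1], as defined in the QLoRA paper.
NF4_LEVELS = np.array([
    -1.0, -0.6961928, -0.52507305, -0.39491749,
    -0.28444138, -0.18477343, -0.09105004, 0.0,
    0.0795803, 0.1609302, 0.2461123, 0.33791524,
    0.44070983, 0.562617, 0.72295684, 1.0,
])

def nf4_quantize(w, block_size=64):
    """Blockwise NF4: returns 4-bit codes and one fp scale (absmax) per block."""
    blocks = w.reshape(-1, block_size)
    absmax = np.abs(blocks).max(axis=1, keepdims=True)
    normalized = blocks / absmax
    # Map each normalized weight to the nearest NF4 level (a 4-bit index).
    codes = np.argmin(np.abs(normalized[:, :, None] - NF4_LEVELS), axis=2)
    # (Real bitsandbytes additionally quantizes `absmax` to 8 bit: the
    # "double quantization" step, omitted here.)
    return codes, absmax

def nf4_dequantize(codes, absmax):
    return NF4_LEVELS[codes] * absmax
```

Because the levels are spaced according to normal-distribution quantiles, each of the 16 codes is used roughly equally often for Gaussian-like weights, which is what makes 4 bits per weight information-theoretically efficient here.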
