Environment: Bitsandbytes Quantization (LLMBook-zh, llmbook-zh.github.io)
| Field | Value |
|---|---|
| Domains | Infrastructure, Quantization, LLMs |
| Last Updated | 2026-02-08 04:30 GMT |
Overview
Bitsandbytes library environment for 8-bit (LLM.int8()) and 4-bit (NF4) model quantization on NVIDIA GPUs.
Description
This environment provides the bitsandbytes library for post-training quantization of large language models. The codebase demonstrates two quantization modes: 8-bit quantization via load_in_8bit=True and 4-bit quantization via load_in_4bit=True, both integrated through the AutoModelForCausalLM.from_pretrained() interface. GPU memory monitoring confirms the VRAM savings. The GPTQ workflow also requires this environment via GPTQConfig with the auto-gptq backend.
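The Overview mentions NF4 for the 4-bit mode. Beyond the bare load_in_4bit=True flag shown in the code evidence, transformers also accepts an explicit BitsAndBytesConfig that selects the NF4 data type. A minimal sketch, assuming a placeholder model name and that a CUDA GPU is available:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Explicit 4-bit NF4 configuration (same effect as load_in_4bit=True,
# but with the quantization data type and compute dtype spelled out)
nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat4 data type
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16
)

model = AutoModelForCausalLM.from_pretrained(
    "your-model-name",  # placeholder: any causal LM on the Hub
    device_map="auto",
    quantization_config=nf4_config,
)
```

This is a configuration sketch, not the source's own code; running it requires a CUDA GPU with bitsandbytes installed.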
Usage
Use this environment when you need to reduce model memory footprint for inference or training. Required when loading models with load_in_8bit or load_in_4bit flags, or when using GPTQConfig for Hessian-based quantization.
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| OS | Linux | Official bitsandbytes CUDA builds target Linux; see Compatibility Notes for Windows |
| Hardware | NVIDIA GPU with CUDA | Required for quantized model operations |
| CUDA | CUDA >= 11.7 | Required by bitsandbytes |
Dependencies
Python Packages
- `bitsandbytes` >= 0.39.0
- `transformers` >= 4.30 (for integrated quantization support)
- `auto-gptq` >= 0.4.0 (for GPTQ quantization only)
Credentials
- `HF_TOKEN`: Hugging Face API token (if using gated models like LLaMA).
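A hedged sketch of reading the token, assuming it is exported as the `HF_TOKEN` environment variable (the variable name and the commented call site are illustrative):

```python
import os

# Read the Hugging Face token from the environment; None means it is unset.
hf_token = os.environ.get("HF_TOKEN")

# For gated models, pass it through to from_pretrained, e.g.:
# AutoModelForCausalLM.from_pretrained(name, token=hf_token, ...)
```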
Quick Install
# Install bitsandbytes for 8-bit/4-bit quantization
pip install bitsandbytes transformers
# For GPTQ support
pip install auto-gptq
# Verify installation
python -c "import bitsandbytes; print('bitsandbytes installed')"
Code Evidence
8-bit quantization from `code/9.3 bitsandbytes实践.py:6`:
model_8bit = AutoModelForCausalLM.from_pretrained(
name, device_map="auto", load_in_8bit=True
)
4-bit quantization from `code/9.3 bitsandbytes实践.py:11`:
model = AutoModelForCausalLM.from_pretrained(
name, device_map="auto", load_in_4bit=True
)
GPU memory monitoring from `code/9.3 bitsandbytes实践.py:7`:
print(f"memory usage: {torch.cuda.memory_allocated()/1000/1000/1000} GB")
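Note that the snippet divides by 1000 three times, so it reports decimal gigabytes (GB), whereas nvidia-smi displays binary gibibytes (GiB). A small helper (not from the source code) makes the distinction explicit:

```python
def bytes_to_gb(num_bytes: int) -> float:
    """Decimal gigabytes (GB), matching the /1000/1000/1000 in the snippet."""
    return num_bytes / 1_000_000_000

def bytes_to_gib(num_bytes: int) -> float:
    """Binary gibibytes (GiB), the unit nvidia-smi displays."""
    return num_bytes / (1024 ** 3)

# Usage with torch: bytes_to_gib(torch.cuda.memory_allocated())
```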
GPTQ configuration from `code/9.4 GPTQ实践.py:7`:
quantization_config = GPTQConfig(
bits=4, dataset="c4", tokenizer=tokenizer
)
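The config above is only half of the workflow: in the transformers/auto-gptq integration, quantization runs when the config is passed to from_pretrained. A sketch of the full call, assuming a placeholder model name (the source does not show which model it quantizes):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

name = "facebook/opt-125m"  # placeholder small model for illustration
tokenizer = AutoTokenizer.from_pretrained(name)

# Hessian-based 4-bit GPTQ quantization with c4 as the calibration dataset
quantization_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)

# Quantization happens during loading; requires a CUDA GPU and auto-gptq
model = AutoModelForCausalLM.from_pretrained(
    name, device_map="auto", quantization_config=quantization_config
)
```

Calibration over c4 can take several minutes even for small models, since GPTQ processes the dataset layer by layer.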
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `ImportError: bitsandbytes not found` | bitsandbytes not installed | `pip install bitsandbytes` |
| `RuntimeError: CUDA Setup failed` | CUDA version incompatible with bitsandbytes | Ensure CUDA >= 11.7 and matching bitsandbytes version |
| `CUDA out of memory` during quantization | Insufficient VRAM even for quantized model | Try 4-bit instead of 8-bit quantization |
Compatibility Notes
- Linux First: bitsandbytes officially targets Linux, and Windows support is limited; use WSL2 on Windows.
- NVIDIA GPUs Only: bitsandbytes does not support AMD ROCm or Intel XPU.
- GPTQ Calibration: GPTQ requires a calibration dataset (c4 used in example) during quantization.