Environment: Bitsandbytes Quantization (LLMBook-zh, llmbook-zh.github.io)
| Field | Value |
|---|---|
| Domains | Infrastructure, Quantization, LLMs |
| Last Updated | 2026-02-08 04:30 GMT |
Overview
Bitsandbytes library environment for 8-bit (LLM.int8()) and 4-bit (NF4) model quantization on NVIDIA GPUs.
Description
This environment provides the bitsandbytes library for post-training quantization of large language models. The codebase demonstrates two quantization modes: 8-bit quantization via load_in_8bit=True and 4-bit quantization via load_in_4bit=True, both integrated through the AutoModelForCausalLM.from_pretrained() interface. GPU memory monitoring confirms the VRAM savings. The GPTQ workflow also requires this environment via GPTQConfig with the auto-gptq backend.
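The Overview mentions NF4 for the 4-bit mode. Beyond the bare load_in_4bit=True flag shown in the code evidence, transformers also accepts an explicit BitsAndBytesConfig that selects the NF4 data type. A minimal sketch, assuming a placeholder model name and that a CUDA GPU is available:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Explicit 4-bit NF4 configuration (same effect as load_in_4bit=True,
# but with the quantization data type and compute dtype spelled out)
nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat4 data type
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16
)

model = AutoModelForCausalLM.from_pretrained(
    "your-model-name",  # placeholder: any causal LM on the Hub
    device_map="auto",
    quantization_config=nf4_config,
)
```

This is a configuration sketch, not the source's own code; running it requires a CUDA GPU with bitsandbytes installed.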
Usage
Use this environment when you need to reduce model memory footprint for inference or training. Required when loading models with load_in_8bit or load_in_4bit flags, or when using GPTQConfig for Hessian-based quantization.
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| OS | Linux | Official bitsandbytes CUDA builds target Linux; see Compatibility Notes for Windows |
| Hardware | NVIDIA GPU with CUDA | Required for quantized model operations |
| CUDA | CUDA >= 11.7 | Required by bitsandbytes |
Dependencies
Python Packages
- `bitsandbytes` >= 0.39.0
- `transformers` >= 4.30 (for integrated quantization support)
- `auto-gptq` >= 0.4.0 (for GPTQ quantization only)
Credentials
- `HF_TOKEN`: Hugging Face API token (if using gated models like LLaMA).
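A hedged sketch of reading the token, assuming it is exported as the `HF_TOKEN` environment variable (the variable name and the commented call site are illustrative):

```python
import os

# Read the Hugging Face token from the environment; None means it is unset.
hf_token = os.environ.get("HF_TOKEN")

# For gated models, pass it through to from_pretrained, e.g.:
# AutoModelForCausalLM.from_pretrained(name, token=hf_token, ...)
```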
Quick Install
# Install bitsandbytes for 8-bit/4-bit quantization
pip install bitsandbytes transformers
# For GPTQ support
pip install auto-gptq
# Verify installation
python -c "import bitsandbytes; print('bitsandbytes installed')"
Code Evidence
8-bit quantization from `code/9.3 bitsandbytes实践.py:6`:
model_8bit = AutoModelForCausalLM.from_pretrained(
name, device_map="auto", load_in_8bit=True
)
4-bit quantization from `code/9.3 bitsandbytes实践.py:11`:
model = AutoModelForCausalLM.from_pretrained(
name, device_map="auto", load_in_4bit=True
)
GPU memory monitoring from `code/9.3 bitsandbytes实践.py:7`:
print(f"memory usage: {torch.cuda.memory_allocated()/1000/1000/1000} GB")
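Note that the snippet divides by 1000 three times, so it reports decimal gigabytes (GB), whereas nvidia-smi displays binary gibibytes (GiB). A small helper (not from the source code) makes the distinction explicit:

```python
def bytes_to_gb(num_bytes: int) -> float:
    """Decimal gigabytes (GB), matching the /1000/1000/1000 in the snippet."""
    return num_bytes / 1_000_000_000

def bytes_to_gib(num_bytes: int) -> float:
    """Binary gibibytes (GiB), the unit nvidia-smi displays."""
    return num_bytes / (1024 ** 3)

# Usage with torch: bytes_to_gib(torch.cuda.memory_allocated())
```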
GPTQ configuration from `code/9.4 GPTQ实践.py:7`:
quantization_config = GPTQConfig(
bits=4, dataset="c4", tokenizer=tokenizer
)
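The config above is only half of the workflow: in the transformers/auto-gptq integration, quantization runs when the config is passed to from_pretrained. A sketch of the full call, assuming a placeholder model name (the source does not show which model it quantizes):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

name = "facebook/opt-125m"  # placeholder small model for illustration
tokenizer = AutoTokenizer.from_pretrained(name)

# Hessian-based 4-bit GPTQ quantization with c4 as the calibration dataset
quantization_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)

# Quantization happens during loading; requires a CUDA GPU and auto-gptq
model = AutoModelForCausalLM.from_pretrained(
    name, device_map="auto", quantization_config=quantization_config
)
```

Calibration over c4 can take several minutes even for small models, since GPTQ processes the dataset layer by layer.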
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `ImportError: bitsandbytes not found` | bitsandbytes not installed | `pip install bitsandbytes` |
| `RuntimeError: CUDA Setup failed` | CUDA version incompatible with bitsandbytes | Ensure CUDA >= 11.7 and matching bitsandbytes version |
| `CUDA out of memory` during quantization | Insufficient VRAM even for quantized model | Try 4-bit instead of 8-bit quantization |
Compatibility Notes
- Linux First: bitsandbytes officially targets Linux, and Windows support is limited; use WSL2 on Windows.
- NVIDIA GPUs Only: bitsandbytes does not support AMD ROCm or Intel XPU.
- GPTQ Calibration: GPTQ requires a calibration dataset (c4 used in example) during quantization.