
Environment: LLMBook-zh/LLMBook-zh.github.io Bitsandbytes Quantization Environment

From Leeroopedia


Knowledge Sources
Domains: Infrastructure, Quantization, LLMs
Last Updated: 2026-02-08 04:30 GMT

Overview

Bitsandbytes library environment for 8-bit (LLM.int8()) and 4-bit (NF4) model quantization on NVIDIA GPUs.

Description

This environment provides the bitsandbytes library for post-training quantization of large language models. The codebase demonstrates two quantization modes: 8-bit quantization (LLM.int8()) via load_in_8bit=True and 4-bit quantization (NF4) via load_in_4bit=True, both integrated through the AutoModelForCausalLM.from_pretrained() interface. GPU memory monitoring confirms the VRAM savings. The GPTQ workflow also depends on this environment, configured through GPTQConfig with the auto-gptq backend.

Usage

Use this environment when you need to reduce model memory footprint for inference or training. Required when loading models with load_in_8bit or load_in_4bit flags, or when using GPTQConfig for Hessian-based quantization.
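As a minimal sketch of the 8-bit path (the model name `facebook/opt-350m` is an illustrative stand-in, not from the source; an NVIDIA GPU is required):

```python
# Minimal 8-bit loading sketch. Assumes transformers >= 4.30 and a CUDA GPU.
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# BitsAndBytesConfig is the current interface; passing load_in_8bit=True
# directly to from_pretrained() (as the source code does) also works.
bnb_config = BitsAndBytesConfig(load_in_8bit=True)

model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-350m",          # illustrative model, not from the source
    device_map="auto",
    quantization_config=bnb_config,
)
```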

System Requirements

| Category | Requirement | Notes |
|----------|-------------|-------|
| OS | Linux | bitsandbytes requires Linux for CUDA support |
| Hardware | NVIDIA GPU with CUDA | Required for quantized model operations |
| CUDA | CUDA >= 11.7 | Required by bitsandbytes |

Dependencies

Python Packages

  • `bitsandbytes` >= 0.39.0
  • `transformers` >= 4.30 (for integrated quantization support)
  • `auto-gptq` >= 0.4.0 (for GPTQ quantization only)

Credentials

  • `HF_TOKEN`: Hugging Face API token (if using gated models like LLaMA).
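For gated checkpoints, the token can be read from the environment and used to authenticate the session; a sketch using the `huggingface_hub` client (pulled in as a transformers dependency, not listed above):

```python
import os
from huggingface_hub import login

# Reads the token from the HF_TOKEN environment variable and authenticates
# this session; gated models can then be fetched by from_pretrained().
login(token=os.environ["HF_TOKEN"])
```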

Quick Install

# Install bitsandbytes for 8-bit/4-bit quantization
pip install bitsandbytes transformers

# For GPTQ support
pip install auto-gptq

# Verify installation
python -c "import bitsandbytes; print('bitsandbytes installed')"
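A slightly more thorough check confirms all three packages resolve before any model is loaded; a sketch (note that the pip package `auto-gptq` is imported as the module `auto_gptq`):

```python
import importlib.util

def installed(pkg: str) -> bool:
    """Return True if the module `pkg` is importable in this environment."""
    return importlib.util.find_spec(pkg) is not None

if __name__ == "__main__":
    # The pip package auto-gptq installs the module auto_gptq.
    for pkg in ("bitsandbytes", "transformers", "auto_gptq"):
        print(f"{pkg}: {'ok' if installed(pkg) else 'MISSING'}")
```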

Code Evidence

8-bit quantization from `code/9.3 bitsandbytes实践.py:6`:

model_8bit = AutoModelForCausalLM.from_pretrained(
    name, device_map="auto", load_in_8bit=True
)

4-bit quantization from `code/9.3 bitsandbytes实践.py:11`:

model = AutoModelForCausalLM.from_pretrained(
    name, device_map="auto", load_in_4bit=True
)
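The overview mentions NF4; with transformers' BitsAndBytesConfig the 4-bit data type and compute dtype can be set explicitly. A sketch (these particular settings are common choices assumed here, not taken from the source; the model name is illustrative):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# NF4 4-bit config: normal-float quantization with bfloat16 compute.
nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # NF4 data type (default is "fp4")
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,        # also quantize the quantization constants
)

model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-350m",  # illustrative model, not from the source
    device_map="auto",
    quantization_config=nf4_config,
)
```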

GPU memory monitoring from `code/9.3 bitsandbytes实践.py:7`:

print(f"memory usage: {torch.cuda.memory_allocated()/1000/1000/1000} GB")
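Note the snippet above divides by 1000 three times, so it reports decimal gigabytes; a small helper for binary GiB (a hypothetical convenience, not in the source):

```python
def bytes_to_gib(n_bytes: int) -> float:
    """Convert a byte count to GiB (1 GiB = 1024**3 bytes)."""
    return n_bytes / 1024 ** 3

# Usage with torch (assumes a CUDA device is active):
#   import torch
#   print(f"memory usage: {bytes_to_gib(torch.cuda.memory_allocated()):.2f} GiB")
```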

GPTQ configuration from `code/9.4 GPTQ实践.py:7`:

quantization_config = GPTQConfig(
    bits=4, dataset="c4", tokenizer=tokenizer
)
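The config alone does not quantize anything; it is passed to from_pretrained(), which runs calibration on the named dataset. A sketch (the model name is illustrative; GPTQ calibration needs a GPU and can take a while):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "facebook/opt-350m"  # illustrative, not from the source
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Hessian-based 4-bit quantization, calibrated on the c4 dataset.
quantization_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)

model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="auto", quantization_config=quantization_config
)
```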

Common Errors

| Error Message | Cause | Solution |
|---------------|-------|----------|
| `ImportError: bitsandbytes not found` | bitsandbytes not installed | `pip install bitsandbytes` |
| `RuntimeError: CUDA Setup failed` | CUDA version incompatible with bitsandbytes | Ensure CUDA >= 11.7 and a matching bitsandbytes version |
| `CUDA out of memory` during quantization | Insufficient VRAM even for the quantized model | Try 4-bit instead of 8-bit quantization |
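The suggested fix for the out-of-memory case (fall back from 8-bit to 4-bit) can be factored into a small retry helper; a sketch, where the loader arguments would wrap the from_pretrained() calls shown in Code Evidence:

```python
def load_with_fallback(load_8bit, load_4bit):
    """Call load_8bit(); on a CUDA out-of-memory RuntimeError, retry load_4bit().

    Both arguments are zero-argument callables, e.g.
        lambda: AutoModelForCausalLM.from_pretrained(
            name, device_map="auto", load_in_8bit=True)
    """
    try:
        return load_8bit()
    except RuntimeError as err:
        if "out of memory" not in str(err).lower():
            raise  # unrelated failure: do not mask it
        return load_4bit()
```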

Compatibility Notes

  • Linux Only: bitsandbytes has limited Windows support. Use WSL2 on Windows.
  • NVIDIA GPUs Only: bitsandbytes does not support AMD ROCm or Intel XPU.
  • GPTQ Calibration: GPTQ requires a calibration dataset (c4 used in example) during quantization.
