

Heuristic:Spcl Graph of Thoughts Four-Bit Quantization for Local LLMs

From Leeroopedia
Knowledge Sources
Domains LLM_Reasoning, Optimization, GPU_Computing
Last Updated 2026-02-14 03:30 GMT

Overview

Memory optimization technique using NF4 4-bit quantization with double quantization and bfloat16 compute to run LLaMA-2 models on consumer GPUs.

Description

The Llama2HF language model backend uses BitsAndBytes 4-bit quantization with the NF4 (Normal Float 4) quantization type and double quantization to drastically reduce the GPU memory footprint of LLaMA-2 models. This allows a 7B parameter model to run on a single consumer GPU (6-8GB VRAM) instead of requiring 14-28GB in full precision. The compute dtype is set to `torch.bfloat16` for the dequantization step, balancing speed and numerical stability. Combined with `device_map="auto"` from the Accelerate library, larger models (13B, 70B) are automatically split across multiple GPUs.
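Putting these pieces together, a minimal self-contained sketch of the setup described above might look like the following. This is a simplified illustration, not the backend's exact code: it assumes the `transformers`, `accelerate`, and `bitsandbytes` packages are installed, uses `meta-llama/Llama-2-7b-chat-hf` as a placeholder model ID, and omits the backend's `model_config` and `trust_remote_code` arguments.

```python
import torch
import transformers

# 4-bit NF4 quantization with double quantization; bfloat16 is used
# as the compute dtype when blocks are dequantized on the fly.
bnb_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model_id = "meta-llama/Llama-2-7b-chat-hf"  # placeholder model ID
model = transformers.AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # Accelerate places/splits layers across available GPUs
)
tokenizer = transformers.AutoTokenizer.from_pretrained(model_id)
model.eval()  # inference mode: disable dropout and other training-only behavior
```

Loading a gated LLaMA-2 checkpoint additionally requires accepting Meta's license and authenticating with a Hugging Face token.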

Usage

Use this heuristic when:

  • Running LLaMA-2 models locally instead of using the OpenAI API
  • GPU VRAM is limited (consumer GPUs with 8-24GB)
  • You need to run 13B or 70B models that would not fit in full precision
  • Cost is a concern and you want to avoid per-token API charges

The Insight (Rule of Thumb)

  • Action: Configure `BitsAndBytesConfig` with `load_in_4bit=True`, `bnb_4bit_quant_type="nf4"`, `bnb_4bit_use_double_quant=True`, and `bnb_4bit_compute_dtype=torch.bfloat16`.
  • Value: Reduces model memory by roughly 3.5-4x relative to float16 (from ~14GB to ~4GB for the 7B model, from ~26GB to ~7GB for 13B).
  • Trade-off: Slight quality degradation compared to full precision. NF4 is optimized for normally distributed weights (common in LLMs) and double quantization further compresses the quantization constants.
  • Additional settings: Call `model.eval()` to disable training-mode behavior such as dropout, and wrap inference calls in a `torch.no_grad()` context manager (or use it as a decorator) to disable gradient tracking and reduce memory overhead. Note that a bare `torch.no_grad()` statement by itself is a no-op.
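The distinction in the last bullet matters in practice: `torch.no_grad()` only disables gradient tracking when entered as a context manager (or applied as a decorator). A small demonstration with plain tensors, assuming only that PyTorch is installed:

```python
import torch

x = torch.ones(3, requires_grad=True)

y = (x * 2).sum()
assert y.requires_grad  # gradients are tracked by default

with torch.no_grad():
    z = (x * 2).sum()
assert not z.requires_grad  # tracking disabled inside the context

torch.no_grad()  # bare call: constructs a context-manager object, then discards it
w = (x * 2).sum()
assert w.requires_grad  # still tracked: the bare call had no effect
```

During generation, skipping gradient tracking avoids storing activations for backpropagation, which is pure overhead at inference time.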

Reasoning

LLaMA-2 models in full precision (float32) require approximately 4 bytes per parameter:

  • 7B model: ~28GB VRAM
  • 13B model: ~52GB VRAM
  • 70B model: ~280GB VRAM

Even in float16 (2 bytes per parameter), the 7B model requires ~14GB, exceeding many consumer GPUs. NF4 quantization stores weights at 4 bits (~0.5 bytes) per parameter, plus a small overhead for quantization constants:

  • 7B model: ~4GB VRAM (fits on RTX 3060/4060)
  • 13B model: ~7GB VRAM (fits on RTX 3080/4070)
  • 70B model: ~35GB VRAM (requires 2x RTX 3090/4090 or A100)

The `nf4` quantization type is specifically designed for normally distributed weights, which is the typical distribution in pretrained language models. Double quantization applies a second level of quantization to the quantization constants themselves, saving an additional ~0.4 bits per parameter.
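The memory figures above follow from simple arithmetic, which can be reproduced with a small helper (decimal gigabytes; `overhead_bits` is a rough allowance for quantization constants, not an exact figure):

```python
def estimate_vram_gb(params_billions: float,
                     bits_per_param: float,
                     overhead_bits: float = 0.0) -> float:
    """Rough weight-memory estimate: params x (bits + overhead) / 8 bytes each."""
    return params_billions * (bits_per_param + overhead_bits) / 8

# float32: 32 bits (4 bytes) per parameter
print(estimate_vram_gb(7, 32))    # 28.0 GB
print(estimate_vram_gb(70, 32))   # 280.0 GB

# NF4: 4 bits per parameter plus ~0.4 bits of quantization-constant overhead
# (double quantization is what keeps this overhead small)
print(round(estimate_vram_gb(7, 4, 0.4), 2))   # 3.85 GB, i.e. "~4GB"
print(round(estimate_vram_gb(13, 4, 0.4), 2))  # 7.15 GB, i.e. "~7GB"
```

Actual peak usage is somewhat higher, since activations, the KV cache, and framework buffers consume VRAM on top of the weights.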

Code Evidence

BitsAndBytes configuration from `graph_of_thoughts/language_models/llamachat_hf.py:54-59`:

bnb_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

Auto device mapping from `graph_of_thoughts/language_models/llamachat_hf.py:62-68`:

self.model = transformers.AutoModelForCausalLM.from_pretrained(
    hf_model_id,
    trust_remote_code=True,
    config=model_config,
    quantization_config=bnb_config,
    device_map="auto",
)

Inference mode from `graph_of_thoughts/language_models/llamachat_hf.py:69-70` (note: the bare `torch.no_grad()` call on the second line is a no-op; gradient tracking is only disabled when `torch.no_grad()` is used as a context manager or decorator around the inference call):

self.model.eval()
torch.no_grad()

Cache directory setup from `graph_of_thoughts/language_models/llamachat_hf.py:48-49`:

# Important: must be done before importing transformers
os.environ["TRANSFORMERS_CACHE"] = self.config["cache_dir"]

Related Pages

Page Connections

  • Principle
  • Implementation
  • Heuristic
  • Environment