Heuristic: spcl/graph-of-thoughts 4-Bit Quantization for Local LLMs
| Knowledge Sources | |
|---|---|
| Domains | LLM_Reasoning, Optimization, GPU_Computing |
| Last Updated | 2026-02-14 03:30 GMT |
Overview
Memory optimization technique using NF4 4-bit quantization with double quantization and bfloat16 compute to run LLaMA-2 models on consumer GPUs.
Description
The Llama2HF language model backend uses bitsandbytes 4-bit quantization with the NF4 (NormalFloat 4) quantization type and double quantization to drastically reduce the GPU memory footprint of LLaMA-2 models. This allows a 7B-parameter model to run on a single consumer GPU (6-8GB VRAM) instead of the ~14GB (float16) or ~28GB (float32) it would otherwise require. The compute dtype is set to `torch.bfloat16` for the dequantization step, balancing speed and numerical stability. Combined with `device_map="auto"` from the Accelerate library, larger models (13B, 70B) are split automatically across multiple GPUs.
Usage
Use this heuristic when:
- Running LLaMA-2 models locally instead of using the OpenAI API
- GPU VRAM is limited (consumer GPUs with 8-24GB)
- You need to run 13B or 70B models that would not fit in full precision
- Cost is a concern and you want to avoid per-token API charges
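The VRAM arithmetic behind these bullets can be sketched as a small planning helper (hypothetical function names; the bytes-per-parameter figures are weight-only approximations that ignore activations and the KV cache):

```python
# Rough bytes-per-parameter for each loading mode (weights only;
# activations and the KV cache add further overhead at runtime).
BYTES_PER_PARAM = {
    "float32": 4.0,
    "float16": 2.0,
    "nf4": 0.5,  # 4-bit NF4; double quantization shaves a little more
}

def weight_vram_gb(n_params: float, mode: str) -> float:
    """Estimate weight memory in GB for a model with n_params parameters."""
    return n_params * BYTES_PER_PARAM[mode] / 1e9

def fits_in_vram(n_params: float, vram_gb: float, mode: str = "nf4") -> bool:
    """Heuristic check: leave ~20% headroom for activations and the KV cache."""
    return weight_vram_gb(n_params, mode) <= vram_gb * 0.8

# A 7B model in NF4 (~3.5GB of weights) fits on an 8GB consumer GPU;
# the same model in float16 (~14GB) does not.
print(fits_in_vram(7e9, 8.0, "nf4"))      # True
print(fits_in_vram(7e9, 8.0, "float16"))  # False
```

The 20% headroom factor is a conservative guess, not a measured constant; long contexts can demand considerably more KV-cache memory.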
The Insight (Rule of Thumb)
- Action: Configure `BitsAndBytesConfig` with `load_in_4bit=True`, `bnb_4bit_quant_type="nf4"`, `bnb_4bit_use_double_quant=True`, and `bnb_4bit_compute_dtype=torch.bfloat16`.
- Value: Cuts weight memory roughly 4x versus float16 (from ~14GB to ~4GB for the 7B model, from ~26GB to ~7GB for the 13B).
- Trade-off: Slight quality degradation compared to full precision. NF4 is optimized for normally distributed weights (common in LLMs) and double quantization further compresses the quantization constants.
- Additional settings: Call `model.eval()` and run inference inside a `with torch.no_grad():` block (a bare `torch.no_grad()` call is a no-op) to disable gradient tracking and reduce memory overhead.
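A minimal, self-contained sketch of the inference-mode settings (a small `nn.Linear` stands in for the quantized LLaMA model; `torch` is assumed to be installed):

```python
import torch

model = torch.nn.Linear(4, 2)  # stand-in for the quantized LLaMA model
model.eval()                   # switch off dropout-style layers

x = torch.randn(1, 4)

# A bare `torch.no_grad()` call constructs the context object and
# immediately discards it, so it disables nothing. Use it as a
# context manager so no autograd graph is recorded.
with torch.no_grad():
    y = model(x)

print(y.requires_grad)  # False: no gradient bookkeeping for this forward pass
```

`torch.set_grad_enabled(False)` is an alternative when gradients should stay off globally rather than per block.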
Reasoning
LLaMA-2 models in full precision (float32) require approximately 4 bytes per parameter:
- 7B model: ~28GB VRAM
- 13B model: ~52GB VRAM
- 70B model: ~280GB VRAM
Even in float16, the 7B model requires ~14GB, exceeding many consumer GPUs. NF4 quantization reduces this to ~0.5 bytes per parameter (4 bits), plus a small overhead for quantization constants:
- 7B model: ~4GB VRAM (fits on RTX 3060/4060)
- 13B model: ~7GB VRAM (fits on RTX 3080/4070)
- 70B model: ~35GB VRAM (requires 2x RTX 3090/4090 or A100)
The `nf4` quantization type is specifically designed for normally distributed weights, which is the typical distribution in pretrained language models. Double quantization applies a second level of quantization to the quantization constants themselves, saving an additional ~0.4 bits per parameter.
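The ~0.4-bit figure can be reproduced with the block-size arithmetic from the QLoRA paper (block size 64 for weights, 256 for the second-level constants; the numbers below are approximations):

```python
# Double-quantization savings, per parameter.
block = 64

# First level: one float32 absmax constant per 64-weight block.
const_bits_plain = 32 / block                       # 0.5 bits per parameter

# Second level: constants quantized to 8-bit, with their own float32
# constants shared across blocks of 256.
const_bits_double = 8 / block + 32 / (block * 256)  # ~0.127 bits per parameter

print(const_bits_plain - const_bits_double)          # ~0.373 bits/param saved
```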
Code Evidence
BitsAndBytes configuration from `graph_of_thoughts/language_models/llamachat_hf.py:54-59`:
```python
bnb_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
```
Auto device mapping from `graph_of_thoughts/language_models/llamachat_hf.py:62-68`:
```python
self.model = transformers.AutoModelForCausalLM.from_pretrained(
    hf_model_id,
    trust_remote_code=True,
    config=model_config,
    quantization_config=bnb_config,
    device_map="auto",
)
```
Inference mode from `graph_of_thoughts/language_models/llamachat_hf.py:69-70` (note: the bare `torch.no_grad()` call here is a no-op; gradients are only disabled when it is used as a context manager):
```python
self.model.eval()
torch.no_grad()  # no-op as written; inference should run inside `with torch.no_grad():`
```
Cache directory setup from `graph_of_thoughts/language_models/llamachat_hf.py:48-49`:
```python
# Important: must be done before importing transformers
os.environ["TRANSFORMERS_CACHE"] = self.config["cache_dir"]
```
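The same pattern can be reproduced standalone (the cache path below is hypothetical; recent transformers releases prefer `HF_HOME` over the deprecated `TRANSFORMERS_CACHE`):

```python
import os

# Hypothetical cache location; must be set before `import transformers`,
# because the cache path is read once at import time.
cache_dir = "/data/hf_cache"
os.environ["TRANSFORMERS_CACHE"] = cache_dir
# Newer variable honored by recent releases; points at the cache root.
os.environ["HF_HOME"] = cache_dir

print(os.environ["TRANSFORMERS_CACHE"])  # /data/hf_cache
```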