Heuristic:Ggml org Llama cpp Quantization Quality Tips

From Leeroopedia
Knowledge Sources
Domains: Optimization, Quantization
Last Updated: 2026-02-14 22:00 GMT

Overview

Quantization best practices: always use an importance matrix (imatrix), prefer Q4_K_M for best quality-to-size ratio, and never re-quantize already-quantized models.

Description

Model quantization reduces weight precision to shrink model size and improve inference speed, at the cost of some accuracy. How much quality is lost depends heavily on the quantization type and on whether an importance matrix is used. The importance matrix (generated by llama-imatrix) identifies which weights matter most for model quality, so the quantizer can preserve precision where it has the greatest effect.

Usage

Use this heuristic when choosing a quantization type for model deployment or when evaluating quantized model quality. This is relevant for the Model Quantization workflow and any deployment scenario where model size vs. quality is a consideration.

The Insight (Rule of Thumb)

  • Action 1: Always generate an importance matrix with llama-imatrix and pass it to the quantizer via --imatrix. The official documentation marks this option as highly recommended.
  • Action 2: Use Q4_K_M as the default quantization type for the best balance of quality vs. size.
  • Action 3: Never re-quantize an already-quantized model. Always quantize from F16 or F32 source.
  • Action 4: Use --leave-output-tensor to leave the output tensor unquantized. This increases model size but may improve quality.
  • Trade-off: Higher-precision quants (Q6_K, Q8_0) are larger but stay closer to the reference model. Very low-bit quants (IQ2_XXS) are much smaller but suffer noticeable quality loss.
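The actions above can be combined into a two-step workflow: build the importance matrix, then quantize with it. The sketch below only assembles the argument lists and prints them; it does not run anything. The file names are hypothetical, and the `-m`, `-f`, and `-o` options of llama-imatrix as well as the positional order of the llama-quantize arguments are assumptions based on common usage, so check them against `--help` for your build.

```python
import shlex

def build_imatrix_cmd(model_path, calibration_text, imatrix_out):
    """Argument list for generating an importance matrix (assumed flags: -m, -f, -o)."""
    return ["llama-imatrix", "-m", model_path, "-f", calibration_text, "-o", imatrix_out]

def build_quantize_cmd(model_path, output_path, imatrix_path,
                       quant_type="Q4_K_M", leave_output_tensor=False):
    """Argument list for quantizing an F16/F32 source model with an importance matrix."""
    cmd = ["llama-quantize", "--imatrix", imatrix_path]
    if leave_output_tensor:
        cmd.append("--leave-output-tensor")
    # --allow-requantize is deliberately never added: the source must be F16 or F32.
    # Assumed positional order: input model, output model, quantization type.
    cmd += [model_path, output_path, quant_type]
    return cmd

# Hypothetical file names, for illustration only.
imatrix_cmd = build_imatrix_cmd("model-f16.gguf", "calibration.txt", "imatrix.dat")
quantize_cmd = build_quantize_cmd("model-f16.gguf", "model-q4_k_m.gguf", "imatrix.dat",
                                  leave_output_tensor=True)

print(shlex.join(imatrix_cmd))
print(shlex.join(quantize_cmd))
```

Keeping the commands as argument lists makes them easy to pass to `subprocess.run` later without shell-quoting issues.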

Reasoning

The quantize README provides empirical quality benchmarks for Llama 3 8B on Wikitext-2:

| Quant Type | KL Divergence | Model Size | Quality Assessment |
| Q8_0 | 0.0014 | ~8.5 GB | Near reference quality |
| Q6_K | 0.0055 | ~6.6 GB | High quality |
| Q5_K_M | 0.0108 | ~5.7 GB | Good quality |
| Q4_K_M (with imatrix) | 0.0281 | ~4.9 GB | Best balance for most use cases |
| Q4_K_M (no imatrix) | 0.0421 | ~4.9 GB | Noticeable quality loss |
| IQ2_XXS | 0.812 | ~2.7 GB | Extreme compression, significant quality loss |
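One way to apply these figures is to pick the lowest-KL quant type that fits a given size budget. The following is a minimal sketch using only the rows listed above; the helper name `best_quant_for_budget` is hypothetical, and the budget covers file size only (runtime memory such as the KV cache is not accounted for).

```python
# (quant type, KL divergence, approximate size in GB), copied from the figures above.
QUANTS = [
    ("Q8_0", 0.0014, 8.5),
    ("Q6_K", 0.0055, 6.6),
    ("Q5_K_M", 0.0108, 5.7),
    ("Q4_K_M (with imatrix)", 0.0281, 4.9),
    ("Q4_K_M (no imatrix)", 0.0421, 4.9),
    ("IQ2_XXS", 0.812, 2.7),
]

def best_quant_for_budget(budget_gb):
    """Return the lowest-KL row whose size fits within budget_gb, or None if none fits."""
    fitting = [row for row in QUANTS if row[2] <= budget_gb]
    return min(fitting, key=lambda row: row[1]) if fitting else None

# How much the KL divergence grows for Q4_K_M when the importance matrix is skipped.
imatrix_penalty = 0.0421 / 0.0281

print(best_quant_for_budget(6.0))
print(best_quant_for_budget(5.0))
print(f"Q4_K_M without imatrix: {imatrix_penalty:.2f}x the KL divergence")
```

At equal file size, skipping the importance matrix raises the listed KL divergence by roughly half, which is why the imatrix row is always preferred when both fit.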

The documentation (tools/quantize/README.md:48) explicitly warns against re-quantization:

--allow-requantize: allows requantizing tensors that have already been quantized.
Warning: This can severely reduce quality compared to quantizing from 16bit or 32bit
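The direction of this effect can be illustrated with a toy experiment. The sketch below uses plain symmetric round-to-nearest quantization with a single scale, which is an assumption made for illustration and not the block-wise scheme llama.cpp actually uses. It compares quantizing weights directly to a 4-bit grid against going through an 8-bit grid first. Because direct rounding always picks the nearest 4-bit grid point to the original value, the two-step path cannot do better in this setup.

```python
import random

def quantize_to_grid(values, scale, qmax):
    """Round each value to the nearest multiple of scale, clamped to [-qmax, qmax] steps."""
    out = []
    for v in values:
        q = max(-qmax, min(qmax, round(v / scale)))
        out.append(q * scale)
    return out

def mse(reference, approx):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(reference, approx)) / len(reference)

random.seed(0)
weights = [random.gauss(0.0, 1.0) for _ in range(4096)]  # stand-in for real weights
amax = max(abs(w) for w in weights)

scale8 = amax / 127  # toy 8-bit symmetric grid
scale4 = amax / 7    # toy 4-bit symmetric grid

direct = quantize_to_grid(weights, scale4, 7)
via_8bit = quantize_to_grid(quantize_to_grid(weights, scale8, 127), scale4, 7)

direct_mse = mse(weights, direct)
requant_mse = mse(weights, via_8bit)

print(f"direct 4-bit MSE:        {direct_mse:.6f}")
print(f"8-bit then 4-bit MSE:    {requant_mse:.6f}")
```

The toy shows only that errors compound rather than cancel; the size of the penalty for real quantization types depends on the formats involved.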

The same file (tools/quantize/README.md:51) recommends the importance matrix:

--imatrix: uses data in file generated by llama-imatrix as importance matrix
for quant optimizations (highly recommended)

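tools/quantize/README.md also lists full-model sizes before and after Q4_K_M quantization:

| Model Size | Original | Q4_K_M Quantized |
| 8B | 32.1 GB | 4.9 GB |
| 70B | 280.9 GB | 43.1 GB |
| 405B | 1,625.1 GB | 249.1 GB |

Note: the original sizes work out to roughly 4 bytes per parameter (32.1 GB for 8B parameters), which matches 32-bit weights rather than F16; an F16 copy would be about half as large.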
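The arithmetic behind these figures can be checked in a few lines. This sketch computes the compression ratio and the implied storage per parameter; it assumes nominal parameter counts (8B = 8e9, and so on) and 1 GB = 1e9 bytes, neither of which is stated in the source.

```python
# model name: (nominal parameter count, original size in GB, Q4_K_M size in GB)
MODELS = {
    "8B": (8e9, 32.1, 4.9),
    "70B": (70e9, 280.9, 43.1),
    "405B": (405e9, 1625.1, 249.1),
}

def compression_ratio(original_gb, quantized_gb):
    """How many times smaller the quantized file is than the original."""
    return original_gb / quantized_gb

def bits_per_weight(size_gb, n_params):
    """Implied storage per parameter in bits, assuming 1 GB = 1e9 bytes."""
    return size_gb * 1e9 * 8 / n_params

for name, (n_params, original_gb, quantized_gb) in MODELS.items():
    ratio = compression_ratio(original_gb, quantized_gb)
    print(f"{name}: {ratio:.2f}x smaller, "
          f"{bits_per_weight(original_gb, n_params):.1f} -> "
          f"{bits_per_weight(quantized_gb, n_params):.1f} bits per weight")
```

Under these assumptions every row shrinks by about 6.5x, from roughly 32 bits to roughly 4.9 bits per weight.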
