Heuristic:Ggml org Llama cpp Quantization Quality Tips
| Knowledge Sources | |
|---|---|
| Domains | Optimization, Quantization |
| Last Updated | 2026-02-14 22:00 GMT |
Overview
Quantization best practices: always use an importance matrix (imatrix), prefer Q4_K_M for best quality-to-size ratio, and never re-quantize already-quantized models.
Description
Model quantization reduces weight precision to shrink model size and improve inference speed, but at the cost of some accuracy. The quality loss depends heavily on the quantization method chosen and whether an importance matrix is used. The importance matrix (generated by llama-imatrix) identifies which weights are most important for model quality, allowing the quantizer to preserve precision where it matters most.
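To make the idea of "preserving precision where it matters most" concrete, here is a toy sketch of an importance-weighted scale search. It is not llama.cpp's actual algorithm. The weights, importance values, grid size, and candidate scales are all hypothetical, chosen only to show how a per-weight importance term changes which quantization scale looks best.

```python
def quantize(weights, scale, levels=7):
    """Round each weight to the nearest point on a symmetric integer grid."""
    return [scale * max(-levels, min(levels, round(w / scale))) for w in weights]


def weighted_error(weights, importance, scale):
    """Sum of squared rounding errors, each multiplied by its importance."""
    dequantized = quantize(weights, scale)
    return sum(i * (w - d) ** 2 for w, d, i in zip(weights, dequantized, importance))


def best_scale(weights, importance, candidates):
    """Pick the candidate scale with the lowest importance-weighted error."""
    return min(candidates, key=lambda s: weighted_error(weights, importance, s))


# Hypothetical block of weights and made-up importance values.
weights = [0.11, -0.52, 0.93, 0.27, -0.08, 0.64]
importance = [0.1, 5.0, 0.2, 0.1, 3.0, 0.1]
uniform = [1.0] * len(weights)

# Candidate scales from 0.7x to 1.3x of the naive max-abs scale.
max_abs = max(abs(w) for w in weights)
candidates = [max_abs / 7 * (0.7 + 0.01 * k) for k in range(61)]

scale_plain = best_scale(weights, uniform, candidates)
scale_imatrix = best_scale(weights, importance, candidates)

print("error on important weights, plain scale:  ",
      weighted_error(weights, importance, scale_plain))
print("error on important weights, weighted scale:",
      weighted_error(weights, importance, scale_imatrix))
```

Because both scales are drawn from the same candidate list, the importance-aware choice can never score worse on the importance-weighted error than the uniform choice, which is the intuition behind using an importance matrix.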
Usage
Use this heuristic when choosing a quantization type for model deployment or when evaluating quantized model quality. This is relevant for the Model Quantization workflow and any deployment scenario where model size vs. quality is a consideration.
The Insight (Rule of Thumb)
- Action 1: Always generate and use an importance matrix (--imatrix) for quantization. This is highly recommended per the official documentation.
- Action 2: Use Q4_K_M as the default quantization type for the best balance of quality vs. size.
- Action 3: Never re-quantize an already-quantized model. Always quantize from F16 or F32 source.
- Action 4: Use --leave-output-tensor to keep the output layer at full precision for better quality.
- Trade-off: Higher quality quants (Q6_K, Q8_0) are larger but closer to reference. Lower quants (IQ2_XXS) are much smaller but suffer noticeable quality loss.
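The actions above can be sketched as a small helper that assembles the two command lines. This is a sketch under assumptions: the file names are hypothetical, the `llama-imatrix` and `llama-quantize` binaries are assumed to be on the PATH, and the `-m`, `-f`, and `-o` options of `llama-imatrix` are assumed to match your build (check `--help`). The script only prints the commands and does not run them.

```python
import shlex


def imatrix_command(model_f16, calibration_text, imatrix_out):
    """Command line for generating an importance matrix (Action 1)."""
    return ["llama-imatrix", "-m", model_f16, "-f", calibration_text, "-o", imatrix_out]


def quantize_command(model_f16, model_out, quant_type="Q4_K_M",
                     imatrix=None, leave_output_tensor=False):
    """Command line for llama-quantize; options go before the positional arguments.

    --allow-requantize is deliberately never added (Action 3): the input is
    expected to be an F16 or F32 GGUF file.
    """
    cmd = ["llama-quantize"]
    if imatrix is not None:
        cmd += ["--imatrix", imatrix]          # Action 1
    if leave_output_tensor:
        cmd.append("--leave-output-tensor")    # Action 4
    cmd += [model_f16, model_out, quant_type]  # Action 2: Q4_K_M by default
    return cmd


# Hypothetical file names, for illustration only.
steps = [
    imatrix_command("model-f16.gguf", "calibration.txt", "imatrix.gguf"),
    quantize_command("model-f16.gguf", "model-Q4_K_M.gguf",
                     imatrix="imatrix.gguf", leave_output_tensor=True),
]

for step in steps:
    print(shlex.join(step))
```

Keeping the commands as argument lists avoids shell-quoting problems if they are later passed to `subprocess.run`.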
Reasoning
The quantize README provides empirical quality benchmarks for Llama 3 8B on Wikitext-2:
| Quant Type | KL Divergence | Model Size | Quality Assessment |
|---|---|---|---|
| Q8_0 | 0.0014 | ~8.5 GB | Near reference quality |
| Q6_K | 0.0055 | ~6.6 GB | High quality |
| Q5_K_M | 0.0108 | ~5.7 GB | Good quality |
| Q4_K_M (with imatrix) | 0.0281 | ~4.9 GB | Best balance for most use cases |
| Q4_K_M (no imatrix) | 0.0421 | ~4.9 GB | Noticeable quality loss |
| IQ2_XXS | 0.812 | ~2.7 GB | Extreme compression, significant quality loss |
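The table can be read as a lookup: given a tolerance for KL divergence, pick the smallest quant that stays within it. The sketch below copies the figures from the table as-is. The helper name and the budget values are hypothetical, and the numbers apply only to the benchmark described above.

```python
# (quant type, KL divergence, approximate size in GB), copied from the table above.
BENCHMARKS = [
    ("Q8_0", 0.0014, 8.5),
    ("Q6_K", 0.0055, 6.6),
    ("Q5_K_M", 0.0108, 5.7),
    ("Q4_K_M (with imatrix)", 0.0281, 4.9),
    ("Q4_K_M (no imatrix)", 0.0421, 4.9),
    ("IQ2_XXS", 0.812, 2.7),
]


def smallest_within_budget(max_kl):
    """Return the smallest quant whose KL divergence is at most max_kl, or None."""
    candidates = [row for row in BENCHMARKS if row[1] <= max_kl]
    if not candidates:
        return None
    # Sort by size first, then by KL divergence to break ties between equal sizes.
    return min(candidates, key=lambda row: (row[2], row[1]))[0]


def imatrix_penalty():
    """How much higher the KL divergence of Q4_K_M is without an imatrix."""
    kl = {name: value for name, value, _ in BENCHMARKS}
    return kl["Q4_K_M (no imatrix)"] / kl["Q4_K_M (with imatrix)"]


print(smallest_within_budget(0.03))
print(smallest_within_budget(0.006))
print(f"KL divergence without imatrix: {imatrix_penalty():.2f}x the imatrix value")
```

On these figures, skipping the importance matrix raises the KL divergence of Q4_K_M by roughly half at no saving in file size, which is the basis for Action 1.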
The documentation explicitly warns against re-quantization (tools/quantize/README.md:48):
--allow-requantize: allows requantizing tensors that have already been quantized.
Warning: This can severely reduce quality compared to quantizing from 16bit or 32bit
It also recommends the importance matrix (tools/quantize/README.md:51):
--imatrix: uses data in file generated by llama-imatrix as importance matrix
for quant optimizations (highly recommended)
Memory and disk requirements for full models, from tools/quantize/README.md:
| Model Size | Original Size | Q4_K_M Quantized |
|---|---|---|
| 8B | 32.1 GB | 4.9 GB |
| 70B | 280.9 GB | 43.1 GB |
| 405B | 1,625.1 GB | 249.1 GB |
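As a quick check on these figures, the sketch below computes the size reduction for each row and a rough lower bound on disk space during quantization. The lower bound assumes only the source file and the quantized output need to coexist on disk. That is a simplification, since conversion intermediates, the importance matrix, and calibration data are ignored.

```python
# Model size -> (original size in GB, Q4_K_M size in GB), copied from the table above.
SIZES_GB = {
    "8B": (32.1, 4.9),
    "70B": (280.9, 43.1),
    "405B": (1625.1, 249.1),
}


def compression_ratio(original_gb, quantized_gb):
    """How many times smaller the quantized file is than the original."""
    return original_gb / quantized_gb


def disk_needed_gb(original_gb, quantized_gb):
    """Space to hold the source and the quantized output at the same time."""
    return original_gb + quantized_gb


for name, (original, quantized) in SIZES_GB.items():
    print(f"{name}: {compression_ratio(original, quantized):.2f}x smaller, "
          f"at least {disk_needed_gb(original, quantized):.1f} GB on disk while quantizing")
```

All three rows come out at roughly a 6.5x reduction, so the Q4_K_M size of another model in this family can be estimated by dividing its original size by about 6.5.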