Heuristic:Ggml org Llama cpp Quantization Quality Tips

Knowledge Sources	Quantize README llama.cpp
Domains	Optimization, Quantization
Last Updated	2026-02-14 22:00 GMT

Overview

Quantization best practices: always use an importance matrix (imatrix), prefer Q4_K_M for best quality-to-size ratio, and never re-quantize already-quantized models.

Description

Model quantization reduces weight precision to shrink model size and improve inference speed, but at the cost of some accuracy. The quality loss depends heavily on the quantization method chosen and whether an importance matrix is used. The importance matrix (generated by llama-imatrix) identifies which weights are most important for model quality, allowing the quantizer to preserve precision where it matters most.
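The role of the importance matrix can be illustrated with a toy quantizer. The sketch below is not llama.cpp's algorithm: the block values, the importance weights, and the helper names (quantize_block, weighted_error, best_scale) are all hypothetical. It assumes a simple symmetric 4-bit scheme and only shows the general idea that weighting the rounding error by importance changes which scale the quantizer picks.

```python
def quantize_block(values, scale, bits=4):
    """Round each value to a signed integer grid and map it back to floats."""
    qmax = 2 ** (bits - 1) - 1
    qmin = -qmax - 1
    return [max(qmin, min(qmax, round(v / scale))) * scale for v in values]


def weighted_error(values, approx, importance):
    """Squared reconstruction error, weighted per value by its importance."""
    return sum(w * (v - a) ** 2 for v, a, w in zip(values, approx, importance))


def best_scale(values, importance, bits=4, candidates=64):
    """Try scales around max|v| / qmax and keep the lowest weighted error."""
    qmax = 2 ** (bits - 1) - 1
    base = max(abs(v) for v in values) / qmax
    best = None
    for i in range(1, candidates + 1):
        scale = base * (0.5 + i / candidates)  # roughly 0.5x to 1.5x of base
        err = weighted_error(values, quantize_block(values, scale, bits), importance)
        if best is None or err < best[1]:
            best = (scale, err)
    return best


if __name__ == "__main__":
    # Hypothetical block of weights; two of them are marked as important.
    values = [0.9, -0.31, 0.05, 0.42, -0.77, 0.13, 0.58, -0.02]
    importance = [0.1, 5.0, 0.1, 0.1, 0.1, 5.0, 0.1, 0.1]
    uniform = [1.0] * len(values)

    scale_imp, err_imp = best_scale(values, importance)
    scale_uni, _ = best_scale(values, uniform)
    err_uni = weighted_error(values, quantize_block(values, scale_uni), importance)
    print(f"importance-aware scale: {scale_imp:.4f}, weighted error {err_imp:.6f}")
    print(f"uniform scale:          {scale_uni:.4f}, weighted error {err_uni:.6f}")
```

Because both searches run over the same candidate scales, the importance-aware choice can never have a higher weighted error than the uniform one; whether the gap is large depends on the data.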

Usage

Use this heuristic when choosing a quantization type for model deployment or when evaluating quantized model quality. This is relevant for the Model Quantization workflow and any deployment scenario where model size vs. quality is a consideration.

The Insight (Rule of Thumb)

Action 1: Always generate an importance matrix with llama-imatrix and pass it to the quantizer via --imatrix. The official documentation marks this option as "highly recommended".
Action 2: Use Q4_K_M as the default quantization type for the best balance of quality vs. size.
Action 3: Never re-quantize an already-quantized model. Always quantize from F16 or F32 source.
Action 4: Use --leave-output-tensor to leave the output tensor (output.weight) unquantized. This increases model size slightly but may improve quality.
Trade-off: Higher-quality quants (Q6_K, Q8_0) are larger but closer to the reference model. Very low-bit quants (e.g. IQ2_XXS) are much smaller but lose significant quality.
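The four actions can be combined into a two-step command sequence. The following is a minimal sketch that only builds the argument lists; the file names are hypothetical placeholders, and the helper names (imatrix_command, quantize_command) are illustrative rather than part of llama.cpp. It assumes llama-imatrix and llama-quantize are on the PATH and accept the options shown (-m, -f, -o for llama-imatrix; --imatrix and --leave-output-tensor before the positional arguments for llama-quantize).

```python
import shlex


def imatrix_command(model_f16, calibration_text, imatrix_out):
    """Build the llama-imatrix call that produces the importance matrix."""
    return ["llama-imatrix", "-m", str(model_f16),
            "-f", str(calibration_text), "-o", str(imatrix_out)]


def quantize_command(model_f16, model_out, imatrix_file,
                     quant_type="Q4_K_M", leave_output_tensor=True):
    """Build the llama-quantize call; --allow-requantize is deliberately never added."""
    cmd = ["llama-quantize", "--imatrix", str(imatrix_file)]
    if leave_output_tensor:
        cmd.append("--leave-output-tensor")
    # Positional arguments: source model, output model, quantization type.
    cmd += [str(model_f16), str(model_out), quant_type]
    return cmd


if __name__ == "__main__":
    # Hypothetical file names; the source model must be F16 or F32, not a quant.
    steps = [
        imatrix_command("model-f16.gguf", "calibration.txt", "imatrix.gguf"),
        quantize_command("model-f16.gguf", "model-q4_k_m.gguf", "imatrix.gguf"),
    ]
    for step in steps:
        print(shlex.join(step))
        # To execute instead of print: subprocess.run(step, check=True)
```

Keeping the commands as lists avoids shell-quoting problems if they are later passed to subprocess.run.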

Reasoning

The quantize README provides empirical quality benchmarks for Llama 3 8B on Wikitext-2 (lower KL divergence means output closer to the unquantized reference):

Quant Type	KL Divergence	Model Size	Quality Assessment
Q8_0	0.0014	~8.5 GB	Near reference quality
Q6_K	0.0055	~6.6 GB	High quality
Q5_K_M	0.0108	~5.7 GB	Good quality
Q4_K_M (with imatrix)	0.0281	~4.9 GB	Best balance for most use cases
Q4_K_M (no imatrix)	0.0421	~4.9 GB	Noticeable quality loss
IQ2_XXS	0.812	~2.7 GB	Extreme compression, significant quality loss
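The figures in the table can be used directly to reason about the trade-off. The sketch below copies the table values and adds two hypothetical helpers: kl_reduction, which computes the relative KL improvement from using an imatrix, and smallest_within_budget, which picks the smallest quant type whose KL divergence stays under a chosen budget. The budget values are illustrative assumptions, not recommendations from the README.

```python
# (KL divergence, approximate size in GB), copied from the table above.
BENCHMARKS = {
    "Q8_0": (0.0014, 8.5),
    "Q6_K": (0.0055, 6.6),
    "Q5_K_M": (0.0108, 5.7),
    "Q4_K_M (with imatrix)": (0.0281, 4.9),
    "Q4_K_M (no imatrix)": (0.0421, 4.9),
    "IQ2_XXS": (0.812, 2.7),
}


def kl_reduction(kl_with, kl_without):
    """Relative drop in KL divergence when the imatrix is used."""
    return (kl_without - kl_with) / kl_without


def smallest_within_budget(max_kl, table=BENCHMARKS):
    """Smallest quant type whose KL divergence is at or below max_kl, else None."""
    ok = [(size, kl, name) for name, (kl, size) in table.items() if kl <= max_kl]
    return min(ok)[2] if ok else None


if __name__ == "__main__":
    gain = kl_reduction(BENCHMARKS["Q4_K_M (with imatrix)"][0],
                        BENCHMARKS["Q4_K_M (no imatrix)"][0])
    print(f"imatrix lowers Q4_K_M KL divergence by {gain:.1%} at the same size")
    for budget in (0.01, 0.03):  # hypothetical KL budgets
        print(budget, "->", smallest_within_budget(budget))
```

At equal file size, the imatrix variant of Q4_K_M has roughly a third less KL divergence than the variant without it, which is the basis for Action 1.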

The documentation (tools/quantize/README.md:48) explicitly warns against re-quantization:

--allow-requantize: allows requantizing tensors that have already been quantized.
Warning: This can severely reduce quality compared to quantizing from 16bit or 32bit
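Why re-quantization hurts can be seen with a single number. The sketch below is a toy uniform quantizer with hypothetical step sizes, not the scheme llama.cpp uses: it rounds one value directly to the target grid, then rounds the same value to an intermediate grid first and to the target grid second. Re-quantization is not worse for every individual value, but the first rounding error is baked in before the second pass, so errors tend to compound.

```python
def quantize(value, step):
    """Snap a value to the nearest multiple of step (uniform quantizer)."""
    return round(value / step) * step


original = 0.49
first_step = 0.2   # hypothetical grid of an earlier quantization
final_step = 0.3   # hypothetical grid of the target quantization

# Quantize straight from the full-precision value.
direct = quantize(original, final_step)

# Quantize an already-quantized value (what --allow-requantize permits).
requantized = quantize(quantize(original, first_step), final_step)

direct_error = abs(direct - original)
requant_error = abs(requantized - original)

print(f"direct:      {direct:.2f} (error {direct_error:.2f})")
print(f"requantized: {requantized:.2f} (error {requant_error:.2f})")
```

In this example the direct path lands on 0.6 while the two-step path lands on 0.3, because the intermediate rounding to 0.4 pushed the value across a decision boundary of the final grid.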

The README also recommends the importance matrix (tools/quantize/README.md:51):

--imatrix: uses data in file generated by llama-imatrix as importance matrix
for quant optimizations (highly recommended)

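Memory and disk requirements for full models, from tools/quantize/README.md. The original sizes work out to about 4 bytes per parameter, which matches unquantized 32-bit weights rather than F16:

Model Size	Original (unquantized)	Q4_K_M Quantized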
8B	32.1 GB	4.9 GB
70B	280.9 GB	43.1 GB
405B	1,625.1 GB	249.1 GB
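The table's figures can be sanity-checked with a little arithmetic. The sketch below assumes the nominal parameter counts (8, 70, and 405 billion) and 1 GB = 10^9 bytes; both are simplifying assumptions, and the helper names are hypothetical. Under those assumptions the original column comes out near 4 bytes per parameter and the Q4_K_M column near 4.9 bits per weight.

```python
# model label -> (nominal parameter count, original size GB, Q4_K_M size GB)
SIZES = {
    "8B": (8e9, 32.1, 4.9),
    "70B": (70e9, 280.9, 43.1),
    "405B": (405e9, 1625.1, 249.1),
}


def bytes_per_param(params, size_gb):
    """Average bytes stored per parameter, assuming 1 GB = 1e9 bytes."""
    return size_gb * 1e9 / params


def bits_per_weight(params, size_gb):
    """Average bits stored per parameter, assuming 1 GB = 1e9 bytes."""
    return bytes_per_param(params, size_gb) * 8


if __name__ == "__main__":
    for name, (params, original_gb, quant_gb) in SIZES.items():
        print(f"{name}: original {bytes_per_param(params, original_gb):.2f} B/param, "
              f"Q4_K_M {bits_per_weight(params, quant_gb):.2f} bits/weight, "
              f"compression {original_gb / quant_gb:.2f}x")
```

The compression ratio is close to 6.5x for all three model sizes, so the Q4_K_M footprint of another model from the same family can be estimated from its original size.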
