# Heuristic: GGML Quantization Type Selection
| Knowledge Sources | |
|---|---|
| Domains | Optimization, Model_Compression |
| Last Updated | 2026-02-10 07:40 GMT |
## Overview
GGML supports 10 quantization types from q2_k to q8_0, with k-quant types (q2_k through q6_k) offering better quality-to-size ratios than legacy types (q4_0, q5_0) at similar compression levels.
## Description
GGML provides two families of quantization formats. Legacy types (q4_0, q4_1, q5_0, q5_1, q8_0) use uniform per-block quantization with simple round-to-nearest. K-quant types (q2_k, q3_k, q4_k, q5_k, q6_k) use non-uniform quantization with importance-weighted calibration, achieving better model quality at the same or smaller file sizes. The quantization dispatch system in `ggml_quantize_chunk` routes to the appropriate implementation based on the target type.
## Usage
Use this heuristic when choosing a quantization format for model compression. Start with q5_k or q4_k for a good balance of size and quality. Use q8_0 for the highest quality at moderate compression. Use q2_k only when extreme compression is needed and some quality loss is acceptable.
## The Insight (Rule of Thumb)
- Action: Choose a quantization type based on your quality/size trade-off needs.
- Value: Recommended progression from highest quality to smallest size:
  - `q8_0`: ~50% size reduction, minimal quality loss (8-bit)
  - `q6_k`: ~58% reduction, very good quality
  - `q5_k`: ~63% reduction, good quality (recommended default)
  - `q4_k`: ~68% reduction, acceptable quality loss
  - `q4_0`: ~68% reduction, legacy format (lower quality than q4_k)
  - `q3_k`: ~73% reduction, noticeable quality loss
  - `q2_k`: ~78% reduction, significant quality loss
- Trade-off: K-quant types (q*_k) use importance matrices for better quality at the same bit width compared to legacy types. They require slightly more compute for quantization but not for inference.
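The progression above can be encoded as a small selection helper. The sketch below is illustrative and not part of GGML: the `pick_quant_type` function and its thresholds simply restate the table, returning the highest-quality type that meets a caller's minimum size-reduction target (q4_0 is omitted since q4_k dominates it at the same reduction).

```cpp
#include <cassert>
#include <string>
#include <vector>

// Illustrative sketch (not part of GGML): encode the quality/size table
// above and pick the highest-quality type meeting a size-reduction target.
struct QuantOption {
    std::string name;
    double size_reduction;  // approximate fraction of size saved
};

// Ordered from highest quality to smallest size, as in the list above.
static const std::vector<QuantOption> kOptions = {
    {"q8_0", 0.50},
    {"q6_k", 0.58},
    {"q5_k", 0.63},
    {"q4_k", 0.68},
    {"q3_k", 0.73},
    {"q2_k", 0.78},
};

// Return the first (highest-quality) type whose reduction meets the target.
std::string pick_quant_type(double min_reduction) {
    for (const auto & opt : kOptions) {
        if (opt.size_reduction >= min_reduction) {
            return opt.name;
        }
    }
    return kOptions.back().name;  // fall back to the smallest type
}
```

For example, a target of at least 60% size reduction selects q5_k, while an unreachable target falls back to q2_k.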
## Reasoning
The k-quant types introduced in GGML represent a significant advancement over the original uniform quantization:
- Importance weighting: K-quant types use per-tensor importance matrices (when available) to allocate more bits to important weights. This is why q4_k outperforms q4_0 despite the same average bit width.
- Mixed precision: K-quant blocks use different precision for different components within each block, optimizing the quality/size ratio.
- Hardware compatibility: All quantization types have optimized SIMD implementations for x86 (AVX2/AVX-512), ARM (NEON), and other architectures.
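The contrast with the legacy family is easiest to see concretely. The sketch below is a self-contained, simplified rendition of uniform per-block round-to-nearest quantization in the style of q8_0 (one scale per 32-element block, int8 values); it is not GGML's actual code or memory layout (GGML stores the scale as fp16, for instance). Because every weight in a block shares one scale, a single outlier coarsens the whole block, which is the weakness that importance-weighted k-quants address.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

// Illustrative sketch of legacy-style uniform quantization (q8_0-like):
// each 32-element block stores one float scale and 32 int8 values.
constexpr int kBlockSize = 32;

struct BlockQ8 {
    float  scale;            // per-block scale factor
    int8_t qs[kBlockSize];   // quantized values in [-127, 127]
};

// Uniform round-to-nearest: one scale for the whole block, no
// importance weighting, no mixed precision.
BlockQ8 quantize_block(const float * x) {
    float amax = 0.0f;
    for (int i = 0; i < kBlockSize; ++i) {
        amax = std::max(amax, std::fabs(x[i]));
    }
    BlockQ8 b;
    b.scale = amax / 127.0f;
    const float inv = (b.scale != 0.0f) ? 1.0f / b.scale : 0.0f;
    for (int i = 0; i < kBlockSize; ++i) {
        b.qs[i] = static_cast<int8_t>(std::lround(x[i] * inv));
    }
    return b;
}

// Dequantize back to floats to inspect the rounding error, which is
// bounded by half the block scale.
std::vector<float> dequantize_block(const BlockQ8 & b) {
    std::vector<float> out(kBlockSize);
    for (int i = 0; i < kBlockSize; ++i) {
        out[i] = b.scale * b.qs[i];
    }
    return out;
}
```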
## Code Evidence
Supported quantization types from `examples/common-ggml.cpp:6-17`:
```cpp
static const std::map<std::string, enum ggml_ftype> GGML_FTYPE_MAP = {
    {"q4_0", GGML_FTYPE_MOSTLY_Q4_0},
    {"q4_1", GGML_FTYPE_MOSTLY_Q4_1},
    {"q5_0", GGML_FTYPE_MOSTLY_Q5_0},
    {"q5_1", GGML_FTYPE_MOSTLY_Q5_1},
    {"q8_0", GGML_FTYPE_MOSTLY_Q8_0},
    {"q2_k", GGML_FTYPE_MOSTLY_Q2_K},
    {"q3_k", GGML_FTYPE_MOSTLY_Q3_K},
    {"q4_k", GGML_FTYPE_MOSTLY_Q4_K},
    {"q5_k", GGML_FTYPE_MOSTLY_Q5_K},
    {"q6_k", GGML_FTYPE_MOSTLY_Q6_K},
};
```
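A minimal sketch of how such a name-to-type map is typically consulted when parsing a user-supplied type string. The enum values below are stand-ins for illustration, not the real `ggml_ftype` definitions from `ggml.h`:

```cpp
#include <cassert>
#include <map>
#include <optional>
#include <string>

// Stand-in enum for illustration; the real values live in ggml.h.
enum class FType { Q4_0, Q4_K, Q5_K, Q8_0 };

static const std::map<std::string, FType> kFtypeMap = {
    {"q4_0", FType::Q4_0},
    {"q4_k", FType::Q4_K},
    {"q5_k", FType::Q5_K},
    {"q8_0", FType::Q8_0},
};

// Look up a user-supplied type string; std::nullopt signals an unknown
// name so the caller can report the valid choices instead of crashing.
std::optional<FType> parse_ftype(const std::string & name) {
    auto it = kFtypeMap.find(name);
    if (it == kFtypeMap.end()) {
        return std::nullopt;
    }
    return it->second;
}
```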
Quantization dispatch handling from `src/ggml.c` (ggml_quantize_chunk function):
```c
// The function dispatches to type-specific quantization functions
// based on the target type, supporting both standard and
// importance-matrix-weighted quantization.
size_t ggml_quantize_chunk(
    enum ggml_type type,
    const float  * src,
    void         * dst,
    int64_t        start,
    int64_t        nrows,
    int64_t        n_per_row,
    const float  * imatrix);
```
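The dispatch pattern can be illustrated with a self-contained sketch: a switch on the target type routes the chunk to a per-type quantizer and returns the number of bytes written. The types, quantizer bodies, and byte counts below are toy stand-ins to show the structure, not GGML's real implementations or block sizes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Toy stand-ins illustrating the dispatch structure of a
// ggml_quantize_chunk-style function; not GGML's real code.
enum class QType { Q4_K, Q8_0 };

// Hypothetical per-type quantizers: return bytes that would be written
// for n input values. Real implementations pack blocks into dst.
static size_t quantize_q8_0(const float * src, void * dst, int64_t n) {
    (void)src; (void)dst;
    return static_cast<size_t>(n) + (n / 32) * sizeof(float);  // int8 + scales
}
static size_t quantize_q4_k(const float * src, void * dst, int64_t n) {
    (void)src; (void)dst;
    return static_cast<size_t>(n) / 2 + (n / 256) * 16;  // 4-bit + metadata
}

// Route to the type-specific implementation, mirroring how the real
// dispatcher selects an implementation from the target type.
size_t quantize_chunk(QType type, const float * src, void * dst, int64_t n) {
    switch (type) {
        case QType::Q8_0: return quantize_q8_0(src, dst, n);
        case QType::Q4_K: return quantize_q4_k(src, dst, n);
    }
    return 0;  // unreachable with a valid type
}
```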