# Heuristic: GGML Quantization Type Selection
| Knowledge Sources | |
|---|---|
| Domains | Optimization, Model_Compression |
| Last Updated | 2026-02-10 07:40 GMT |
## Overview
GGML supports 10 quantization types from q2_k to q8_0, with k-quant types (q2_k through q6_k) offering better quality-to-size ratios than legacy types (q4_0, q5_0) at similar compression levels.
## Description
GGML provides two families of quantization formats. Legacy types (q4_0, q4_1, q5_0, q5_1, q8_0) use uniform per-block quantization with simple round-to-nearest. K-quant types (q2_k, q3_k, q4_k, q5_k, q6_k) use non-uniform quantization with importance-weighted calibration, achieving better model quality at the same or smaller file sizes. The quantization dispatch system in `ggml_quantize_chunk` routes to the appropriate implementation based on the target type.
## Usage
Use this heuristic when choosing a quantization format for model compression. Start with q5_k or q4_k for a good balance of size and quality. Use q8_0 for the highest quality at moderate compression. Use q2_k only when extreme compression is needed and some quality loss is acceptable.
## The Insight (Rule of Thumb)
- Action: Choose a quantization type based on your quality/size trade-off needs.
- Value: Recommended progression from highest quality to smallest size:
  - `q8_0`: ~50% size reduction, minimal quality loss (8-bit)
  - `q6_k`: ~58% reduction, very good quality
  - `q5_k`: ~63% reduction, good quality (recommended default)
  - `q4_k`: ~68% reduction, acceptable quality loss
  - `q4_0`: ~68% reduction, legacy format (lower quality than q4_k)
  - `q3_k`: ~73% reduction, noticeable quality loss
  - `q2_k`: ~78% reduction, significant quality loss
- Trade-off: K-quant types (q*_k) use importance matrices for better quality at the same bit width compared to legacy types. They require slightly more compute for quantization but not for inference.
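The progression above can be encoded as a small selection helper. The sketch below is illustrative and not part of GGML: the `pick_quant_type` function and its thresholds simply restate the table, returning the highest-quality type that meets a caller's minimum size-reduction target (q4_0 is omitted since q4_k dominates it at the same reduction).

```cpp
#include <cassert>
#include <string>
#include <vector>

// Illustrative sketch (not part of GGML): encode the quality/size table
// above and pick the highest-quality type meeting a size-reduction target.
struct QuantOption {
    std::string name;
    double size_reduction;  // approximate fraction of size saved
};

// Ordered from highest quality to smallest size, as in the list above.
static const std::vector<QuantOption> kOptions = {
    {"q8_0", 0.50},
    {"q6_k", 0.58},
    {"q5_k", 0.63},
    {"q4_k", 0.68},
    {"q3_k", 0.73},
    {"q2_k", 0.78},
};

// Return the first (highest-quality) type whose reduction meets the target.
std::string pick_quant_type(double min_reduction) {
    for (const auto & opt : kOptions) {
        if (opt.size_reduction >= min_reduction) {
            return opt.name;
        }
    }
    return kOptions.back().name;  // fall back to the smallest type
}
```

For example, a target of at least 60% size reduction selects q5_k, while an unreachable target falls back to q2_k.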
## Reasoning
The k-quant types introduced in GGML represent a significant advancement over the original uniform quantization:
- Importance weighting: K-quant types use per-tensor importance matrices (when available) to allocate more bits to important weights. This is why q4_k outperforms q4_0 despite the same average bit width.
- Mixed precision: K-quant blocks use different precision for different components within each block, optimizing the quality/size ratio.
- Hardware compatibility: All quantization types have optimized SIMD implementations for x86 (AVX2/AVX-512), ARM (NEON), and other architectures.
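The contrast with the legacy family is easiest to see concretely. The sketch below is a self-contained, simplified rendition of uniform per-block round-to-nearest quantization in the style of q8_0 (one scale per 32-element block, int8 values); it is not GGML's actual code or memory layout (GGML stores the scale as fp16, for instance). Because every weight in a block shares one scale, a single outlier coarsens the whole block, which is the weakness that importance-weighted k-quants address.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

// Illustrative sketch of legacy-style uniform quantization (q8_0-like):
// each 32-element block stores one float scale and 32 int8 values.
constexpr int kBlockSize = 32;

struct BlockQ8 {
    float  scale;            // per-block scale factor
    int8_t qs[kBlockSize];   // quantized values in [-127, 127]
};

// Uniform round-to-nearest: one scale for the whole block, no
// importance weighting, no mixed precision.
BlockQ8 quantize_block(const float * x) {
    float amax = 0.0f;
    for (int i = 0; i < kBlockSize; ++i) {
        amax = std::max(amax, std::fabs(x[i]));
    }
    BlockQ8 b;
    b.scale = amax / 127.0f;
    const float inv = (b.scale != 0.0f) ? 1.0f / b.scale : 0.0f;
    for (int i = 0; i < kBlockSize; ++i) {
        b.qs[i] = static_cast<int8_t>(std::lround(x[i] * inv));
    }
    return b;
}

// Dequantize back to floats to inspect the rounding error, which is
// bounded by half the block scale.
std::vector<float> dequantize_block(const BlockQ8 & b) {
    std::vector<float> out(kBlockSize);
    for (int i = 0; i < kBlockSize; ++i) {
        out[i] = b.scale * b.qs[i];
    }
    return out;
}
```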
## Code Evidence
Supported quantization types from `examples/common-ggml.cpp:6-17`:
```cpp
static const std::map<std::string, enum ggml_ftype> GGML_FTYPE_MAP = {
    {"q4_0", GGML_FTYPE_MOSTLY_Q4_0},
    {"q4_1", GGML_FTYPE_MOSTLY_Q4_1},
    {"q5_0", GGML_FTYPE_MOSTLY_Q5_0},
    {"q5_1", GGML_FTYPE_MOSTLY_Q5_1},
    {"q8_0", GGML_FTYPE_MOSTLY_Q8_0},
    {"q2_k", GGML_FTYPE_MOSTLY_Q2_K},
    {"q3_k", GGML_FTYPE_MOSTLY_Q3_K},
    {"q4_k", GGML_FTYPE_MOSTLY_Q4_K},
    {"q5_k", GGML_FTYPE_MOSTLY_Q5_K},
    {"q6_k", GGML_FTYPE_MOSTLY_Q6_K},
};
```
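A minimal sketch of how such a name-to-type map is typically consulted when parsing a user-supplied type string. The enum values below are stand-ins for illustration, not the real `ggml_ftype` definitions from `ggml.h`:

```cpp
#include <cassert>
#include <map>
#include <optional>
#include <string>

// Stand-in enum for illustration; the real values live in ggml.h.
enum class FType { Q4_0, Q4_K, Q5_K, Q8_0 };

static const std::map<std::string, FType> kFtypeMap = {
    {"q4_0", FType::Q4_0},
    {"q4_k", FType::Q4_K},
    {"q5_k", FType::Q5_K},
    {"q8_0", FType::Q8_0},
};

// Look up a user-supplied type string; std::nullopt signals an unknown
// name so the caller can report the valid choices instead of crashing.
std::optional<FType> parse_ftype(const std::string & name) {
    auto it = kFtypeMap.find(name);
    if (it == kFtypeMap.end()) {
        return std::nullopt;
    }
    return it->second;
}
```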
Quantization dispatch handling from `src/ggml.c` (ggml_quantize_chunk function):
```c
// The function dispatches to type-specific quantization functions
// based on the target type, supporting both standard and
// importance-matrix-weighted quantization.
size_t ggml_quantize_chunk(
    enum ggml_type type,
    const float  * src,
    void         * dst,
    int64_t        start,
    int64_t        nrows,
    int64_t        n_per_row,
    const float  * imatrix);
```
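The dispatch pattern can be illustrated with a self-contained sketch: a switch on the target type routes the chunk to a per-type quantizer and returns the number of bytes written. The types, quantizer bodies, and byte counts below are toy stand-ins to show the structure, not GGML's real implementations or block sizes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Toy stand-ins illustrating the dispatch structure of a
// ggml_quantize_chunk-style function; not GGML's real code.
enum class QType { Q4_K, Q8_0 };

// Hypothetical per-type quantizers: return bytes that would be written
// for n input values. Real implementations pack blocks into dst.
static size_t quantize_q8_0(const float * src, void * dst, int64_t n) {
    (void)src; (void)dst;
    return static_cast<size_t>(n) + (n / 32) * sizeof(float);  // int8 + scales
}
static size_t quantize_q4_k(const float * src, void * dst, int64_t n) {
    (void)src; (void)dst;
    return static_cast<size_t>(n) / 2 + (n / 256) * 16;  // 4-bit + metadata
}

// Route to the type-specific implementation, mirroring how the real
// dispatcher selects an implementation from the target type.
size_t quantize_chunk(QType type, const float * src, void * dst, int64_t n) {
    switch (type) {
        case QType::Q8_0: return quantize_q8_0(src, dst, n);
        case QType::Q4_K: return quantize_q4_k(src, dst, n);
    }
    return 0;  // unreachable with a valid type
}
```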