Model Quantization
Model Quantization is the principle of reducing model size and inference cost by converting floating-point weights to lower-bit integer representations. Rather than storing each weight as a 32-bit or 16-bit float, quantization maps continuous float values to a smaller set of discrete integer levels, dramatically shrinking memory footprint and accelerating computation.
Theory
Quantization maps continuous floating-point values to discrete integer levels. A scale factor (and optionally a zero-point or minimum value) is stored alongside each block of quantized integers, allowing approximate reconstruction of the original floats during inference. The core trade-off: fewer bits yield a smaller model and faster inference, at the cost of reduced accuracy.
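The round trip described above can be sketched in a few lines. This is a simplified illustration of symmetric block quantization (a single scale, no zero-point), not GGML's actual implementation; the 4-bit range [-8, 7] and the `levels = 7` divisor are assumptions chosen for the sketch.

```python
import numpy as np

def quantize_symmetric(block, levels=7):
    # Symmetric quantization: map floats to integers in [-levels-1, levels]
    # using a single scale derived from the block's largest magnitude.
    amax = float(np.max(np.abs(block)))
    scale = amax / levels if amax > 0 else 1.0
    q = np.clip(np.round(block / scale), -levels - 1, levels).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    # Approximate reconstruction: integers times the stored scale.
    return q.astype(np.float32) * scale

weights = np.array([0.12, -0.5, 0.33, 0.7], dtype=np.float32)
q, scale = quantize_symmetric(weights)
restored = dequantize(q, scale)
# restored approximates weights; the error per value is bounded by scale / 2
```

Note that only the int8 array and one scale need to be stored; the original floats are never materialized again except transiently during inference.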
Supported Quantization Types
GGML supports a wide range of quantization formats:
- Q4_0 -- 4-bit quantization with 32-element blocks and a single scale factor per block
- Q4_1 -- 4-bit quantization with scale + minimum value per block
- Q5_0 -- 5-bit quantization with a single scale factor per block (symmetric)
- Q5_1 -- 5-bit quantization with scale + minimum
- Q8_0 -- 8-bit quantization with single scale factor
- Q2_K through Q6_K -- k-quant family using super-blocks for improved accuracy at each bit width
- IQ types -- importance-weighted quantization that allocates bits non-uniformly based on weight significance
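The per-block metadata explains why effective sizes are slightly larger than the nominal bit width. Assuming a 32-element block and a 16-bit (fp16) scale, as in the classic Q4_0/Q4_1/Q8_0 layouts, the amortized cost works out as follows:

```python
def bits_per_weight(block_size, bits, scale_bits, extra_bits=0):
    # Effective storage cost per weight: payload bits plus the per-block
    # metadata (scale, and optionally a minimum value) amortized over
    # every weight in the block.
    return (block_size * bits + scale_bits + extra_bits) / block_size

q4_0 = bits_per_weight(32, 4, 16)      # 4.5 bits per weight
q4_1 = bits_per_weight(32, 4, 16, 16)  # 5.0 (scale + minimum, both fp16)
q8_0 = bits_per_weight(32, 8, 16)      # 8.5 bits per weight
```

The k-quant and IQ formats use larger super-blocks with their own nested metadata, so their effective bits per weight follow the same arithmetic but with more terms.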
Block Quantization
Weights are grouped into fixed-size blocks (commonly 32 or 256 elements). Each block stores its own scale factor (and optionally a minimum or zero-point), allowing per-block calibration that preserves local value distributions more faithfully than a single global scale.
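The benefit of per-block scales shows up when magnitudes vary across a tensor. The sketch below (illustrative, with an assumed block size of 32 and a 4-bit range) quantizes a tensor whose two halves have very different scales, once per block and once with a single global scale:

```python
import numpy as np

def quantize_blocks(x, block_size=32, levels=7):
    # Per-block symmetric quantization: each block gets its own scale,
    # so quantization step size adapts to the local value range.
    blocks = x.reshape(-1, block_size)
    scales = np.max(np.abs(blocks), axis=1, keepdims=True) / levels
    scales[scales == 0] = 1.0
    q = np.clip(np.round(blocks / scales), -levels - 1, levels)
    return q.astype(np.int8), scales

def dequantize_blocks(q, scales):
    return (q * scales).reshape(-1).astype(np.float32)

rng = np.random.default_rng(0)
# A tensor with a small-magnitude region and a large-magnitude region.
x = np.concatenate([rng.normal(0, 0.01, 256),
                    rng.normal(0, 1.0, 256)]).astype(np.float32)

q, scales = quantize_blocks(x)
per_block_err = float(np.mean((x - dequantize_blocks(q, scales)) ** 2))

# Baseline: one global scale for the whole tensor.
scale_g = float(np.max(np.abs(x))) / 7
xg = np.clip(np.round(x / scale_g), -8, 7) * scale_g
global_err = float(np.mean((x - xg) ** 2))
# per_block_err is much smaller: the small-magnitude half gets a fine
# step size instead of being flattened by the global scale.
```

This is the local-calibration effect the paragraph describes: the global scale is dictated by the largest values anywhere in the tensor, while per-block scales track each block's own distribution.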
Selective Quantization
Only 2D tensors (weight matrices) are quantized. 1D tensors such as biases and layer normalization parameters remain in their original floating-point precision, since these are small relative to weight matrices and more sensitive to precision loss.
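The selection rule reduces to a dimensionality check. A minimal sketch (the tensor names below are illustrative, not actual GGUF tensor names):

```python
import numpy as np

def should_quantize(tensor):
    # Quantize only 2D weight matrices; 1D tensors (biases, layer-norm
    # parameters) stay in their original floating-point precision.
    return tensor.ndim == 2

tensors = {
    "attn.weight": np.zeros((4096, 4096), dtype=np.float32),  # quantized
    "attn.bias":   np.zeros(4096, dtype=np.float32),          # kept as float
    "ln.gamma":    np.zeros(4096, dtype=np.float32),          # kept as float
}
quantized = {name for name, t in tensors.items() if should_quantize(t)}
# quantized == {"attn.weight"}
```

The cost asymmetry justifies the rule: the 4096x4096 matrix above holds over sixteen million values, while each 1D tensor holds only 4096, so leaving the 1D tensors in full precision costs almost nothing.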
Importance Matrix
An optional importance matrix can guide quantization to be quality-aware. Not all weights contribute equally to model output; the importance matrix encodes which values matter more, allowing the quantizer to allocate representational precision where it has the greatest impact on output quality.
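One simple way importance weights can steer a quantizer is in scale selection: instead of taking the scale directly from the block's maximum, search for the scale that minimizes importance-weighted reconstruction error. The sketch below is an assumed illustration of that idea, not GGML's imatrix algorithm; the candidate range and count are arbitrary choices.

```python
import numpy as np

def weighted_err(block, importance, scale, levels=7):
    # Importance-weighted squared reconstruction error for a given scale.
    q = np.clip(np.round(block / scale), -levels - 1, levels)
    return float(np.sum(importance * (block - q * scale) ** 2))

def best_scale(block, importance, levels=7, candidates=64):
    # Grid-search scales around the naive max-based choice, keeping the
    # one with the lowest importance-weighted error. `importance` would
    # come from activation statistics on calibration data.
    naive = float(np.max(np.abs(block))) / levels
    best, best_err = naive, weighted_err(block, importance, naive, levels)
    for s in np.linspace(0.5 * naive, 1.2 * naive, candidates):
        err = weighted_err(block, importance, s, levels)
        if err < best_err:
            best, best_err = s, err
    return best

rng = np.random.default_rng(1)
block = rng.normal(0.0, 1.0, 32).astype(np.float32)
imp = rng.uniform(0.1, 1.0, 32).astype(np.float32)
naive = float(np.max(np.abs(block))) / 7
tuned = best_scale(block, imp)
# weighted_err(block, imp, tuned) <= weighted_err(block, imp, naive)
```

By construction the tuned scale is never worse than the naive one under the weighted metric; the gain comes from letting high-importance weights dominate the error being minimized.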