Model Quantization
Model Quantization is the principle of reducing model size and inference cost by converting floating-point weights to lower-bit integer representations. Rather than storing each weight as a 32-bit or 16-bit float, quantization maps continuous float values to a smaller set of discrete integer levels, dramatically shrinking memory footprint and accelerating computation.
Theory
Quantization maps continuous floating-point values to discrete integer levels. A scale factor (and optionally a zero-point or minimum value) is stored alongside each block of quantized integers, allowing approximate reconstruction of the original floats during inference. The core trade-off: fewer bits yield a smaller model and faster inference, at the cost of reduced accuracy.
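The round trip described above can be sketched in a few lines. This is a simplified illustration of symmetric block quantization (a single scale, no zero-point), not GGML's actual implementation; the 4-bit range [-8, 7] and the `levels = 7` divisor are assumptions chosen for the sketch.

```python
import numpy as np

def quantize_symmetric(block, levels=7):
    # Symmetric quantization: map floats to integers in [-levels-1, levels]
    # using a single scale derived from the block's largest magnitude.
    amax = float(np.max(np.abs(block)))
    scale = amax / levels if amax > 0 else 1.0
    q = np.clip(np.round(block / scale), -levels - 1, levels).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    # Approximate reconstruction: integers times the stored scale.
    return q.astype(np.float32) * scale

weights = np.array([0.12, -0.5, 0.33, 0.7], dtype=np.float32)
q, scale = quantize_symmetric(weights)
restored = dequantize(q, scale)
# restored approximates weights; the error per value is bounded by scale / 2
```

Note that only the int8 array and one scale need to be stored; the original floats are never materialized again except transiently during inference.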
Supported Quantization Types
GGML supports a wide range of quantization formats:
- Q4_0 -- 4-bit quantization with 32-element blocks and a single scale factor per block
- Q4_1 -- 4-bit quantization with scale + minimum value per block
- Q5_0 -- 5-bit quantization with a single scale factor per block (symmetric)
- Q5_1 -- 5-bit quantization with scale + minimum
- Q8_0 -- 8-bit quantization with single scale factor
- Q2_K through Q6_K -- k-quant family using super-blocks for improved accuracy at each bit width
- IQ types -- importance-weighted quantization that allocates bits non-uniformly based on weight significance
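The per-block metadata explains why effective sizes are slightly larger than the nominal bit width. Assuming a 32-element block and a 16-bit (fp16) scale, as in the classic Q4_0/Q4_1/Q8_0 layouts, the amortized cost works out as follows:

```python
def bits_per_weight(block_size, bits, scale_bits, extra_bits=0):
    # Effective storage cost per weight: payload bits plus the per-block
    # metadata (scale, and optionally a minimum value) amortized over
    # every weight in the block.
    return (block_size * bits + scale_bits + extra_bits) / block_size

q4_0 = bits_per_weight(32, 4, 16)      # 4.5 bits per weight
q4_1 = bits_per_weight(32, 4, 16, 16)  # 5.0 (scale + minimum, both fp16)
q8_0 = bits_per_weight(32, 8, 16)      # 8.5 bits per weight
```

The k-quant and IQ formats use larger super-blocks with their own nested metadata, so their effective bits per weight follow the same arithmetic but with more terms.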
Block Quantization
Weights are grouped into fixed-size blocks (commonly 32 or 256 elements). Each block stores its own scale factor (and optionally a minimum or zero-point), allowing per-block calibration that preserves local value distributions more faithfully than a single global scale.
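The benefit of per-block scales shows up when magnitudes vary across a tensor. The sketch below (illustrative, with an assumed block size of 32 and a 4-bit range) quantizes a tensor whose two halves have very different scales, once per block and once with a single global scale:

```python
import numpy as np

def quantize_blocks(x, block_size=32, levels=7):
    # Per-block symmetric quantization: each block gets its own scale,
    # so quantization step size adapts to the local value range.
    blocks = x.reshape(-1, block_size)
    scales = np.max(np.abs(blocks), axis=1, keepdims=True) / levels
    scales[scales == 0] = 1.0
    q = np.clip(np.round(blocks / scales), -levels - 1, levels)
    return q.astype(np.int8), scales

def dequantize_blocks(q, scales):
    return (q * scales).reshape(-1).astype(np.float32)

rng = np.random.default_rng(0)
# A tensor with a small-magnitude region and a large-magnitude region.
x = np.concatenate([rng.normal(0, 0.01, 256),
                    rng.normal(0, 1.0, 256)]).astype(np.float32)

q, scales = quantize_blocks(x)
per_block_err = float(np.mean((x - dequantize_blocks(q, scales)) ** 2))

# Baseline: one global scale for the whole tensor.
scale_g = float(np.max(np.abs(x))) / 7
xg = np.clip(np.round(x / scale_g), -8, 7) * scale_g
global_err = float(np.mean((x - xg) ** 2))
# per_block_err is much smaller: the small-magnitude half gets a fine
# step size instead of being flattened by the global scale.
```

This is the local-calibration effect the paragraph describes: the global scale is dictated by the largest values anywhere in the tensor, while per-block scales track each block's own distribution.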
Selective Quantization
Only 2D tensors (weight matrices) are quantized. 1D tensors such as biases and layer normalization parameters remain in their original floating-point precision, since these are small relative to weight matrices and more sensitive to precision loss.
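The selection rule reduces to a dimensionality check. A minimal sketch (the tensor names below are illustrative, not actual GGUF tensor names):

```python
import numpy as np

def should_quantize(tensor):
    # Quantize only 2D weight matrices; 1D tensors (biases, layer-norm
    # parameters) stay in their original floating-point precision.
    return tensor.ndim == 2

tensors = {
    "attn.weight": np.zeros((4096, 4096), dtype=np.float32),  # quantized
    "attn.bias":   np.zeros(4096, dtype=np.float32),          # kept as float
    "ln.gamma":    np.zeros(4096, dtype=np.float32),          # kept as float
}
quantized = {name for name, t in tensors.items() if should_quantize(t)}
# quantized == {"attn.weight"}
```

The cost asymmetry justifies the rule: the 4096x4096 matrix above holds over sixteen million values, while each 1D tensor holds only 4096, so leaving the 1D tensors in full precision costs almost nothing.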
Importance Matrix
An optional importance matrix can guide quantization to be quality-aware. Not all weights contribute equally to model output; the importance matrix encodes which values matter more, allowing the quantizer to allocate representational precision where it has the greatest impact on output quality.
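One simple way importance weights can steer a quantizer is in scale selection: instead of taking the scale directly from the block's maximum, search for the scale that minimizes importance-weighted reconstruction error. The sketch below is an assumed illustration of that idea, not GGML's imatrix algorithm; the candidate range and count are arbitrary choices.

```python
import numpy as np

def weighted_err(block, importance, scale, levels=7):
    # Importance-weighted squared reconstruction error for a given scale.
    q = np.clip(np.round(block / scale), -levels - 1, levels)
    return float(np.sum(importance * (block - q * scale) ** 2))

def best_scale(block, importance, levels=7, candidates=64):
    # Grid-search scales around the naive max-based choice, keeping the
    # one with the lowest importance-weighted error. `importance` would
    # come from activation statistics on calibration data.
    naive = float(np.max(np.abs(block))) / levels
    best, best_err = naive, weighted_err(block, importance, naive, levels)
    for s in np.linspace(0.5 * naive, 1.2 * naive, candidates):
        err = weighted_err(block, importance, s, levels)
        if err < best_err:
            best, best_err = s, err
    return best

rng = np.random.default_rng(1)
block = rng.normal(0.0, 1.0, 32).astype(np.float32)
imp = rng.uniform(0.1, 1.0, 32).astype(np.float32)
naive = float(np.max(np.abs(block))) / 7
tuned = best_scale(block, imp)
# weighted_err(block, imp, tuned) <= weighted_err(block, imp, naive)
```

By construction the tuned scale is never worse than the naive one under the weighted metric; the gain comes from letting high-importance weights dominate the error being minimized.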