Principle:Ggml org Ggml Quantization API

Knowledge Sources	GGML
Domains	Quantization, API
Last Updated	2026-02-10

Overview

The Quantization API provides a unified function interface for quantizing and dequantizing tensor data across all supported quantization formats in GGML.

Description

GGML supports over 30 quantization formats, ranging from simple 4-bit and 8-bit block quantization (Q4_0, Q4_1, Q5_0, Q5_1, Q8_0) through the k-quant family (Q2_K through Q8_K) to importance-weighted quantization types (IQ1_S, IQ2_XXS, IQ3_XXS, IQ4_NL, etc.) and specialized formats like ternary quantization (TQ1_0, TQ2_0) and microscaling (MXFP4). The Quantization API provides a consistent, type-dispatched interface for converting between float32 data and any of these compressed representations.

The API is organized into three categories of functions, all declared in src/ggml-quants.h:

1. Reference Quantization (quantize_row_*_ref) -- These functions quantize a row of float32 values into a specific block format. They serve as the reference (scalar) implementation for each type and are used both as fallbacks and for correctness testing. Each function takes a float pointer, a typed block pointer, and an element count.

2. Dequantization (dequantize_row_*) -- These functions convert a row of quantized blocks back to float32. They are the inverse of the quantization functions and are used during inference when an operation requires float inputs but the weights are stored in a quantized format.

3. Importance-Aware Quantization (quantize_*) -- These higher-level functions accept an optional importance matrix that guides quantization to preserve accuracy for weights that have greater impact on model output. They operate on multi-row data and are used during model conversion and optimization.
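
The first two categories can be illustrated with a minimal, self-contained sketch. It does not use GGML's headers: the block layout block_demo, the constant QK_DEMO, and the function names quantize_row_demo_ref and dequantize_row_demo are hypothetical, chosen only to mirror the naming pattern described above. The scale is stored as a plain float to keep the example short, which is a simplification assumed here rather than a statement about GGML's block layouts.

```c
#include <assert.h>
#include <stdint.h>

#define QK_DEMO 32  /* elements per block (hypothetical constant) */

/* Hypothetical block layout: one scale plus 32 signed 8-bit values. */
typedef struct {
    float  d;            /* per-block scale */
    int8_t qs[QK_DEMO];  /* quantized values */
} block_demo;

/* Round half away from zero without depending on libm. */
static int demo_nearest_int(float v) {
    return (int) (v < 0.0f ? v - 0.5f : v + 0.5f);
}

/* Reference-style quantization: float row -> blocks.
   k is the element count and must be a multiple of QK_DEMO. */
void quantize_row_demo_ref(const float *x, block_demo *y, int64_t k) {
    assert(k % QK_DEMO == 0);
    const int64_t nb = k / QK_DEMO;
    for (int64_t i = 0; i < nb; i++) {
        float amax = 0.0f;  /* largest magnitude in this block */
        for (int j = 0; j < QK_DEMO; j++) {
            float a = x[i * QK_DEMO + j];
            if (a < 0.0f) a = -a;
            if (a > amax) amax = a;
        }
        const float d  = amax / 127.0f;
        const float id = d != 0.0f ? 1.0f / d : 0.0f;
        y[i].d = d;
        for (int j = 0; j < QK_DEMO; j++) {
            y[i].qs[j] = (int8_t) demo_nearest_int(x[i * QK_DEMO + j] * id);
        }
    }
}

/* Dequantization: blocks -> float row (inverse direction). */
void dequantize_row_demo(const block_demo *x, float *y, int64_t k) {
    assert(k % QK_DEMO == 0);
    const int64_t nb = k / QK_DEMO;
    for (int64_t i = 0; i < nb; i++) {
        for (int j = 0; j < QK_DEMO; j++) {
            y[i * QK_DEMO + j] = x[i].d * (float) x[i].qs[j];
        }
    }
}
```

Both functions follow the same argument shape as the categories above: a float pointer, a typed block pointer, and an element count. Each reconstructed value differs from its input by at most about half of that block's scale.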

The API is unified through the type traits system defined in ggml.h. The ggml_type_traits struct contains function pointers -- to_float (of type ggml_to_float_t) and from_float_ref (of type ggml_from_float_t) -- that are populated for each type. Code that needs to quantize or dequantize data without knowing the specific type at compile time can look up the traits via ggml_get_type_traits(type) and call through the function pointers. This is how the Python interop layer, the model quantization tools, and the CPU backend all perform type-generic quantization.
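
The dispatch pattern can be sketched with a small stand-in for the traits table. Everything prefixed demo_ is hypothetical and defined locally; the function-pointer shapes are only modeled on the to_float and from_float_ref members described above, and the exact GGML signatures should be taken from ggml.h rather than from this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Pointer shapes assumed for illustration (quantized data is passed as void *). */
typedef void (*demo_to_float_t)(const void *x, float *y, int64_t k);
typedef void (*demo_from_float_t)(const float *x, void *y, int64_t k);

enum demo_type { DEMO_TYPE_F32 = 0, DEMO_TYPE_I8 = 1, DEMO_TYPE_COUNT = 2 };

struct demo_type_traits {
    const char       *type_name;
    size_t            type_size;      /* bytes per stored element in this toy table */
    demo_to_float_t   to_float;       /* stored format -> float32 */
    demo_from_float_t from_float_ref; /* float32 -> stored format */
};

/* F32: both directions are a plain copy. */
static void f32_to_float(const void *x, float *y, int64_t k) {
    const float *src = (const float *) x;
    for (int64_t i = 0; i < k; i++) y[i] = src[i];
}
static void f32_from_float(const float *x, void *y, int64_t k) {
    float *dst = (float *) y;
    for (int64_t i = 0; i < k; i++) dst[i] = x[i];
}

/* Toy 8-bit format with a fixed scale of 1/16, clamped to [-127, 127]. */
static void i8_to_float(const void *x, float *y, int64_t k) {
    const int8_t *src = (const int8_t *) x;
    for (int64_t i = 0; i < k; i++) y[i] = (float) src[i] / 16.0f;
}
static void i8_from_float(const float *x, void *y, int64_t k) {
    int8_t *dst = (int8_t *) y;
    for (int64_t i = 0; i < k; i++) {
        float v = x[i] * 16.0f;
        v = v < 0.0f ? v - 0.5f : v + 0.5f;  /* round half away from zero */
        if (v >  127.0f) v =  127.0f;
        if (v < -127.0f) v = -127.0f;
        dst[i] = (int8_t) v;
    }
}

/* One entry per type, in enum order. */
static const struct demo_type_traits demo_traits[DEMO_TYPE_COUNT] = {
    { "f32", sizeof(float),  f32_to_float, f32_from_float },  /* DEMO_TYPE_F32 */
    { "i8",  sizeof(int8_t), i8_to_float,  i8_from_float  },  /* DEMO_TYPE_I8  */
};

const struct demo_type_traits *demo_get_type_traits(enum demo_type type) {
    assert((int) type >= 0 && (int) type < DEMO_TYPE_COUNT);
    return &demo_traits[type];
}

/* Type-generic round trip: the caller never names a concrete format. */
void demo_round_trip(enum demo_type type, const float *src,
                     void *scratch, float *dst, int64_t k) {
    const struct demo_type_traits *tt = demo_get_type_traits(type);
    tt->from_float_ref(src, scratch, k);
    tt->to_float(scratch, dst, k);
}
```

In this sketch, supporting another format means adding one enum value and one table entry; demo_round_trip and any other caller that goes through the table stay unchanged.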

Usage

Apply this principle when implementing new quantization formats, building model conversion tools, or writing backend code that must handle arbitrary tensor types. To add a new quantization format, implement the quantize_row_*_ref and dequantize_row_* functions in ggml-quants.c, register the function pointers in the type traits table, and optionally provide an importance-aware quantize_* variant. Consumers of the API should use the type traits system rather than calling specific functions directly, to remain compatible with all current and future quantization types.
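
As a rough illustration of the first step, the following sketch implements a quantize/dequantize pair for a made-up 4-bit format with a per-block scale and minimum, plus a round-trip check of the kind one might run before registering the functions. The format name "new", the block_new layout, the nibble packing order, and the helper round_trip_max_error are all assumptions for this example and do not correspond to an existing GGML type.

```c
#include <assert.h>
#include <stdint.h>

#define QK_NEW 32  /* elements per block (hypothetical) */

typedef struct {
    float   d;               /* scale (step size) */
    float   m;               /* block minimum */
    uint8_t qs[QK_NEW / 2];  /* two 4-bit values per byte */
} block_new;

void quantize_row_new_ref(const float *x, block_new *y, int64_t k) {
    assert(k % QK_NEW == 0);
    for (int64_t i = 0; i < k / QK_NEW; i++) {
        const float *xb = x + i * QK_NEW;
        float lo = xb[0], hi = xb[0];
        for (int j = 1; j < QK_NEW; j++) {
            if (xb[j] < lo) lo = xb[j];
            if (xb[j] > hi) hi = xb[j];
        }
        const float d  = (hi - lo) / 15.0f;  /* 16 levels: 0..15 */
        const float id = d != 0.0f ? 1.0f / d : 0.0f;
        y[i].d = d;
        y[i].m = lo;
        for (int j = 0; j < QK_NEW / 2; j++) {
            /* (v - lo) * id is non-negative, so +0.5 then truncation rounds to nearest */
            int q0 = (int) ((xb[j]              - lo) * id + 0.5f);
            int q1 = (int) ((xb[j + QK_NEW / 2] - lo) * id + 0.5f);
            if (q0 > 15) q0 = 15;
            if (q1 > 15) q1 = 15;
            y[i].qs[j] = (uint8_t) (q0 | (q1 << 4));  /* low nibble, high nibble */
        }
    }
}

void dequantize_row_new(const block_new *x, float *y, int64_t k) {
    assert(k % QK_NEW == 0);
    for (int64_t i = 0; i < k / QK_NEW; i++) {
        for (int j = 0; j < QK_NEW / 2; j++) {
            y[i * QK_NEW + j]              = x[i].d * (float) (x[i].qs[j] & 0x0F) + x[i].m;
            y[i * QK_NEW + j + QK_NEW / 2] = x[i].d * (float) (x[i].qs[j] >> 4)   + x[i].m;
        }
    }
}

/* Round-trip check: largest absolute reconstruction error over one row. */
float round_trip_max_error(const float *x, block_new *q, float *tmp, int64_t k) {
    quantize_row_new_ref(x, q, k);
    dequantize_row_new(q, tmp, k);
    float worst = 0.0f;
    for (int64_t i = 0; i < k; i++) {
        float e = tmp[i] - x[i];
        if (e < 0.0f) e = -e;
        if (e > worst) worst = e;
    }
    return worst;
}
```

For this layout the worst-case error should stay near half of the block's step size d, which gives a simple pass/fail criterion for the pair before it is wired into the traits table.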

Theoretical Basis

The Quantization API is grounded in several design principles:

Block Quantization -- All GGML quantization formats operate on fixed-size blocks (commonly 32 or 256 elements). Each block stores a scale factor (and optionally a minimum or zero-point) alongside the quantized integer values. This per-block calibration preserves local value distributions more faithfully than a single global scale, achieving a favorable trade-off between compression ratio and accuracy.
Type-Dispatched Function Pointers -- The ggml_type_traits system provides a uniform dispatch mechanism analogous to a virtual method table. Any code path that needs to convert between float and quantized representations can do so through a single interface, regardless of the specific quantization format. This eliminates large switch statements and makes the system extensible: adding a new type requires only adding entries to the traits table, not modifying every call site.
Separation of Reference and Optimized Implementations -- The _ref suffix on reference quantization functions distinguishes them from optimized variants. The CPU backend provides SIMD-accelerated implementations (using AVX2, AVX-512, NEON, etc.) that override the reference functions for performance-critical paths. The reference implementations remain available for correctness testing and for platforms without SIMD support.
Importance-Weighted Quantization -- The quantize_* functions optionally accept an importance matrix, a set of per-weight importance values. The number of bits per weight is fixed by the format, so the importance values do not change bit allocation; instead they weight the quantization error, so that scales and quantized values are chosen to be more accurate for the weights that contribute more to model output. The importance matrix is typically computed from calibration data by collecting activation statistics during inference on representative inputs.
Symmetric API Design -- Every quantization function has a corresponding dequantization function, and the type traits table stores both directions. This symmetry ensures that any data that can be quantized can also be dequantized, which is essential for the Python interop layer, debugging tools, and any pipeline that needs to inspect or transform quantized weights.
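
The importance-weighted idea from the list above can be sketched as a weighted error minimization over candidate block scales. This is a simplified illustration under stated assumptions, not GGML's actual search: the names QK_W, weighted_block_error, and pick_scale_weighted are hypothetical, the candidate grid (plus or minus 40% around the max-based scale) is arbitrary, and GGML's own routines are format-specific and more elaborate.

```c
#include <assert.h>
#include <stdint.h>

#define QK_W 32  /* elements per block (hypothetical) */

/* Round half away from zero, then clamp to [-nmax, nmax]. */
static int nearest_int_clamped(float v, int nmax) {
    int q = (int) (v < 0.0f ? v - 0.5f : v + 0.5f);
    if (q >  nmax) q =  nmax;
    if (q < -nmax) q = -nmax;
    return q;
}

/* Weighted squared error of quantizing one block with scale d:
   sum over j of w[j] * (x[j] - d * q[j])^2 */
float weighted_block_error(const float *x, const float *w, float d, int nmax) {
    const float id = d != 0.0f ? 1.0f / d : 0.0f;
    float err = 0.0f;
    for (int j = 0; j < QK_W; j++) {
        const int   q    = nearest_int_clamped(x[j] * id, nmax);
        const float diff = x[j] - d * (float) q;
        err += w[j] * diff * diff;
    }
    return err;
}

/* Try a few candidate scales around amax / nmax and keep the one with the
   lowest weighted error. With uniform weights this is an ordinary
   squared-error search. */
float pick_scale_weighted(const float *x, const float *w, int nmax) {
    assert(nmax > 0);
    float amax = 0.0f;
    for (int j = 0; j < QK_W; j++) {
        float a = x[j] < 0.0f ? -x[j] : x[j];
        if (a > amax) amax = a;
    }
    if (amax == 0.0f) return 0.0f;  /* all-zero block */

    const float base     = amax / (float) nmax;
    float       best_d   = base;
    float       best_err = weighted_block_error(x, w, base, nmax);
    for (int step = -8; step <= 8; step++) {
        const float d   = base * (1.0f + 0.05f * (float) step);
        const float err = weighted_block_error(x, w, d, nmax);
        if (err < best_err) {
            best_err = err;
            best_d   = d;
        }
    }
    return best_d;
}
```

The bit width (here set by nmax) never changes; the weights only steer which scale is chosen, so a large but unimportant outlier can be clipped when that lowers the error on the weights that matter.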
