Implementation:Ggml org Ggml Quants api

**Implementation Metadata**
File Name	`src/ggml-quants.h`
Repository	ggml-org/ggml
Lines	106
Language	C
Domain Tags	ML_Infrastructure, Quantization, API_Design
Status	Active
Last Updated	2025-05-15 12:00 GMT
Knowledge Sources	ggml-org/ggml repository

Overview

src/ggml-quants.h is the header declaring all quantization, dequantization, and importance-matrix quantization function prototypes for GGML's supported quantized formats. It serves as the public contract between the quantization implementation (ggml-quants.c) and its consumers (the CPU backend, model loading/saving code).

Description

The file declares three categories of functions:

1. Reference quantization (quantize_row_*_ref) -- Deterministic float32-to-quantized-block conversion:

Standard formats: q4_0, q4_1, q5_0, q5_1, q8_0, q8_1
K-quant formats: q2_K, q3_K, q4_K, q5_K, q6_K, q8_K
Special formats: mxfp4, tq1_0, tq2_0
IQ formats: iq3_xxs, iq4_nl, iq4_xs, iq3_s, iq2_s

2. Dequantization (dequantize_row_*) -- Quantized-block-to-float32 conversion for all formats including the full IQ family (iq2_xxs, iq2_xs, iq2_s, iq3_xxs, iq1_s, iq1_m, iq4_nl, iq4_xs, iq3_s).

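3. Importance-matrix quantization (quantize_*) -- Higher-level functions that quantize nrows rows of n_per_row floats each and return the number of bytes written. They accept an imatrix parameter holding per-column importance weights (typically derived from activation statistics), which lets the quantizer spend its limited precision on the values that matter most and generally yields lower error at the same bit rate. This is activation-aware in spirit but is not the AWQ algorithm.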

The header also declares init/free functions for the iq2xs and iq3xs lookup tables. All functions are marked GGML_API so that they are exported for use by the CPU backend.
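To make the quantize/dequantize pairing concrete, here is a minimal sketch of a symmetric 4-bit block format. It is an illustration only and not ggml's implementation: the struct sketch_block_q4, the constant SKETCH_QK, and the sketch_* functions are hypothetical names, the scale is stored as a plain float, and the nibble packing order is chosen for simplicity rather than to match any real ggml block layout.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_QK 32 /* elements per block (assumed for this sketch) */

/* Hypothetical block layout: one float scale plus 32 packed 4-bit codes. */
typedef struct {
    float   d;                 /* per-block scale */
    uint8_t qs[SKETCH_QK / 2]; /* two 4-bit codes per byte */
} sketch_block_q4;

/* Round to the nearest integer without depending on libm. */
static int sketch_round(float v) {
    return (int)(v >= 0.0f ? v + 0.5f : v - 0.5f);
}

/* Clamp a code to the signed range [-7, 7]. */
static int sketch_clamp(int q) {
    return q < -7 ? -7 : (q > 7 ? 7 : q);
}

/* float32 -> blocks. k must be a multiple of SKETCH_QK. */
static void sketch_quantize_row(const float *x, sketch_block_q4 *y, int64_t k) {
    assert(k % SKETCH_QK == 0);
    for (int64_t i = 0; i < k / SKETCH_QK; i++) {
        const float *xb = x + i * SKETCH_QK;

        /* Find the largest magnitude in the block. */
        float amax = 0.0f;
        for (int j = 0; j < SKETCH_QK; j++) {
            const float a = xb[j] < 0.0f ? -xb[j] : xb[j];
            if (a > amax) amax = a;
        }

        /* Map [-amax, amax] onto integer codes in [-7, 7]. */
        const float d  = amax / 7.0f;
        const float id = d != 0.0f ? 1.0f / d : 0.0f;
        y[i].d = d;

        for (int j = 0; j < SKETCH_QK / 2; j++) {
            const int q0 = sketch_clamp(sketch_round(xb[2 * j]     * id));
            const int q1 = sketch_clamp(sketch_round(xb[2 * j + 1] * id));
            /* Store codes with an offset of 8 so each fits in an unsigned nibble. */
            y[i].qs[j] = (uint8_t)((q0 + 8) | ((q1 + 8) << 4));
        }
    }
}

/* blocks -> float32. k must be a multiple of SKETCH_QK. */
static void sketch_dequantize_row(const sketch_block_q4 *x, float *y, int64_t k) {
    assert(k % SKETCH_QK == 0);
    for (int64_t i = 0; i < k / SKETCH_QK; i++) {
        for (int j = 0; j < SKETCH_QK / 2; j++) {
            y[i * SKETCH_QK + 2 * j]     = x[i].d * (float)((x[i].qs[j] & 0x0F) - 8);
            y[i * SKETCH_QK + 2 * j + 1] = x[i].d * (float)((x[i].qs[j] >> 4) - 8);
        }
    }
}
```

In this sketch each restored value differs from its input by at most half the block scale, which is the basic trade-off every block format in the header makes in some form.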

Usage

#include "ggml-quants.h"

// Quantize float data to Q4_0
float input[256] = { 0 }; // replace with real data
block_q4_0 output[256 / QK4_0];
quantize_row_q4_0_ref(input, output, 256);

// Dequantize back to float
float restored[256];
dequantize_row_q4_0(output, restored, 256);

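// Quantize one row of 256 floats through the higher-level entry point;
// the last argument is the importance matrix (NULL here, i.e. no weighting)
size_t bytes = quantize_q4_0(input, output, 1, 256, NULL);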

Code Reference

Source Location

Repository	File	Lines
ggml-org/ggml	`src/ggml-quants.h`	106

Key Signatures

// Reference quantization (float32 -> quantized block)
GGML_API void quantize_row_q4_0_ref(const float * GGML_RESTRICT x, block_q4_0 * GGML_RESTRICT y, int64_t k);
GGML_API void quantize_row_q8_0_ref(const float * GGML_RESTRICT x, block_q8_0 * GGML_RESTRICT y, int64_t k);
GGML_API void quantize_row_mxfp4_ref(const float * GGML_RESTRICT x, block_mxfp4 * GGML_RESTRICT y, int64_t k);

// Dequantization (quantized block -> float32)
GGML_API void dequantize_row_q4_0(const block_q4_0 * GGML_RESTRICT x, float * GGML_RESTRICT y, int64_t k);
GGML_API void dequantize_row_q8_0(const block_q8_0 * GGML_RESTRICT x, float * GGML_RESTRICT y, int64_t k);
GGML_API void dequantize_row_iq4_nl(const block_iq4_nl * GGML_RESTRICT x, float * GGML_RESTRICT y, int64_t k);

// Importance-matrix-aware quantization (returns the number of bytes written)
GGML_API size_t quantize_q4_0(const float * GGML_RESTRICT src, void * GGML_RESTRICT dst,
    int64_t nrows, int64_t n_per_row, const float * imatrix);
GGML_API size_t quantize_iq4_nl(const float * GGML_RESTRICT src, void * GGML_RESTRICT dst,
    int64_t nrows, int64_t n_per_row, const float * imatrix);

// Lookup table management for IQ formats
GGML_API void iq2xs_init_impl(enum ggml_type type);
GGML_API void iq2xs_free_impl(enum ggml_type type);

I/O Contract

Inputs

Float data -- Source float32 array for quantization
Quantized blocks -- Source quantized block array for dequantization
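k -- Number of float elements to process; must be a multiple of the format's block size (for example QK4_0 for q4_0)
imatrix -- Optional importance weights, one per column of a row (n_per_row floats); can be NULL for formats that do not require it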
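One way to see why an importance matrix changes the result: once the integer codes of a block are fixed, the scale that minimizes the weighted squared error has a closed form. The sketch below computes that scale. It is a generic weighted least-squares illustration under that assumption, not the search ggml actually performs, and sketch_weighted_scale is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>

/* Return the scale d that minimizes sum_j w[j] * (x[j] - d * q[j])^2
 * for fixed integer codes q. Passing w == NULL weights all elements equally. */
static float sketch_weighted_scale(const float *x, const int *q,
                                   const float *w, size_t n) {
    assert(x != NULL && q != NULL);
    float num = 0.0f; /* sum of w * x * q */
    float den = 0.0f; /* sum of w * q * q */
    for (size_t j = 0; j < n; j++) {
        const float wj = w ? w[j] : 1.0f;
        num += wj * x[j] * (float) q[j];
        den += wj * (float) q[j] * (float) q[j];
    }
    return den > 0.0f ? num / den : 0.0f;
}
```

With x = {1, 4} and codes q = {1, 2}, weighting only the first element gives a scale of 1, weighting only the second gives 2, and equal weights land in between. The weights decide which elements the block reproduces most faithfully.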

Outputs

Quantized blocks -- Packed quantized representation
Float data -- Dequantized float32 values
Size -- Number of bytes written by importance-matrix quantization functions
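The byte count follows directly from the block geometry: a row of n_per_row elements occupies (n_per_row / block size) blocks, and the total is that row size times nrows. A minimal sketch of the arithmetic, useful for sizing a destination buffer, is shown here. The helper names are hypothetical; in real code the per-block byte size should come from sizeof on the block type rather than a hard-coded number.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bytes needed for one quantized row.
 * block_elems: elements per block, block_bytes: bytes per block. */
static size_t sketch_row_size(int64_t n_per_row, int64_t block_elems,
                              size_t block_bytes) {
    assert(block_elems > 0 && n_per_row % block_elems == 0);
    return (size_t)(n_per_row / block_elems) * block_bytes;
}

/* Bytes needed for nrows quantized rows. */
static size_t sketch_quantized_size(int64_t nrows, int64_t n_per_row,
                                    int64_t block_elems, size_t block_bytes) {
    return (size_t) nrows * sketch_row_size(n_per_row, block_elems, block_bytes);
}
```

For q4_0, where a block packs 32 elements into 18 bytes (a 16-bit scale plus 16 bytes of nibbles), sketch_row_size(256, 32, 18) gives 144 bytes for a 256-element row.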

Usage Examples

Quantization round-trip:

#include "ggml-quants.h"

// Quantize 256 floats to Q4_0 (4-bit quantization)
float src[256] = { /* ... */ };
block_q4_0 quantized[256 / QK4_0];
quantize_row_q4_0_ref(src, quantized, 256);

// Dequantize back to float
float dst[256];
dequantize_row_q4_0(quantized, dst, 256);

// Quantization guided by an importance matrix (one weight per column)
float imatrix[256] = { /* activation importance weights */ };
size_t nbytes = quantize_q4_0(src, quantized, 1, 256, imatrix);
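Because the round trip is lossy, it is often worth measuring how far the restored values drift from the originals. A small helper along these lines could be applied to the src and dst arrays above; sketch_max_abs_error is a hypothetical name and not part of ggml.

```c
#include <assert.h>
#include <stddef.h>

/* Largest absolute element-wise difference between two float arrays. */
static float sketch_max_abs_error(const float *a, const float *b, size_t n) {
    assert(a != NULL && b != NULL);
    float worst = 0.0f;
    for (size_t i = 0; i < n; i++) {
        float diff = a[i] - b[i];
        if (diff < 0.0f) diff = -diff;
        if (diff > worst) worst = diff;
    }
    return worst;
}
```

Comparing this figure across formats, or with and without an importance matrix, gives a quick sanity check before running a full model evaluation.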

Related Pages

Implements Principle

Principle:Ggml_org_Ggml_Backend_Interface

Related Implementations

Implementation:Ggml_org_Ggml_Ggml_quantize_chunk -- Higher-level quantization API
Implementation:Ggml_org_Ggml_Python_utils -- Python bindings for quantization
Implementation:Ggml_org_Ggml_Hexagon_matmul_ops -- Hexagon uses quantized formats
