Implementation:Ggml org Ggml Quants api
| File Name | src/ggml-quants.h |
| Repository | ggml-org/ggml |
| Lines | 106 |
| Language | C |
| Domain Tags | ML_Infrastructure, Quantization, API_Design |
| Status | Active |
| Last Updated | 2025-05-15 12:00 GMT |
| Knowledge Sources | ggml-org/ggml repository |
Overview
src/ggml-quants.h is the header declaring all quantization, dequantization, and importance-matrix quantization function prototypes for GGML's supported quantized formats. It serves as the public contract between the quantization implementation (ggml-quants.c) and its consumers (the CPU backend, model loading/saving code).
Description
The file declares three categories of functions:
1. Reference quantization (quantize_row_*_ref) -- Deterministic float32-to-quantized-block conversion:
- Standard formats: q4_0, q4_1, q5_0, q5_1, q8_0, q8_1
- K-quant formats: q2_K, q3_K, q4_K, q5_K, q6_K, q8_K
- Special formats: mxfp4, tq1_0, tq2_0
- IQ formats: iq3_xxs, iq4_nl, iq4_xs, iq3_s, iq2_s
2. Dequantization (dequantize_row_*) -- Quantized-block-to-float32 conversion for all formats including the full IQ family (iq2_xxs, iq2_xs, iq2_s, iq3_xxs, iq1_s, iq1_m, iq4_nl, iq4_xs, iq3_s).
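3. Importance-matrix quantization (quantize_*) -- Higher-level functions that quantize nrows rows of n_per_row elements each and accept an optional imatrix parameter. When an importance matrix is supplied, the quantization error is weighted by per-element importance (typically derived from activation statistics), which generally gives better quality at the same bit rate.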
The header also declares init/free functions for the iq2xs and iq3xs lookup tables. All functions are marked GGML_API so that they are exported for use by the CPU backend.
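To make the block-based conversion described above concrete, here is a minimal sketch of a 4-bit quantize/dequantize pair in plain C. It is an illustration under stated assumptions, not the code in ggml-quants.c: the names sketch_block_q4, sketch_quantize_row, sketch_dequantize_row, and SKETCH_QK are hypothetical, and the scale is kept as a float for simplicity, whereas the real block_q4_0 stores a 16-bit half float.

```c
#include <stdint.h>

#define SKETCH_QK 32  /* hypothetical block size, chosen to mirror QK4_0 */

/* Hypothetical simplified block: one float scale plus 32 packed 4-bit codes. */
typedef struct {
    float   d;                   /* per-block scale */
    uint8_t qs[SKETCH_QK / 2];   /* two 4-bit codes per byte */
} sketch_block_q4;

/* Quantize k floats; k is assumed to be a multiple of SKETCH_QK. */
static void sketch_quantize_row(const float *x, sketch_block_q4 *y, int64_t k) {
    const int64_t nb = k / SKETCH_QK;
    for (int64_t i = 0; i < nb; i++) {
        /* Find the (signed) value with the largest magnitude in this block. */
        float amax = 0.0f, max = 0.0f;
        for (int j = 0; j < SKETCH_QK; j++) {
            const float v = x[i * SKETCH_QK + j];
            const float a = v < 0.0f ? -v : v;
            if (a > amax) { amax = a; max = v; }
        }
        /* Map that value to code -8 so the signed 4-bit range [-8, 7] is used. */
        const float d  = max / -8.0f;
        const float id = d != 0.0f ? 1.0f / d : 0.0f;
        y[i].d = d;
        for (int j = 0; j < SKETCH_QK / 2; j++) {
            const float x0 = x[i * SKETCH_QK + j] * id;
            const float x1 = x[i * SKETCH_QK + SKETCH_QK / 2 + j] * id;
            /* Shift into [0, 15], round to nearest, and clamp the top end. */
            int q0 = (int)(x0 + 8.5f); if (q0 > 15) q0 = 15;
            int q1 = (int)(x1 + 8.5f); if (q1 > 15) q1 = 15;
            y[i].qs[j] = (uint8_t)(q0 | (q1 << 4));
        }
    }
}

/* Dequantize k floats by undoing the offset and multiplying by the scale. */
static void sketch_dequantize_row(const sketch_block_q4 *x, float *y, int64_t k) {
    const int64_t nb = k / SKETCH_QK;
    for (int64_t i = 0; i < nb; i++) {
        for (int j = 0; j < SKETCH_QK / 2; j++) {
            const int q0 = (x[i].qs[j] & 0x0F) - 8;
            const int q1 = (x[i].qs[j] >> 4) - 8;
            y[i * SKETCH_QK + j]                 = (float)q0 * x[i].d;
            y[i * SKETCH_QK + SKETCH_QK / 2 + j] = (float)q1 * x[i].d;
        }
    }
}
```

The sketch stores one scale per 32 values, so the round trip is lossy in general: only values that land exactly on a multiple of the block scale are reproduced exactly.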
Usage
#include "ggml-quants.h"
// Quantize float data to Q4_0
float input[256];
block_q4_0 output[256 / QK4_0];
quantize_row_q4_0_ref(input, output, 256);
// Dequantize back to float
float restored[256];
dequantize_row_q4_0(output, restored, 256);
// Quantize with importance matrix
size_t bytes = quantize_q4_0(src, dst, nrows, n_per_row, imatrix);
Code Reference
Source Location
| Repository | File | Lines |
|---|---|---|
| ggml-org/ggml | src/ggml-quants.h | 106 |
Key Signatures
// Reference quantization (float32 -> quantized block)
GGML_API void quantize_row_q4_0_ref(const float * GGML_RESTRICT x, block_q4_0 * GGML_RESTRICT y, int64_t k);
GGML_API void quantize_row_q8_0_ref(const float * GGML_RESTRICT x, block_q8_0 * GGML_RESTRICT y, int64_t k);
GGML_API void quantize_row_mxfp4_ref(const float * GGML_RESTRICT x, block_mxfp4 * GGML_RESTRICT y, int64_t k);
// Dequantization (quantized block -> float32)
GGML_API void dequantize_row_q4_0(const block_q4_0 * GGML_RESTRICT x, float * GGML_RESTRICT y, int64_t k);
GGML_API void dequantize_row_q8_0(const block_q8_0 * GGML_RESTRICT x, float * GGML_RESTRICT y, int64_t k);
GGML_API void dequantize_row_iq4_nl(const block_iq4_nl * GGML_RESTRICT x, float * GGML_RESTRICT y, int64_t k);
// Importance-matrix-aware quantization (imatrix may be NULL)
GGML_API size_t quantize_q4_0(const float * GGML_RESTRICT src, void * GGML_RESTRICT dst,
int64_t nrows, int64_t n_per_row, const float * imatrix);
GGML_API size_t quantize_iq4_nl(const float * GGML_RESTRICT src, void * GGML_RESTRICT dst,
int64_t nrows, int64_t n_per_row, const float * imatrix);
// Lookup table management for IQ formats
GGML_API void iq2xs_init_impl(enum ggml_type type);
GGML_API void iq2xs_free_impl(enum ggml_type type);
I/O Contract
Inputs
- Float data -- Source float32 array for quantization
- Quantized blocks -- Source quantized block array for dequantization
- k -- Number of elements to process (a multiple of the format's block size)
- imatrix -- Optional importance matrix used to weight quantization error (can be NULL)
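As a rough illustration of how importance weights can influence a block's scale, the sketch below computes the least-squares scale d that minimizes the weighted error sum of w[i] * (x[i] - d * q[i])^2 for a fixed set of integer codes q[i]. The helper name sketch_weighted_scale is hypothetical, and this is not claimed to be the routine used in ggml-quants.c; it only shows the general idea that elements with larger weights pull the scale toward reproducing them more accurately.

```c
#include <stddef.h>

/* Hypothetical helper: weighted least-squares scale for fixed codes q[i].
   Passing w == NULL uses uniform weights, mirroring a NULL imatrix. */
static float sketch_weighted_scale(const float *x, const int *q,
                                   const float *w, size_t n) {
    double num = 0.0, den = 0.0;
    for (size_t i = 0; i < n; i++) {
        const double wi = w ? (double)w[i] : 1.0;
        num += wi * (double)x[i] * (double)q[i];
        den += wi * (double)q[i] * (double)q[i];
    }
    /* With all-zero codes there is nothing to scale. */
    return den > 0.0 ? (float)(num / den) : 0.0f;
}
```

For example, with x = {1, 2} and q = {1, 1}, uniform weights give a scale of 1.5, while weights {3, 1} move it to 1.25, closer to the more important first element.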
Outputs
- Quantized blocks -- Packed quantized representation
- Float data -- Dequantized float32 values
- Size -- Number of bytes written by importance-matrix quantization functions
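The expected output size can be estimated ahead of time to size the destination buffer. The sketch below assumes that each row is stored as n_per_row / block_elems consecutive blocks with no padding, and that a Q4_0 block holds 32 elements in 18 bytes (a 16-bit scale plus 16 bytes of packed 4-bit codes). The helper sketch_quantized_size is hypothetical and not part of ggml-quants.h.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: bytes needed for nrows rows of n_per_row elements,
   given the number of elements and bytes per block for the target format.
   Returns 0 when the row length is not a whole number of blocks. */
static size_t sketch_quantized_size(int64_t nrows, int64_t n_per_row,
                                    int64_t block_elems, size_t block_bytes) {
    if (nrows < 0 || n_per_row < 0 || block_elems <= 0 ||
        n_per_row % block_elems != 0) {
        return 0;
    }
    return (size_t)nrows * (size_t)(n_per_row / block_elems) * block_bytes;
}
```

Under these assumptions, one row of 256 floats in Q4_0 occupies 8 blocks of 18 bytes, or 144 bytes, compared with 1024 bytes as float32.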
Usage Examples
Quantization round-trip:
#include "ggml-quants.h"
// Quantize 256 floats to Q4_0 (4-bit quantization)
float src[256] = { /* ... */ };
block_q4_0 quantized[256 / QK4_0];
quantize_row_q4_0_ref(src, quantized, 256);
// Dequantize back to float
float dst[256];
dequantize_row_q4_0(quantized, dst, 256);
// Quantization of one 256-element row, weighted by an importance matrix
float imatrix[256] = { /* activation importance weights */ };
size_t nbytes = quantize_q4_0(src, quantized, 1, 256, imatrix);
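Because the round trip is lossy, it is often useful to measure how far the restored values drift from the originals. The following sketch is one simple way to do that; sketch_max_abs_error is a hypothetical helper, not a ggml function.

```c
#include <stddef.h>

/* Hypothetical helper: largest absolute element-wise difference
   between two float arrays of length n (0 for empty input). */
static float sketch_max_abs_error(const float *a, const float *b, size_t n) {
    float worst = 0.0f;
    for (size_t i = 0; i < n; i++) {
        float e = a[i] - b[i];
        if (e < 0.0f) e = -e;
        if (e > worst) worst = e;
    }
    return worst;
}
```

In the round-trip example above, it could be called as sketch_max_abs_error(src, dst, 256) after dequantization to compare formats or to check the effect of supplying an importance matrix.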
Related Pages
Implements Principle
Related Implementations
- Implementation:Ggml_org_Ggml_Ggml_quantize_chunk -- Higher-level quantization API
- Implementation:Ggml_org_Ggml_Python_utils -- Python bindings for quantization
- Implementation:Ggml_org_Ggml_Hexagon_matmul_ops -- Hexagon uses quantized formats