Implementation:Ggml org Ggml Quants api

From Leeroopedia


Implementation Metadata
File Name: src/ggml-quants.h
Repository: ggml-org/ggml
Lines: 106
Language: C
Domain Tags: ML_Infrastructure, Quantization, API_Design
Status: Active
Last Updated: 2025-05-15 12:00 GMT
Knowledge Sources: ggml-org/ggml repository

Overview

src/ggml-quants.h is the header declaring all quantization, dequantization, and importance-matrix quantization function prototypes for GGML's supported quantized formats. It serves as the public contract between the quantization implementation (ggml-quants.c) and its consumers (the CPU backend, model loading/saving code).

Description

The file declares three categories of functions:

1. Reference quantization (quantize_row_*_ref) -- Deterministic float32-to-quantized-block conversion:

  • Standard formats: q4_0, q4_1, q5_0, q5_1, q8_0, q8_1
  • K-quant formats: q2_K, q3_K, q4_K, q5_K, q6_K, q8_K
  • Special formats: mxfp4, tq1_0, tq2_0
  • IQ formats: iq3_xxs, iq4_nl, iq4_xs, iq3_s, iq2_s

2. Dequantization (dequantize_row_*) -- Quantized-block-to-float32 conversion for all formats including the full IQ family (iq2_xxs, iq2_xs, iq2_s, iq3_xxs, iq1_s, iq1_m, iq4_nl, iq4_xs, iq3_s).

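3. Importance-matrix quantization (quantize_*) -- Higher-level functions that quantize nrows rows of n_per_row values each and accept an imatrix parameter. The importance matrix supplies per-element weights, typically derived from activation statistics, so that quantization error on the most influential values is penalized more heavily. This generally improves quality at the same bit rate.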

The header also declares init/free functions for the iq2xs and iq3xs lookup tables. All functions are marked GGML_API, which exports them so that code outside the core library, such as the CPU backend, can call them.
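To make the block-quantization idea behind the first two categories concrete, here is a minimal, standalone sketch of a q8_0-style round trip. It is not the code from ggml-quants.c: the names sketch_block_q8, sketch_quantize_row_q8, sketch_dequantize_row_q8 and SKETCH_QK are hypothetical, and the scale is stored as a plain float for simplicity, whereas the real block_q8_0 stores it as fp16.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_QK 32 /* elements per block (assumption: mirrors 32-element blocks) */

/* Hypothetical block layout: one scale plus SKETCH_QK signed 8-bit values. */
typedef struct {
    float  d;              /* per-block scale (float here, fp16 in the real format) */
    int8_t qs[SKETCH_QK];  /* quantized values */
} sketch_block_q8;

/* Round half away from zero without depending on libm. */
int sketch_round(float v) {
    return (int)(v + (v >= 0.0f ? 0.5f : -0.5f));
}

/* float32 -> blocks: scale each block so its largest magnitude maps to 127. */
void sketch_quantize_row_q8(const float *x, sketch_block_q8 *y, int64_t k) {
    assert(k % SKETCH_QK == 0); /* k must be a whole number of blocks */
    const int64_t nb = k / SKETCH_QK;
    for (int64_t i = 0; i < nb; i++) {
        float amax = 0.0f;
        for (int j = 0; j < SKETCH_QK; j++) {
            float v = x[i * SKETCH_QK + j];
            if (v < 0.0f) v = -v;
            if (v > amax) amax = v;
        }
        const float d  = amax / 127.0f;
        const float id = d != 0.0f ? 1.0f / d : 0.0f;
        y[i].d = d;
        for (int j = 0; j < SKETCH_QK; j++) {
            y[i].qs[j] = (int8_t) sketch_round(x[i * SKETCH_QK + j] * id);
        }
    }
}

/* blocks -> float32: multiply each stored integer by its block scale. */
void sketch_dequantize_row_q8(const sketch_block_q8 *x, float *y, int64_t k) {
    assert(k % SKETCH_QK == 0);
    const int64_t nb = k / SKETCH_QK;
    for (int64_t i = 0; i < nb; i++) {
        for (int j = 0; j < SKETCH_QK; j++) {
            y[i * SKETCH_QK + j] = x[i].d * (float) x[i].qs[j];
        }
    }
}
```

In this sketch each reconstructed value differs from the original by at most about half the block scale, which is the basic trade-off every block format in the header makes between storage and precision.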

Usage

#include "ggml-quants.h"

// Quantize float data to Q4_0 (the element count must be a multiple of QK4_0)
float input[256] = {0};  // fill with real data before quantizing
block_q4_0 output[256 / QK4_0];
quantize_row_q4_0_ref(input, output, 256);

// Dequantize back to float
float restored[256];
dequantize_row_q4_0(output, restored, 256);

// Quantize 1 row of 256 values; NULL means no importance matrix is supplied
size_t bytes = quantize_q4_0(input, output, 1, 256, NULL);

Code Reference

Source Location

Repository: ggml-org/ggml
File: src/ggml-quants.h
Lines: 106

Key Signatures

// Reference quantization (float32 -> quantized block)
GGML_API void quantize_row_q4_0_ref(const float * GGML_RESTRICT x, block_q4_0 * GGML_RESTRICT y, int64_t k);
GGML_API void quantize_row_q8_0_ref(const float * GGML_RESTRICT x, block_q8_0 * GGML_RESTRICT y, int64_t k);
GGML_API void quantize_row_mxfp4_ref(const float * GGML_RESTRICT x, block_mxfp4 * GGML_RESTRICT y, int64_t k);

// Dequantization (quantized block -> float32)
GGML_API void dequantize_row_q4_0(const block_q4_0 * GGML_RESTRICT x, float * GGML_RESTRICT y, int64_t k);
GGML_API void dequantize_row_q8_0(const block_q8_0 * GGML_RESTRICT x, float * GGML_RESTRICT y, int64_t k);
GGML_API void dequantize_row_iq4_nl(const block_iq4_nl * GGML_RESTRICT x, float * GGML_RESTRICT y, int64_t k);

// Importance-matrix quantization (imatrix weights the quantization error)
GGML_API size_t quantize_q4_0(const float * GGML_RESTRICT src, void * GGML_RESTRICT dst,
    int64_t nrows, int64_t n_per_row, const float * imatrix);
GGML_API size_t quantize_iq4_nl(const float * GGML_RESTRICT src, void * GGML_RESTRICT dst,
    int64_t nrows, int64_t n_per_row, const float * imatrix);

// Lookup table management for IQ formats
GGML_API void iq2xs_init_impl(enum ggml_type type);
GGML_API void iq2xs_free_impl(enum ggml_type type);

I/O Contract

Inputs

  • Float data -- Source float32 array for quantization
  • Quantized blocks -- Source quantized block array for dequantization
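  • k -- Number of elements to process; must be a multiple of the format's block size (for example QK4_0 for q4_0)
  • imatrix -- Importance matrix used to weight quantization error; may be NULL for formats that do not require one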
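How importance weights can influence quantization is easiest to see in a small example. The following is a generic sketch, not the search that ggml-quants.c performs: it scores a candidate scale by a weighted squared error over a 4-bit signed range and keeps the best of a few candidates. The function names sketch_weighted_error and sketch_pick_scale, the candidate grid, and the [-8, 7] range are all assumptions made for illustration.

```c
#include <stddef.h>

/* Weighted squared error between x[i] and its reconstruction d * q,
   where q is x[i] / d rounded and clamped to [-8, 7].
   A NULL weight pointer means every element has weight 1. */
double sketch_weighted_error(const float *x, const float *w, int n, float d) {
    double err = 0.0;
    for (int i = 0; i < n; i++) {
        int q = 0;
        if (d != 0.0f) {
            const float v = x[i] / d;
            q = (int)(v + (v >= 0.0f ? 0.5f : -0.5f));
            if (q < -8) q = -8;
            if (q >  7) q =  7;
        }
        const double diff = (double) x[i] - (double) d * q;
        err += (w ? (double) w[i] : 1.0) * diff * diff;
    }
    return err;
}

/* Start from the baseline scale amax / 8 and try a few nearby candidates,
   keeping whichever gives the lowest weighted error. */
float sketch_pick_scale(const float *x, const float *w, int n) {
    float amax = 0.0f;
    for (int i = 0; i < n; i++) {
        const float v = x[i] < 0.0f ? -x[i] : x[i];
        if (v > amax) amax = v;
    }
    if (amax == 0.0f) return 0.0f;

    float  best_d   = amax / 8.0f;
    double best_err = sketch_weighted_error(x, w, n, best_d);
    for (int step = -4; step <= 4; step++) {
        const float  d   = amax / (8.0f + 0.25f * (float) step);
        const double err = sketch_weighted_error(x, w, n, d);
        if (err < best_err) {
            best_err = err;
            best_d   = d;
        }
    }
    return best_d;
}
```

With uniform weights the search minimizes plain squared error; with non-uniform weights it trades accuracy on low-importance elements for accuracy on high-importance ones, which is the effect the imatrix parameter is meant to provide.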

Outputs

  • Quantized blocks -- Packed quantized representation
  • Float data -- Dequantized float32 values
  • Size -- Number of bytes written by importance-matrix quantization functions
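The returned size can be cross-checked with simple arithmetic, assuming the result is the number of rows times the packed size of one row. The helper below is hypothetical (sketch_quantized_size is not part of ggml); the block figures used in the usage note are the commonly documented layouts of 32 elements in 18 bytes for q4_0 and 32 elements in 34 bytes for q8_0, and should be verified against ggml-common.h for the version in use.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bytes needed for nrows rows of n_per_row elements, given a format that
   packs block_elems elements into block_bytes bytes. Hypothetical helper. */
size_t sketch_quantized_size(int64_t nrows, int64_t n_per_row,
                             int64_t block_elems, size_t block_bytes) {
    assert(n_per_row % block_elems == 0); /* rows must be whole blocks */
    return (size_t) nrows * (size_t) (n_per_row / block_elems) * block_bytes;
}
```

Under these assumptions, sketch_quantized_size(1, 256, 32, 18) gives 144, so one 256-element row in q4_0 would occupy 8 blocks of 18 bytes, compared with 1024 bytes as float32.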

Usage Examples

Quantization round-trip:

#include "ggml-quants.h"

// Quantize 256 floats to Q4_0 (4-bit quantization)
float src[256] = { /* ... */ };
block_q4_0 quantized[256 / QK4_0];
quantize_row_q4_0_ref(src, quantized, 256);

// Dequantize back to float
float dst[256];
dequantize_row_q4_0(quantized, dst, 256);

// Importance-weighted quantization: one weight per element of the row
float imatrix[256] = { /* activation importance weights */ };
size_t nbytes = quantize_q4_0(src, quantized, 1, 256, imatrix);
