

Implementation:Ggml org Ggml Cpu quantization

From Leeroopedia


Metadata

Page Type: Implementation (Quantization)
Knowledge Sources: GGML
Domains: ML_Infrastructure, Tensor_Computing, CPU_Backend, Quantization
Last Updated: 2025-05-15 12:00 GMT

Overview

The CPU backend's row quantization functions and quantized dot products, written as portable generic implementations that architecture-specific versions can replace.

Description

quants.c implements the quantization and quantized dot-product primitives needed to run quantized models on the CPU. It provides:

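  1. Quantization row functions: convert float arrays into quantized block formats. Covered formats are q4_0, q4_1, q5_0, q5_1, q8_0, q8_1, q2_K through q6_K, q8_K, the ternary formats tq1_0 and tq2_0, iq4_nl, iq4_xs, and mxfp4. Most simply call the matching *_ref reference implementation declared in ggml-quants.h (for example, quantize_row_q4_0 calls quantize_row_q4_0_ref).
  2. Quantized dot products: the generic ggml_vec_dot_*_generic functions compute the dot product of a quantized weight vector with a vector in a companion 8-bit format (q8_0, q8_1, or q8_K, depending on the weight type). For example, ggml_vec_dot_q4_0_q8_0_generic unpacks each 4-bit nibble, subtracts the offset of 8, multiplies it by the matching int8 value in the q8_0 block, and scales each block's integer sum by the product of the two block scales before adding it to the result.
  3. Architecture override mechanism: functions that have optimized variants are defined with a _generic suffix. When the target architecture provides no optimized version, arch-fallback.h #defines the _generic name to the canonical name (for example, ggml_vec_dot_q4_0_q8_0_generic becomes ggml_vec_dot_q4_0_q8_0), so the generic code provides that symbol. Optimized versions for ARM NEON, x86 AVX/AVX2, and other targets live in arch/arm/quants.c, arch/x86/quants.c, and so on.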
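The renaming trick in item 3 can be shown in isolation. The sketch below is a hypothetical, self-contained illustration, not code from ggml: square_sum and the SKETCH_HAVE_ARCH_SQUARE_SUM flag are invented names standing in for a real kernel and for the per-architecture checks in arch-fallback.h. Because the target "has no optimized version", the macro turns the generic definition into the canonical symbol that callers link against.

```c
#include <assert.h>

// Hypothetical flag standing in for arch-fallback.h's per-architecture lists:
// 0 means this target ships no optimized square_sum.
#define SKETCH_HAVE_ARCH_SQUARE_SUM 0

#if !SKETCH_HAVE_ARCH_SQUARE_SUM
// Same pattern as arch-fallback.h: rename the generic function to the canonical name.
#define square_sum_generic square_sum
#endif

// Canonical declaration that callers use (the role quants.h plays in ggml).
float square_sum(const float * x, int n);

// Generic implementation, defined with the _generic suffix as in quants.c.
// With the #define above, the preprocessor turns this into the definition of square_sum.
float square_sum_generic(const float * x, int n) {
    assert(n >= 0);
    float acc = 0.0f;
    for (int i = 0; i < n; ++i) {
        acc += x[i] * x[i];
    }
    return acc;
}
```

If SKETCH_HAVE_ARCH_SQUARE_SUM were 1, an architecture file would define square_sum itself, and square_sum_generic would remain a separate symbol that the optimized code could still fall back to.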

Usage

These functions are called internally by the matrix multiplication and tensor copy operations. They are not typically called directly by user code.
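To show how a matrix multiplication can be built on top of a vec_dot-style function, here is a simplified, hypothetical driver. It only mirrors the parameter shape of the ggml_vec_dot_* functions (n, s, bs, vx, bx, vy, by, nrc) and uses plain floats so it needs no ggml headers; vec_dot_f32 and matmul_rows are illustrative names. The real CPU mul_mat path is more involved (threading, chunking, and converting the second operand to the dot product's companion format, such as q8_0 for q4_0 weights, before calling the kernel).

```c
#include <assert.h>
#include <stddef.h>

// Function-pointer type with the same parameter shape as ggml's vec_dot kernels.
typedef void (*vec_dot_fn)(int n, float * s, size_t bs,
                           const void * vx, size_t bx,
                           const void * vy, size_t by, int nrc);

// Float stand-in for a quantized kernel: one dot product per call (nrc == 1).
static void vec_dot_f32(int n, float * s, size_t bs,
                        const void * vx, size_t bx,
                        const void * vy, size_t by, int nrc) {
    (void) bs; (void) bx; (void) by;  // strides are unused when nrc == 1
    assert(nrc == 1);
    const float * x = vx;
    const float * y = vy;
    float acc = 0.0f;
    for (int i = 0; i < n; ++i) {
        acc += x[i] * y[i];
    }
    *s = acc;
}

// Hypothetical driver: dst[i * rows_b + j] = dot(row i of a, row j of b).
// a holds rows_a rows and b holds rows_b rows, each row n elements long;
// row_bytes_a / row_bytes_b are the byte strides between consecutive rows.
static void matmul_rows(vec_dot_fn dot, int n,
                        const void * a, size_t row_bytes_a, int rows_a,
                        const void * b, size_t row_bytes_b, int rows_b,
                        float * dst) {
    for (int i = 0; i < rows_a; ++i) {
        for (int j = 0; j < rows_b; ++j) {
            dot(n, &dst[i * rows_b + j], 0,
                (const char *) a + i * row_bytes_a, 0,
                (const char *) b + j * row_bytes_b, 0, 1);
        }
    }
}
```

Passing byte strides rather than typed pointers is what lets the same loop drive any block format: a quantized row is just an opaque run of bytes whose size depends on the type.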

Code Reference

Source Location

GGML repo, file: src/ggml-cpu/quants.c (1193 lines).

Signature

// Quantization row functions
void quantize_row_q4_0(const float * GGML_RESTRICT x, void * GGML_RESTRICT y, int64_t k);
void quantize_row_q4_1(const float * GGML_RESTRICT x, void * GGML_RESTRICT y, int64_t k);
void quantize_row_q8_0_generic(const float * GGML_RESTRICT x, void * GGML_RESTRICT y, int64_t k);
void quantize_row_q2_K(const float * GGML_RESTRICT x, void * GGML_RESTRICT vy, int64_t k);
void quantize_row_q4_K(const float * GGML_RESTRICT x, void * GGML_RESTRICT vy, int64_t k);

// Quantized dot product functions
void ggml_vec_dot_q4_0_q8_0_generic(int n, float * GGML_RESTRICT s, size_t bs,
    const void * GGML_RESTRICT vx, size_t bx,
    const void * GGML_RESTRICT vy, size_t by, int nrc);

Import

#include "quants.h"

I/O Contract

Inputs

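x (const float *, required): Source float array to quantize.
y / vy (void *, required): Destination buffer for quantized blocks.
k (int64_t, required): Number of elements; must be a multiple of the format's block size (32 for q4_0 and q8_0, QK_K = 256 for the K-quants).
n (int, required for dot products): Number of elements in each input vector; must be a multiple of the block size.
vx, vy (const void *, required for dot products): Quantized input vectors.
nrc (int, required for dot products): Number of results to compute; the generic implementations support only 1.
bs, bx, by (size_t, dot products): Strides used by multi-row (nrc > 1) variants; ignored by the generic implementations.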
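Because k must cover whole blocks, the destination size follows directly from the block geometry. The sketch below computes it by hand under stated assumptions: the byte counts assume the block layouts in ggml-common.h with a 2-byte fp16 scale (q4_0: scale plus 16 packed bytes; q8_0: scale plus 32 int8 values; q4_K: two fp16 scales, 12 scale bytes, and 128 packed bytes per 256 elements). block_geom and quantized_row_bytes are hypothetical names; in real code, ggml_row_size(type, k) from ggml.h returns this value.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Elements per block and bytes per block for one quantized format.
typedef struct {
    int64_t elems;
    size_t  bytes;
} block_geom;

// Assumed layouts (fp16 scales are 2 bytes each).
static const block_geom GEOM_Q4_0 = { 32,  2 + 32 / 2 };             // 18 bytes
static const block_geom GEOM_Q8_0 = { 32,  2 + 32 };                 // 34 bytes
static const block_geom GEOM_Q4_K = { 256, 2 + 2 + 12 + 256 / 2 };   // 144 bytes

// Bytes needed for k quantized elements; k must be a whole number of blocks.
static size_t quantized_row_bytes(block_geom g, int64_t k) {
    assert(k % g.elems == 0);
    return (size_t) (k / g.elems) * g.bytes;
}
```

For example, 256 floats (1024 bytes) shrink to 144 bytes as q4_0 and 272 bytes as q8_0.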

Outputs

y (void *): Quantized block data written to the output buffer.
s (float *): Scalar dot product result (for dot product functions).

Usage Examples

Quantizing a Float Vector to Q4_0

#include "quants.h"

float data[256] = { /* ... */ };
block_q4_0 quantized[256 / QK4_0];

// Quantize 256 floats into 256 / QK4_0 = 8 q4_0 blocks (QK4_0 is 32)
quantize_row_q4_0(data, quantized, 256);

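quantize_row_q4_0 delegates to quantize_row_q4_0_ref. The following is a self-contained re-statement of that q4_0 scheme as a sketch, not ggml's code: it keeps the block scale as a float (ggml stores it as fp16), and the sketch_* names and SKETCH_QK4_0 constant are hypothetical. A matching dequantizer is included so the round trip can be checked.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_QK4_0 32  // elements per block; same value as ggml's QK4_0

// Simplified q4_0 block. ggml stores d as fp16 (ggml_half); this sketch keeps a float.
typedef struct {
    float   d;                     // per-block scale
    uint8_t qs[SKETCH_QK4_0 / 2];  // 32 unsigned 4-bit codes, two per byte
} sketch_block_q4_0;

// The signed value with the largest magnitude sets d = value / -8, and each
// element is stored as x / d rounded to the nearest integer, plus 8, clamped to 15.
static void sketch_quantize_q4_0(const float * x, sketch_block_q4_0 * y, int64_t k) {
    const int qk = SKETCH_QK4_0;
    assert(k % qk == 0);
    for (int64_t i = 0; i < k / qk; ++i) {
        const float * xb = x + i * qk;
        float amax = 0.0f, vmax = 0.0f;
        for (int j = 0; j < qk; ++j) {
            const float a = xb[j] < 0.0f ? -xb[j] : xb[j];
            if (amax < a) { amax = a; vmax = xb[j]; }
        }
        const float d  = vmax / -8.0f;
        const float id = d != 0.0f ? 1.0f / d : 0.0f;
        y[i].d = d;
        for (int j = 0; j < qk / 2; ++j) {
            // element j goes in the low nibble, element j + 16 in the high nibble
            int q0 = (int) (xb[j] * id + 8.5f);
            int q1 = (int) (xb[j + qk / 2] * id + 8.5f);
            if (q0 > 15) q0 = 15;
            if (q1 > 15) q1 = 15;
            y[i].qs[j] = (uint8_t) (q0 | (q1 << 4));
        }
    }
}

// Inverse mapping: x is approximately (code - 8) * d.
static void sketch_dequantize_q4_0(const sketch_block_q4_0 * x, float * y, int64_t k) {
    const int qk = SKETCH_QK4_0;
    for (int64_t i = 0; i < k / qk; ++i) {
        float * yb = y + i * qk;
        for (int j = 0; j < qk / 2; ++j) {
            yb[j]          = ((x[i].qs[j] & 0x0F) - 8) * x[i].d;
            yb[j + qk / 2] = ((x[i].qs[j] >> 4) - 8) * x[i].d;
        }
    }
}
```

Mapping the largest-magnitude value to -8 uses the full signed range [-8, 7]; values of the opposite sign can reach +8 and are clamped to code 15.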
Computing a Quantized Dot Product

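#include "quants.h"

float a[256] = { /* ... */ };
float b[256] = { /* ... */ };

// Quantize the two operands: q4_0 on the left, q8_0 on the right
block_q4_0 q4_data[256 / QK4_0];
block_q8_0 q8_data[256 / QK8_0];
quantize_row_q4_0(a, q4_data, 256);
quantize_row_q8_0(b, q8_data, 256);

float result;
ggml_vec_dot_q4_0_q8_0_generic(
    256,          // n: elements per vector (multiple of QK8_0)
    &result,      // s: output scalar
    0,            // bs: unused when nrc == 1
    q4_data,      // vx: q4_0 blocks
    0,            // bx: unused when nrc == 1
    q8_data,      // vy: q8_0 blocks
    0,            // by: unused when nrc == 1
    1             // nrc: rows to compute; the generic version supports only 1
);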
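The arithmetic inside ggml_vec_dot_q4_0_q8_0_generic can be sketched on simplified blocks. This is an illustration under assumptions, not the library source: both block types here use float scales instead of fp16, and the sketch_* names are hypothetical. Each block accumulates an exact integer sum of (nibble - 8) * int8 products, and floating point enters only once per block, when that sum is multiplied by the two scales.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_QK 32  // q4_0 and q8_0 both use 32-element blocks

// Simplified blocks: ggml stores d as fp16; floats keep this sketch header-free.
typedef struct { float d; uint8_t qs[SKETCH_QK / 2]; } sketch_block_q4_0;
typedef struct { float d; int8_t  qs[SKETCH_QK];     } sketch_block_q8_0;

// Dot product of n elements stored as q4_0 blocks (x) and q8_0 blocks (y).
static float sketch_vec_dot_q4_0_q8_0(int n, const sketch_block_q4_0 * x,
                                      const sketch_block_q8_0 * y) {
    assert(n % SKETCH_QK == 0);
    const int nb = n / SKETCH_QK;
    float sumf = 0.0f;
    for (int ib = 0; ib < nb; ++ib) {
        int sumi = 0;
        for (int j = 0; j < SKETCH_QK / 2; ++j) {
            const int v0 = (x[ib].qs[j] & 0x0F) - 8;  // element j
            const int v1 = (x[ib].qs[j] >> 4) - 8;    // element j + 16
            sumi += v0 * y[ib].qs[j] + v1 * y[ib].qs[j + SKETCH_QK / 2];
        }
        // one float multiply per block: integer sum times both block scales
        sumf += (float) sumi * x[ib].d * y[ib].d;
    }
    return sumf;
}
```

Keeping the inner loop in integers is what makes these kernels amenable to SIMD integer multiply-accumulate instructions in the architecture-specific versions.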
