Implementation:Ggml org Ggml Cpu quantization
Metadata
| Field | Value |
|---|---|
| Page Type | Implementation (Quantization) |
| Knowledge Sources | GGML |
| Domains | ML_Infrastructure, Tensor_Computing, CPU_Backend, Quantization |
| Last Updated | 2025-05-15 12:00 GMT |
Overview
CPU-specific quantization row functions and quantized dot product implementations, providing generic fallback implementations that can be overridden by architecture-specific versions.
Description
quants.c implements the quantization and dequantization primitives essential for running quantized models on CPU. It provides:
- Quantization row functions: Convert float arrays to quantized block formats. Covers q4_0, q4_1, q5_0, q5_1, q8_0, q8_1, q2_K through q6_K, q8_K, tq1_0 (ternary), tq2_0, iq4_nl, iq4_xs, and mxfp4 formats. Most delegate to the *_ref reference implementations from ggml-quants.h.
- Quantized dot products: Generic ggml_vec_dot_q*_generic functions that compute the dot product between quantized vectors. For example, ggml_vec_dot_q4_0_q8_0_generic unpacks 4-bit nibbles, multiplies them with q8_0 values, and accumulates the results with scale factors.
- Architecture override mechanism: Functions are named with a _generic suffix. The arch-fallback.h header renames them to canonical names when no architecture-specific optimized version exists. Optimized versions for ARM NEON, x86 AVX/AVX2, etc., live in arch/arm/quants.c, arch/x86/quants.c, etc.
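The renaming idea behind the override mechanism can be illustrated with a minimal sketch. Every name in it (demo_vec_dot, demo_vec_dot_generic, DEMO_HAVE_ARCH_DOT) is hypothetical and does not appear in GGML. The sketch only assumes that the fallback works by a preprocessor rename, as the description above suggests.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flag: an architecture-specific file would define it
 * when it supplies its own optimized demo_vec_dot(). */
#ifndef DEMO_HAVE_ARCH_DOT
/* No optimized version exists, so the generic definition below is
 * compiled under the canonical name. */
#define demo_vec_dot_generic demo_vec_dot
#endif

/* Portable fallback, written once with the _generic suffix. */
float demo_vec_dot_generic(const float *a, const float *b, size_t n) {
    assert(a != NULL && b != NULL);
    float sum = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        sum += a[i] * b[i];
    }
    return sum;
}
```

In this sketch callers always use the canonical name demo_vec_dot(); whether that resolves to the fallback or to an optimized version is decided at compile time.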
Usage
These functions are called internally by the matrix multiplication and tensor copy operations. They are not typically called directly by user code.
Code Reference
Source Location
GGML repo, file: src/ggml-cpu/quants.c (1193 lines).
Signature
// Quantization row functions
void quantize_row_q4_0(const float * GGML_RESTRICT x, void * GGML_RESTRICT y, int64_t k);
void quantize_row_q4_1(const float * GGML_RESTRICT x, void * GGML_RESTRICT y, int64_t k);
void quantize_row_q8_0_generic(const float * GGML_RESTRICT x, void * GGML_RESTRICT y, int64_t k);
void quantize_row_q2_K(const float * GGML_RESTRICT x, void * GGML_RESTRICT vy, int64_t k);
void quantize_row_q4_K(const float * GGML_RESTRICT x, void * GGML_RESTRICT vy, int64_t k);
// Quantized dot product functions
void ggml_vec_dot_q4_0_q8_0_generic(int n, float * GGML_RESTRICT s, size_t bs,
const void * GGML_RESTRICT vx, size_t bx,
const void * GGML_RESTRICT vy, size_t by, int nrc);
Import
#include "quants.h"
I/O Contract
Inputs
| Parameter | Type | Required | Description |
|---|---|---|---|
| x | const float * | Yes | Source float array to quantize. |
| y / vy | void * | Yes | Destination buffer for quantized blocks. |
| k | int64_t | Yes | Number of elements (must be a multiple of the block size, e.g., 32 for q4_0). |
| n | int | Yes (dot) | Number of elements in dot product vectors. |
| vx, vy | const void * | Yes (dot) | Quantized input vectors for dot product. |
Outputs
| Output | Type | Description |
|---|---|---|
| y | void * | Quantized block data written to the output buffer. |
| s | float * | Scalar dot product result (for dot product functions). |
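Because k must be a multiple of the block size, the size of the destination buffer follows directly from the block count. The sketch below shows that arithmetic for q4_0 and q8_0, assuming each block holds a 2-byte fp16 scale followed by the packed values. The helper demo_row_bytes and the DEMO_* constants are hypothetical names introduced for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed layouts: a 2-byte fp16 scale followed by the packed values. */
enum {
    DEMO_QK4_0      = 32,                  /* values per q4_0 block      */
    DEMO_Q4_0_BYTES = 2 + DEMO_QK4_0 / 2,  /* scale + 16 bytes of nibbles */
    DEMO_QK8_0      = 32,                  /* values per q8_0 block      */
    DEMO_Q8_0_BYTES = 2 + DEMO_QK8_0       /* scale + 32 int8 values     */
};

/* Bytes needed to hold k quantized elements of the given block shape. */
size_t demo_row_bytes(int64_t k, int block_elems, int block_bytes) {
    assert(k >= 0 && k % block_elems == 0);
    return (size_t)(k / block_elems) * (size_t)block_bytes;
}
```

In real code, sizeof on the block struct or the ggml_row_size() helper from ggml.h is the safer way to obtain these sizes, since hard-coded constants can drift from the actual layout.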
Usage Examples
Quantizing a Float Vector to Q4_0
#include "quants.h"
float data[256] = { /* ... */ };
block_q4_0 quantized[256 / QK4_0];
// Quantize 256 floats into q4_0 blocks
quantize_row_q4_0(data, quantized, 256);
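To make the effect of the call above concrete, here is a simplified, self-contained sketch of q4_0-style block quantization and its inverse. It is not the GGML implementation: the demo_* names are hypothetical, and the scale is kept as a float for readability, whereas the real block stores it as fp16. It assumes the usual scheme in which the largest-magnitude value of a block maps to the lowest 4-bit code.

```c
#include <assert.h>
#include <math.h>
#include <stdint.h>

#define DEMO_QK 32  /* values per block, matching q4_0 */

/* Simplified block: a float scale instead of the fp16 used by GGML. */
typedef struct {
    float   d;                /* per-block scale           */
    uint8_t qs[DEMO_QK / 2];  /* two 4-bit codes per byte  */
} demo_block_q4_0;

void demo_quantize_row_q4_0(const float *x, demo_block_q4_0 *y, int64_t k) {
    assert(k % DEMO_QK == 0);
    const int64_t nb = k / DEMO_QK;
    for (int64_t i = 0; i < nb; ++i) {
        /* Find the value with the largest magnitude, keeping its sign. */
        float amax = 0.0f, max = 0.0f;
        for (int j = 0; j < DEMO_QK; ++j) {
            const float v = x[i * DEMO_QK + j];
            if (amax < fabsf(v)) { amax = fabsf(v); max = v; }
        }
        /* Choose the scale so that this extreme value maps to code 0. */
        const float d  = max / -8.0f;
        const float id = d != 0.0f ? 1.0f / d : 0.0f;
        y[i].d = d;
        for (int j = 0; j < DEMO_QK / 2; ++j) {
            const float x0 = x[i * DEMO_QK + j] * id;
            const float x1 = x[i * DEMO_QK + DEMO_QK / 2 + j] * id;
            /* Offset by 8, round, and clamp to the 4-bit range. */
            int q0 = (int)(x0 + 8.5f); if (q0 > 15) q0 = 15;
            int q1 = (int)(x1 + 8.5f); if (q1 > 15) q1 = 15;
            /* Low nibble: first half of the block; high nibble: second half. */
            y[i].qs[j] = (uint8_t)(q0 | (q1 << 4));
        }
    }
}

void demo_dequantize_row_q4_0(const demo_block_q4_0 *y, float *x, int64_t k) {
    assert(k % DEMO_QK == 0);
    const int64_t nb = k / DEMO_QK;
    for (int64_t i = 0; i < nb; ++i) {
        for (int j = 0; j < DEMO_QK / 2; ++j) {
            x[i * DEMO_QK + j]               = (float)((y[i].qs[j] & 0x0F) - 8) * y[i].d;
            x[i * DEMO_QK + DEMO_QK / 2 + j] = (float)((y[i].qs[j] >> 4) - 8) * y[i].d;
        }
    }
}
```

Each block carries its own scale, so an outlier in one block does not reduce the precision of the others.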
Computing a Quantized Dot Product
#include "quants.h"
float result;
ggml_vec_dot_q4_0_q8_0_generic(
    256,      // n: number of elements (a multiple of the block size)
    &result,  // s: output scalar
    0,        // bs (unused)
    q4_data,  // vx: quantized q4_0 vector
    0,        // bx (unused)
    q8_data,  // vy: quantized q8_0 vector
    0,        // by (unused)
    1         // nrc: number of rows to compute
);
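The arithmetic behind this call can be sketched in plain C. The version below is a simplified stand-in, not the GGML function: the demo_* names are hypothetical, the block scales are floats rather than fp16, and the stride and row-count parameters are omitted. It follows the steps named in the description: unpack the 4-bit nibbles, multiply with the q8_0 values, and apply the two scale factors once per block.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_QK 32  /* values per block for both formats */

/* Simplified blocks with float scales (GGML stores fp16). */
typedef struct { float d; uint8_t qs[DEMO_QK / 2]; } demo_block_q4_0;
typedef struct { float d; int8_t  qs[DEMO_QK];     } demo_block_q8_0;

float demo_vec_dot_q4_0_q8_0(int n, const demo_block_q4_0 *x,
                             const demo_block_q8_0 *y) {
    assert(n % DEMO_QK == 0);
    const int nb = n / DEMO_QK;
    float sumf = 0.0f;
    for (int ib = 0; ib < nb; ++ib) {
        int sumi = 0;  /* integer accumulator for one block */
        for (int j = 0; j < DEMO_QK / 2; ++j) {
            /* Unpack two 4-bit codes and remove the +8 offset. */
            const int v0 = (x[ib].qs[j] & 0x0F) - 8;
            const int v1 = (x[ib].qs[j] >> 4) - 8;
            /* Low nibble pairs with the first half, high with the second. */
            sumi += v0 * y[ib].qs[j];
            sumi += v1 * y[ib].qs[j + DEMO_QK / 2];
        }
        /* Apply both block scales once per block. */
        sumf += (float)sumi * x[ib].d * y[ib].d;
    }
    return sumf;
}
```

Accumulating in integers inside a block and scaling once at the end keeps the inner loop free of floating-point work, which is also what makes this loop a natural target for SIMD versions.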
Related Pages
- Ggml_org_Ggml_Cpu_tensor_ops -- Matrix multiply operations that use quantized dot products.
- Ggml_org_Ggml_Cpu_vec_api -- Vectorized math primitives at the float level.
- Ggml_org_Ggml_Cpu_simd_mappings -- SIMD macros used by optimized quantization implementations.
- Ggml_org_Ggml_Cpu_amx_mmq -- AMX-accelerated quantized matmul that uses these quantization types.