Implementation:Ggml org Ggml Cpu arm quants
Metadata
| Field | Value |
|---|---|
| Page Type | Implementation (Architecture-Specific SIMD) |
| Knowledge Sources | GGML |
| Domains | ML_Infrastructure, Tensor_Computing, SIMD_Optimization |
| Last Updated | 2025-05-15 12:00 GMT |
Overview
ARM NEON SIMD-optimized quantization, dequantization, and dot product routines for all GGML quantized tensor formats on AArch64 processors.
Description
arch/arm/quants.c implements the ARM NEON accelerated path for the full suite of GGML quantization operations. Each function processes its data block by block (32 elements per block for q8_0 and q8_1, 256 for q8_K), using NEON intrinsics to achieve high throughput on ARM devices.
The implementation covers three categories of operations:
Quantization functions (quantize_row_q8_0, quantize_row_q8_1, quantize_row_q8_K) convert floating-point data into quantized block formats. The pattern for each is:
- Load eight groups of four floats into float32x4_t vectors
- Compute the absolute maximum via tree reduction using vabsq_f32 and vmaxq_f32, then extract a scalar max with vmaxvq_f32
- Derive a scale factor d = amax / 127 and its inverse
- Multiply each element by the inverse scale and round to integer via vcvtnq_s32_f32
- Store the packed int8 results into the output block
Dot product functions (ggml_vec_dot_q4_0_q8_0, ggml_vec_dot_q4_1_q8_1, ggml_vec_dot_q5_0_q8_0, ggml_vec_dot_q5_1_q8_1, ggml_vec_dot_q8_0_q8_0, and the K-quant variants ggml_vec_dot_q2_K_q8_K through ggml_vec_dot_q6_K_q8_K) perform quantized inner products between weight and activation blocks using NEON multiply-accumulate operations.
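As a rough scalar sketch of what one of these kernels computes, the following shows a 4-bit by 8-bit block dot product. The names sketch_dot_q4, sketch_dot_q8, and sketch_vec_dot_q4_q8 are hypothetical, the scales are plain floats rather than the library's storage type, and the nibble layout (low nibbles hold the first 16 values, high nibbles the last 16, each offset by 8) is stated here as an assumption rather than taken from this page.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_QK 32  /* elements per block */

/* Hypothetical simplified blocks; scales are floats for readability. */
typedef struct { float d; uint8_t qs[SKETCH_QK / 2]; } sketch_dot_q4;  /* two 4-bit values per byte */
typedef struct { float d; int8_t  qs[SKETCH_QK];     } sketch_dot_q8;

/* Dot product of n quantized elements; n must be a multiple of the block size. */
static float sketch_vec_dot_q4_q8(int n, const sketch_dot_q4 *x, const sketch_dot_q8 *y) {
    assert(n % SKETCH_QK == 0);
    const int nb = n / SKETCH_QK;
    float sumf = 0.0f;
    for (int i = 0; i < nb; i++) {
        int32_t sumi = 0;  /* integer accumulation within a block */
        for (int j = 0; j < SKETCH_QK / 2; j++) {
            /* Assumed layout: unsigned nibble minus 8 gives a signed value in [-8, 7]. */
            const int v0 = (x[i].qs[j] & 0x0F) - 8;  /* element j */
            const int v1 = (x[i].qs[j] >> 4) - 8;    /* element j + 16 */
            sumi += v0 * y[i].qs[j] + v1 * y[i].qs[j + SKETCH_QK / 2];
        }
        /* Apply both block scales once per block. */
        sumf += (float) sumi * x[i].d * y[i].d;
    }
    return sumf;
}
```

Accumulating in integers and applying the two scales once per block is what lets a SIMD path use integer multiply-accumulate instructions for the inner loop.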
IQ (importance quantization) dot products (ggml_vec_dot_iq2_xxs_q8_K, ggml_vec_dot_iq2_xs_q8_K, ggml_vec_dot_iq2_s_q8_K, ggml_vec_dot_iq3_xxs_q8_K, and others) support the newer importance-weighted quantization formats.
Precomputed lookup tables (table_b2b_0, table_b2b_1) expand 8-bit patterns to 8-byte vectors for efficient unpacking of sub-byte quantized formats. All NEON paths are guarded by #if defined(__ARM_NEON) and fall back to scalar reference implementations when NEON is unavailable.
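To illustrate the idea of expanding an 8-bit pattern into an 8-byte vector, here is a small sketch that builds such a table at run time. The helper names are hypothetical, and the byte written for a set bit (set_value) is a parameter because this page does not state which value the real tables store; the bit-to-byte ordering (bit j goes to byte j) is likewise an assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: byte j of the result is set_value if bit j of `bits` is set, else 0. */
static uint64_t sketch_expand_bits(uint8_t bits, uint8_t set_value) {
    uint64_t out = 0;
    for (int j = 0; j < 8; j++) {
        if ((bits >> j) & 1) {
            out |= (uint64_t) set_value << (8 * j);
        }
    }
    return out;
}

/* Fill a 256-entry table so that unpacking a byte becomes a single indexed load. */
static void sketch_build_table(uint64_t table[256], uint8_t set_value) {
    assert(table != 0);
    for (int i = 0; i < 256; i++) {
        table[i] = sketch_expand_bits((uint8_t) i, set_value);
    }
}
```

With the table in place, eight packed bits can be turned into eight byte lanes by one lookup, and the result can be combined with the low bits of each element using a vector OR.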
Usage
This file is compiled as part of the GGML CPU backend when targeting ARM platforms (AArch64 with NEON). It is used automatically by the quantization and inference pipeline -- callers invoke the generic function names (e.g., quantize_row_q8_0) and the build system selects this architecture-specific implementation.
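The compile-time selection can be pictured with a small guarded function: one name, a NEON body when the target supports it, and a scalar body otherwise. This is a sketch under assumptions, not code from the file: sketch_absmax is a hypothetical helper, and the extra __aarch64__ check is added here because vmaxvq_f32 is an AArch64 intrinsic.

```c
#include <assert.h>
#include <math.h>

#if defined(__ARM_NEON) && defined(__aarch64__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: absolute maximum of n floats; n must be a multiple of 4. */
static float sketch_absmax(const float *x, int n) {
    assert(n % 4 == 0);
#if defined(__ARM_NEON) && defined(__aarch64__)
    /* NEON path: keep a running per-lane maximum, then reduce across lanes. */
    float32x4_t vmax = vdupq_n_f32(0.0f);
    for (int i = 0; i < n; i += 4) {
        vmax = vmaxq_f32(vmax, vabsq_f32(vld1q_f32(x + i)));
    }
    return vmaxvq_f32(vmax);
#else
    /* Scalar fallback used when NEON is not available. */
    float amax = 0.0f;
    for (int i = 0; i < n; i++) {
        const float v = fabsf(x[i]);
        if (v > amax) amax = v;
    }
    return amax;
#endif
}
```

Callers see only sketch_absmax; which body is compiled depends on the target, which mirrors how the generic function names resolve to this file on ARM builds.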
Code Reference
Source Location
GGML repo, file: src/ggml-cpu/arch/arm/quants.c (4052 lines).
Key Signatures
void quantize_row_q8_0(const float * GGML_RESTRICT x, void * GGML_RESTRICT vy, int64_t k);
void quantize_row_q8_1(const float * GGML_RESTRICT x, void * GGML_RESTRICT vy, int64_t k);
void quantize_row_q8_K(const float * GGML_RESTRICT x, void * GGML_RESTRICT y, int64_t k);
void ggml_vec_dot_q4_0_q8_0(int n, float * GGML_RESTRICT s, size_t bs,
                            const void * GGML_RESTRICT vx, size_t bx,
                            const void * GGML_RESTRICT vy, size_t by, int nrc);
void ggml_vec_dot_q8_0_q8_0(int n, float * GGML_RESTRICT s, size_t bs,
                            const void * GGML_RESTRICT vx, size_t bx,
                            const void * GGML_RESTRICT vy, size_t by, int nrc);
Import
#include "ggml-quants.h"
#include "ggml-cpu.h"
#include "simd-mappings.h"
I/O Contract
Inputs (Quantization)
| Parameter | Type | Description |
|---|---|---|
| x | const float * | Source array of floating-point values to be quantized. Must contain at least k elements. |
| k | int64_t | Number of elements to quantize. Must be a multiple of the block size (32 for q8_0/q8_1, 256 for q8_K). |
Outputs (Quantization)
| Output | Type | Description |
|---|---|---|
| vy / y | void * | Destination buffer for the quantized block data. Must be pre-allocated with sufficient space for k / block_size blocks. |
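The sizing rule for the destination buffer can be made concrete with a short sketch. The struct below is a hypothetical stand-in for a q8_0 block, assuming a 16-bit scale followed by 32 int8 values; real code should use sizeof on the library's own block type rather than this stand-in.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_QK8_0 32  /* elements per q8_0 block */

/* Hypothetical stand-in for a q8_0 block (assumed layout: 16-bit scale + 32 int8 values). */
typedef struct {
    uint16_t d;
    int8_t   qs[SKETCH_QK8_0];
} sketch_block_q8_0;

/* Bytes needed to hold k quantized elements; k must be a multiple of the block size. */
static size_t sketch_q8_0_row_bytes(int64_t k) {
    assert(k % SKETCH_QK8_0 == 0);
    return (size_t) (k / SKETCH_QK8_0) * sizeof(sketch_block_q8_0);
}
```

For k = 256 this gives 8 blocks, so the caller allocates 8 times the block size before calling the quantization function.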
Inputs (Dot Product)
| Parameter | Type | Description |
|---|---|---|
| n | int | Number of elements in each input vector (must be a multiple of the block size). |
| vx | const void * | Pointer to the first quantized input vector (weights). |
| vy | const void * | Pointer to the second quantized input vector (activations). |
| nrc | int | Number of rows to compute simultaneously (for batched operations). |
Outputs (Dot Product)
| Output | Type | Description |
|---|---|---|
| s | float * | Destination for the computed dot product result(s). |
Usage Examples
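// Quantize a row of 256 floats to q8_0 format using ARM NEON
float input[256];
block_q8_0 output[256 / QK8_0]; // QK8_0 == 32, so 8 blocks
// Fill input with data...
quantize_row_q8_0(input, output, 256);

// Compute dot product between q4_0 weights and q8_0 activations
block_q4_0 weight_blocks[256 / QK4_0];     // 8 blocks of quantized weights, filled by the caller
block_q8_0 activation_blocks[256 / QK8_0]; // 8 blocks, e.g. produced by quantize_row_q8_0 as above
float result;
ggml_vec_dot_q4_0_q8_0(256, &result, sizeof(result),
                       weight_blocks, sizeof(block_q4_0),
                       activation_blocks, sizeof(block_q8_0), 1); // nrc == 1: a single dot product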
Related Pages
- Principle:Ggml_org_Ggml_Architecture_Specific_SIMD_Quantization
- Implementation:Ggml_org_Ggml_Cpu_arm_repack -- ARM NEON matrix repacking and GEMM/GEMV kernels
- Implementation:Ggml_org_Ggml_Cpu_x86_quants -- x86 SSE/AVX equivalent
- Implementation:Ggml_org_Ggml_Cpu_loongarch_quants -- LoongArch LSX equivalent
- Implementation:Ggml_org_Ggml_Cpu_powerpc_quants -- PowerPC VSX equivalent
- Implementation:Ggml_org_Ggml_Cpu_riscv_quants -- RISC-V RVV equivalent
- Implementation:Ggml_org_Ggml_Cpu_s390_quants -- s390x VXE equivalent
- Implementation:Ggml_org_Ggml_Cpu_wasm_quants -- WebAssembly SIMD128 equivalent