Implementation:Ggml_org_Ggml_Cpu_arm_quants

Metadata

Field	Value
Page Type	Implementation (Architecture-Specific SIMD)
Knowledge Sources	GGML
Domains	ML_Infrastructure, Tensor_Computing, SIMD_Optimization
Last Updated	2025-05-15 12:00 GMT

Overview

ARM NEON SIMD-optimized quantization, dequantization, and dot product routines for all GGML quantized tensor formats on AArch64 processors.

Description

arch/arm/quants.c implements the ARM NEON accelerated path for the full suite of GGML quantization operations. The q8_0 and q8_1 routines work on 32-element blocks, while the q8_K-based routines work on 256-element super-blocks; both use NEON intrinsics to process several elements per instruction.

The implementation covers three categories of operations:

Quantization functions (quantize_row_q8_0, quantize_row_q8_1, quantize_row_q8_K) convert floating-point data into quantized block formats. For one 32-element q8_0 block, the pattern is:

Load eight groups of four floats into float32x4_t vectors
Compute the absolute maximum via tree reduction using vabsq_f32 and vmaxq_f32, then extract a scalar max with vmaxvq_f32
Derive a scale factor d = amax / 127 and its inverse
Multiply each element by the inverse scale and round to integer via vcvtnq_s32_f32
Store the packed int8 results into the output block
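
The steps above can be sketched in portable scalar C. This is a minimal illustration, not the NEON code from the file: the names prefixed with sketch_ are hypothetical, the block stores its scale as a plain float (the real block_q8_0 is assumed to use a 16-bit half float), and rounding is half-away-from-zero, which can differ from vcvtnq_s32_f32 on exact ties.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_QK8_0 32  /* elements per block, matching q8_0 */

/* Hypothetical simplified block; the scale is kept as float for clarity. */
typedef struct {
    float  d;                 /* per-block scale */
    int8_t qs[SKETCH_QK8_0];  /* quantized values */
} sketch_block_q8_0;

static void sketch_quantize_row_q8_0(const float *x, sketch_block_q8_0 *y, int64_t k) {
    assert(k % SKETCH_QK8_0 == 0);
    const int64_t nb = k / SKETCH_QK8_0;

    for (int64_t i = 0; i < nb; i++) {
        const float *xb = x + i * SKETCH_QK8_0;

        /* Absolute maximum of the block (the NEON path reduces this in vectors). */
        float amax = 0.0f;
        for (int j = 0; j < SKETCH_QK8_0; j++) {
            const float v = xb[j] < 0.0f ? -xb[j] : xb[j];
            if (v > amax) amax = v;
        }

        /* Scale and inverse scale; an all-zero block gets id = 0. */
        const float d  = amax / 127.0f;
        const float id = d != 0.0f ? 1.0f / d : 0.0f;
        y[i].d = d;

        /* Scale, round half away from zero, and narrow to int8. */
        for (int j = 0; j < SKETCH_QK8_0; j++) {
            const float v = xb[j] * id;
            y[i].qs[j] = (int8_t)(int)(v + (v >= 0.0f ? 0.5f : -0.5f));
        }
    }
}
```

Because the largest magnitude in a block maps to +/-127, every quantized value fits in int8 without clamping.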

Dot product functions (ggml_vec_dot_q4_0_q8_0, ggml_vec_dot_q4_1_q8_1, ggml_vec_dot_q5_0_q8_0, ggml_vec_dot_q5_1_q8_1, ggml_vec_dot_q8_0_q8_0, and the K-quant variants ggml_vec_dot_q2_K_q8_K through ggml_vec_dot_q6_K_q8_K) perform quantized inner products between weight and activation blocks using NEON multiply-accumulate operations.
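
As a rough picture of what such a kernel computes, the scalar sketch below forms the q4_0 x q8_0 inner product block by block. It is an illustration under assumptions, not the NEON implementation: the sketch_ types are hypothetical, scales are stored as float rather than half floats, and the nibble layout (low nibble is element j, high nibble is element j + 16, both offset by 8) is assumed from the q4_0 format.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_QK 32  /* elements per block for both formats */

/* Hypothetical simplified blocks with float scales. */
typedef struct { float d; uint8_t qs[SKETCH_QK / 2]; } sketch_block_q4_0;
typedef struct { float d; int8_t  qs[SKETCH_QK];     } sketch_block_q8_0;

static float sketch_vec_dot_q4_0_q8_0(int n,
                                      const sketch_block_q4_0 *x,
                                      const sketch_block_q8_0 *y) {
    assert(n % SKETCH_QK == 0);
    const int nb = n / SKETCH_QK;

    float sumf = 0.0f;
    for (int ib = 0; ib < nb; ib++) {
        int sumi = 0;  /* integer accumulator for one block */
        for (int j = 0; j < SKETCH_QK / 2; j++) {
            /* Unpack two 4-bit values and re-center them around zero. */
            const int v0 = (x[ib].qs[j] & 0x0F) - 8;
            const int v1 = (x[ib].qs[j] >> 4)   - 8;
            sumi += v0 * y[ib].qs[j] + v1 * y[ib].qs[j + SKETCH_QK / 2];
        }
        /* Apply both block scales once per block. */
        sumf += (float) sumi * x[ib].d * y[ib].d;
    }
    return sumf;
}
```

Keeping the inner sum in integers and applying the two scales once per block is what lets a SIMD path use integer multiply-accumulate instructions for the bulk of the work.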

IQ (importance quantization) dot products (ggml_vec_dot_iq2_xxs_q8_K, ggml_vec_dot_iq2_xs_q8_K, ggml_vec_dot_iq2_s_q8_K, ggml_vec_dot_iq3_xxs_q8_K, and others) support the newer importance-weighted quantization formats.

Precomputed lookup tables (table_b2b_0, table_b2b_1) expand 8-bit patterns to 8-byte vectors for efficient unpacking of sub-byte quantized formats. All NEON paths are guarded by #if defined(__ARM_NEON) and fall back to scalar reference implementations when NEON is unavailable.
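
The bit-to-byte expansion can be illustrated by building equivalent tables at run time. This sketch assumes that bit i of the index lands in byte i of the 64-bit entry as the value 0x10 (the bit shifted into position 4), and that the second table holds the same expansion of the inverted bits; the function names are hypothetical, and the real tables are compile-time constants.

```c
#include <assert.h>
#include <stdint.h>

/* Expand each bit of `bits` into one byte lane holding (bit << 4). */
static uint64_t sketch_expand_bits(uint8_t bits) {
    uint64_t out = 0;
    for (int i = 0; i < 8; i++) {
        const uint64_t b = (bits >> i) & 1u;
        out |= (b << 4) << (8 * i);
    }
    return out;
}

/* Fill two 256-entry tables: b2b_0 expands b, b2b_1 expands !b (assumed layout). */
static void sketch_build_b2b(uint64_t *b2b_0, uint64_t *b2b_1, int n) {
    assert(n == 256);
    for (int b = 0; b < n; b++) {
        b2b_0[b] = sketch_expand_bits((uint8_t) b);
        b2b_1[b] = sketch_expand_bits((uint8_t) ~b);
    }
}
```

With tables like these, one lookup per byte of packed high bits yields eight byte lanes that can be combined with the unpacked low nibbles, instead of shifting and masking each bit separately.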

Usage

This file is compiled as part of the GGML CPU backend when targeting ARM platforms (AArch64 with NEON). It is used automatically by the quantization and inference pipeline -- callers invoke the generic function names (e.g., quantize_row_q8_0) and the build system selects this architecture-specific implementation.
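
The guard-and-fallback arrangement can be pictured with a single helper that compiles to NEON where available and to scalar code elsewhere. The helper name sketch_abs_max is hypothetical, and the extra __aarch64__ check is an assumption added here because vmaxvq_f32 is only available on AArch64.

```c
#include <assert.h>

#if defined(__ARM_NEON) && defined(__aarch64__)
#include <arm_neon.h>
#endif

/* Absolute maximum of n floats; n must be a positive multiple of 4. */
static float sketch_abs_max(const float *x, int n) {
    assert(n > 0 && n % 4 == 0);
#if defined(__ARM_NEON) && defined(__aarch64__)
    /* NEON path: running lane-wise max of |x|, then one horizontal reduction. */
    float32x4_t m = vdupq_n_f32(0.0f);
    for (int j = 0; j < n; j += 4) {
        m = vmaxq_f32(m, vabsq_f32(vld1q_f32(x + j)));
    }
    return vmaxvq_f32(m);
#else
    /* Scalar fallback with the same result. */
    float m = 0.0f;
    for (int j = 0; j < n; j++) {
        const float v = x[j] < 0.0f ? -x[j] : x[j];
        if (v > m) m = v;
    }
    return m;
#endif
}
```

Callers use the same function name on every platform; only the body selected by the preprocessor differs.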

Code Reference

Source Location

GGML repo, file: src/ggml-cpu/arch/arm/quants.c (4052 lines).

Key Signatures

void quantize_row_q8_0(const float * GGML_RESTRICT x, void * GGML_RESTRICT vy, int64_t k);
void quantize_row_q8_1(const float * GGML_RESTRICT x, void * GGML_RESTRICT vy, int64_t k);
void quantize_row_q8_K(const float * GGML_RESTRICT x, void * GGML_RESTRICT y, int64_t k);

void ggml_vec_dot_q4_0_q8_0(int n, float * GGML_RESTRICT s, size_t bs,
    const void * GGML_RESTRICT vx, size_t bx,
    const void * GGML_RESTRICT vy, size_t by, int nrc);
void ggml_vec_dot_q8_0_q8_0(int n, float * GGML_RESTRICT s, size_t bs,
    const void * GGML_RESTRICT vx, size_t bx,
    const void * GGML_RESTRICT vy, size_t by, int nrc);

Import

#include "ggml-quants.h"
#include "ggml-cpu.h"
#include "simd-mappings.h"

I/O Contract

Inputs (Quantization)

Parameter	Type	Description
`x`	`const float *`	Source array of floating-point values to be quantized. Must contain at least `k` elements.
`k`	`int64_t`	Number of elements to quantize. Must be a multiple of the block size (32 for q8_0/q8_1, 256 for q8_K).

Outputs (Quantization)

Output	Type	Description
`vy` / `y`	`void *`	Destination buffer for the quantized block data. Must be pre-allocated with sufficient space for `k / block_size` blocks.
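
The sizing rule for the destination buffer can be written as a small helper. This is a sketch with a hypothetical name; in real code the per-block byte count would come from sizeof on the actual block type (for example sizeof(block_q8_0)) rather than a literal.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bytes needed to hold k elements quantized into fixed-size blocks. */
static size_t sketch_row_bytes(int64_t k, int64_t block_elems, size_t block_bytes) {
    /* k must be a whole number of blocks, as the quantizers require. */
    assert(k >= 0 && block_elems > 0 && k % block_elems == 0);
    return (size_t)(k / block_elems) * block_bytes;
}
```

For instance, 256 elements in 32-element blocks need 8 blocks, so the buffer must hold 8 times the block size in bytes.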

Inputs (Dot Product)

Parameter	Type	Description
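`n`	`int`	Number of elements in each input vector (must be a multiple of the block size).
`bs`	`size_t`	Stride between result rows in `s`. Only used when `nrc` > 1.
`vx`	`const void *`	Pointer to the first quantized input vector (weights).
`bx`	`size_t`	Stride in bytes between rows of `vx`. Only used when `nrc` > 1.
`vy`	`const void *`	Pointer to the second quantized input vector (activations).
`by`	`size_t`	Stride in bytes between rows of `vy`. Only used when `nrc` > 1.
`nrc`	`int`	Number of rows to compute simultaneously (for batched operations).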

Outputs (Dot Product)

Output	Type	Description
`s`	`float *`	Destination for the computed dot product result(s).

Usage Examples

// Quantize a row of 256 floats to q8_0 format using ARM NEON
float input[256];
block_q8_0 output[256 / QK8_0];  // QK8_0 == 32, so 8 blocks

// Fill input with data...
quantize_row_q8_0(input, output, 256);

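// Compute dot product between q4_0 weights and q8_0 activations
block_q4_0 weight_blocks[256 / QK4_0];      // 8 blocks of q4_0 weights, filled elsewhere
block_q8_0 activation_blocks[256 / QK8_0];  // 8 blocks of q8_0 activations, filled elsewhere
float result;
// With nrc == 1 a single row pair is processed and the stride arguments (bs, bx, by) are not used
ggml_vec_dot_q4_0_q8_0(256, &result, sizeof(result),
    weight_blocks, sizeof(block_q4_0),
    activation_blocks, sizeof(block_q8_0), 1);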

Related Pages

Principle:Ggml_org_Ggml_Architecture_Specific_SIMD_Quantization
Implementation:Ggml_org_Ggml_Cpu_arm_repack -- ARM NEON matrix repacking and GEMM/GEMV kernels
Implementation:Ggml_org_Ggml_Cpu_x86_quants -- x86 SSE/AVX equivalent
Implementation:Ggml_org_Ggml_Cpu_loongarch_quants -- LoongArch LSX equivalent
Implementation:Ggml_org_Ggml_Cpu_powerpc_quants -- PowerPC VSX equivalent
Implementation:Ggml_org_Ggml_Cpu_riscv_quants -- RISC-V RVV equivalent
Implementation:Ggml_org_Ggml_Cpu_s390_quants -- s390x VXE equivalent
Implementation:Ggml_org_Ggml_Cpu_wasm_quants -- WebAssembly SIMD128 equivalent
