Implementation:Ggml org Ggml Cpu arm quants

From Leeroopedia


Metadata

Field | Value
Page Type | Implementation (Architecture-Specific SIMD)
Knowledge Sources | GGML
Domains | ML_Infrastructure, Tensor_Computing, SIMD_Optimization
Last Updated | 2025-05-15 12:00 GMT

Overview

ARM NEON SIMD-optimized quantization, dequantization, and dot product routines for all GGML quantized tensor formats on AArch64 processors.

Description

arch/arm/quants.c implements the ARM NEON accelerated path for the GGML quantization operations. The quantization functions work block by block (32 elements for q8_0 and q8_1, 256 elements for q8_K) and use NEON intrinsics to achieve high throughput on ARM devices.

The implementation covers three categories of operations:

Quantization functions (quantize_row_q8_0, quantize_row_q8_1, quantize_row_q8_K) convert floating-point data into quantized block formats. The pattern, described here for one 32-element block, is:

  1. Load eight groups of four floats into float32x4_t vectors
  2. Compute the absolute maximum via tree reduction using vabsq_f32 and vmaxq_f32, then extract a scalar max with vmaxvq_f32
  3. Derive a scale factor d = amax / 127 and its inverse (taken as 0 when d is 0, so an all-zero block does not divide by zero)
  4. Multiply each element by the inverse scale and round to integer via vcvtnq_s32_f32
  5. Store the packed int8 results into the output block
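
The five steps above can be sketched as follows. This is a minimal illustration rather than the code in quants.c: the names sketch_block_q8 and sketch_quantize_row_q8 are hypothetical, and the scale is kept as a plain float, whereas ggml's block_q8_0 stores it in half precision. The NEON branch is guarded so that the same file also builds on targets without NEON, where a scalar loop is used instead. Note that the scalar branch rounds halves away from zero while vcvtnq_s32_f32 rounds halves to even, so the two branches can differ on exact ties.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__ARM_NEON) && defined(__aarch64__)
#include <arm_neon.h>
#endif

#define SKETCH_QK 32  /* elements per block, same value as QK8_0 */

/* Hypothetical layout: ggml's block_q8_0 stores d as fp16, not float. */
typedef struct {
    float  d;              /* per-block scale */
    int8_t qs[SKETCH_QK];  /* quantized values */
} sketch_block_q8;

void sketch_quantize_row_q8(const float *x, sketch_block_q8 *y, int64_t k) {
    assert(k % SKETCH_QK == 0);
    const int64_t nb = k / SKETCH_QK;

    for (int64_t i = 0; i < nb; i++) {
        const float *xb = x + i * SKETCH_QK;
#if defined(__ARM_NEON) && defined(__aarch64__)
        float32x4_t srcv[8];
        float32x4_t amaxv[8];

        /* Step 1: load eight vectors of four floats and take absolute values. */
        for (int j = 0; j < 8; j++) {
            srcv[j]  = vld1q_f32(xb + 4 * j);
            amaxv[j] = vabsq_f32(srcv[j]);
        }
        /* Step 2: tree reduction 8 -> 4 -> 2 -> 1 vectors, then horizontal max. */
        for (int j = 0; j < 4; j++) amaxv[2 * j] = vmaxq_f32(amaxv[2 * j], amaxv[2 * j + 1]);
        for (int j = 0; j < 2; j++) amaxv[4 * j] = vmaxq_f32(amaxv[4 * j], amaxv[4 * j + 2]);
        amaxv[0] = vmaxq_f32(amaxv[0], amaxv[4]);
        const float amax = vmaxvq_f32(amaxv[0]);

        /* Step 3: scale and guarded inverse. */
        const float d  = amax / 127.0f;
        const float id = d != 0.0f ? 1.0f / d : 0.0f;
        y[i].d = d;

        /* Steps 4 and 5: scale, round to nearest integer, store as int8. */
        for (int j = 0; j < 8; j++) {
            const int32x4_t vi = vcvtnq_s32_f32(vmulq_n_f32(srcv[j], id));
            y[i].qs[4 * j + 0] = (int8_t) vgetq_lane_s32(vi, 0);
            y[i].qs[4 * j + 1] = (int8_t) vgetq_lane_s32(vi, 1);
            y[i].qs[4 * j + 2] = (int8_t) vgetq_lane_s32(vi, 2);
            y[i].qs[4 * j + 3] = (int8_t) vgetq_lane_s32(vi, 3);
        }
#else
        /* Scalar fallback: same steps, one element at a time. */
        float amax = 0.0f;
        for (int j = 0; j < SKETCH_QK; j++) {
            const float a = xb[j] < 0.0f ? -xb[j] : xb[j];
            if (a > amax) amax = a;
        }
        const float d  = amax / 127.0f;
        const float id = d != 0.0f ? 1.0f / d : 0.0f;
        y[i].d = d;
        for (int j = 0; j < SKETCH_QK; j++) {
            const float v = xb[j] * id;
            y[i].qs[j] = (int8_t) (v + (v >= 0.0f ? 0.5f : -0.5f));
        }
#endif
    }
}
```

Dividing by 127 maps the largest magnitude in the block onto the edge of the int8 range, so every other element lands inside it without clamping.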

Dot product functions (ggml_vec_dot_q4_0_q8_0, ggml_vec_dot_q4_1_q8_1, ggml_vec_dot_q5_0_q8_0, ggml_vec_dot_q5_1_q8_1, ggml_vec_dot_q8_0_q8_0, and the K-quant variants ggml_vec_dot_q2_K_q8_K through ggml_vec_dot_q6_K_q8_K) perform quantized inner products between weight and activation blocks using NEON multiply-accumulate operations.

IQ (importance quantization) dot products (ggml_vec_dot_iq2_xxs_q8_K, ggml_vec_dot_iq2_xs_q8_K, ggml_vec_dot_iq2_s_q8_K, ggml_vec_dot_iq3_xxs_q8_K, and others) support the newer importance-weighted quantization formats.
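
To make the dot product category more concrete, here is a scalar sketch of a q4_0 by q8_0 inner product. It is an illustration under stated assumptions, not the NEON code: the struct and function names are hypothetical, the scales are plain floats instead of half precision, and the stride and nrc parameters of the real signature are left out. It assumes the q4_0 packing in which byte j holds element j in its low nibble and element j + 16 in its high nibble, each stored with an offset of 8.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_QK 32  /* elements per block */

/* Hypothetical layouts with float scales; ggml stores the scale as fp16. */
typedef struct {
    float   d;                  /* per-block scale */
    uint8_t qs[SKETCH_QK / 2];  /* two 4-bit values per byte */
} sketch_block_q4;

typedef struct {
    float  d;                   /* per-block scale */
    int8_t qs[SKETCH_QK];       /* 8-bit values */
} sketch_block_q8;

float sketch_vec_dot_q4_q8(int n, const sketch_block_q4 *x, const sketch_block_q8 *y) {
    assert(n % SKETCH_QK == 0);
    const int nb = n / SKETCH_QK;
    float sumf = 0.0f;

    for (int i = 0; i < nb; i++) {
        int32_t sumi = 0;  /* integer accumulator for one block */
        for (int j = 0; j < SKETCH_QK / 2; j++) {
            /* Unpack two 4-bit weights and remove the offset of 8. */
            const int v0 = (x[i].qs[j] & 0x0F) - 8;
            const int v1 = (x[i].qs[j] >> 4) - 8;
            sumi += v0 * y[i].qs[j] + v1 * y[i].qs[j + SKETCH_QK / 2];
        }
        /* Apply both block scales once per block, not once per element. */
        sumf += (float) sumi * x[i].d * y[i].d;
    }
    return sumf;
}
```

Keeping the inner accumulation in integers and applying the two scales once per block is what lets a SIMD path replace the inner loop with vector multiply-accumulate instructions.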

Precomputed lookup tables (table_b2b_0, table_b2b_1) expand 8-bit patterns to 8-byte vectors for efficient unpacking of sub-byte quantized formats. All NEON paths are guarded by #if defined(__ARM_NEON) and fall back to scalar reference implementations when NEON is unavailable.
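
The idea behind such a bit-to-byte table can be sketched as below. The source file uses precomputed constant tables; the runtime builder sketch_build_b2b and the array sketch_b2b are hypothetical names used only for illustration. The sketch assumes that a set bit is expanded to the byte value 0x10 (the bit moved to position 4, just above a 4-bit nibble) and that bit i maps to byte i of the 64-bit entry; both details should be checked against the actual tables.

```c
#include <stdint.h>

/* Hypothetical runtime-built table: one 64-bit entry per 8-bit pattern. */
static uint64_t sketch_b2b[256];

void sketch_build_b2b(void) {
    for (int v = 0; v < 256; v++) {
        uint64_t packed = 0;
        for (int bit = 0; bit < 8; bit++) {
            if (v & (1 << bit)) {
                /* Bit `bit` of the pattern becomes byte `bit` of the entry. */
                packed |= (uint64_t) 0x10 << (8 * bit);
            }
        }
        sketch_b2b[v] = packed;
    }
}
```

With a table like this, eight packed high bits turn into eight ready-to-OR bytes with a single load, instead of eight shift-and-mask operations.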

Usage

This file is compiled as part of the GGML CPU backend when targeting ARM platforms (AArch64 with NEON). The quantization and inference pipeline uses it automatically: callers invoke the generic function names (e.g., quantize_row_q8_0) and the build system selects this architecture-specific implementation.

Code Reference

Source Location

GGML repo, file: src/ggml-cpu/arch/arm/quants.c (4052 lines).

Key Signatures

void quantize_row_q8_0(const float * GGML_RESTRICT x, void * GGML_RESTRICT vy, int64_t k);
void quantize_row_q8_1(const float * GGML_RESTRICT x, void * GGML_RESTRICT vy, int64_t k);
void quantize_row_q8_K(const float * GGML_RESTRICT x, void * GGML_RESTRICT y, int64_t k);

void ggml_vec_dot_q4_0_q8_0(int n, float * GGML_RESTRICT s, size_t bs,
    const void * GGML_RESTRICT vx, size_t bx,
    const void * GGML_RESTRICT vy, size_t by, int nrc);
void ggml_vec_dot_q8_0_q8_0(int n, float * GGML_RESTRICT s, size_t bs,
    const void * GGML_RESTRICT vx, size_t bx,
    const void * GGML_RESTRICT vy, size_t by, int nrc);

Import

#include "ggml-quants.h"
#include "ggml-cpu.h"
#include "simd-mappings.h"

I/O Contract

Inputs (Quantization)

Parameter | Type | Description
x | const float * | Source array of floating-point values to be quantized. Must contain at least k elements.
k | int64_t | Number of elements to quantize. Must be a multiple of the block size (32 for q8_0/q8_1, 256 for q8_K).

Outputs (Quantization)

Output | Type | Description
vy / y | void * | Destination buffer for the quantized block data. Must be pre-allocated with sufficient space for k / block_size blocks.
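
As a rough guide to sizing the destination buffer, the sketch below computes the byte count for a q8_0 row. The struct sketch_block_q8_0 is a hypothetical stand-in that assumes a block consists of one 16-bit scale followed by 32 int8 values; real code should use sizeof(block_q8_0) from the GGML headers rather than this stand-in.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_QK8_0 32  /* elements per q8_0 block */

/* Hypothetical stand-in for block_q8_0: 16-bit scale plus 32 int8 values. */
typedef struct {
    uint16_t d;
    int8_t   qs[SKETCH_QK8_0];
} sketch_block_q8_0;

/* Bytes needed to hold k quantized elements; k must be a whole number of blocks. */
size_t sketch_q8_0_row_bytes(int64_t k) {
    assert(k >= 0 && k % SKETCH_QK8_0 == 0);
    return (size_t) (k / SKETCH_QK8_0) * sizeof(sketch_block_q8_0);
}
```

Under these assumptions a row of 256 floats (1024 bytes) quantizes into 8 blocks.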

Inputs (Dot Product)

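Parameter | Type | Description
n | int | Number of elements in each input vector (must be a multiple of the block size).
bs | size_t | Stride between consecutive results in s. Only relevant when nrc > 1.
vx | const void * | Pointer to the first quantized input vector (weights).
bx | size_t | Stride between consecutive rows of vx. Only relevant when nrc > 1.
vy | const void * | Pointer to the second quantized input vector (activations).
by | size_t | Stride between consecutive rows of vy. Only relevant when nrc > 1.
nrc | int | Number of rows to compute simultaneously (for batched operations).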

Outputs (Dot Product)

Output | Type | Description
s | float * | Destination for the computed dot product result(s).

Usage Examples

// Quantize a row of 256 floats to q8_0 format using ARM NEON
float input[256];
block_q8_0 output[256 / QK8_0];  // QK8_0 == 32, so 8 blocks

// Fill input with data...
quantize_row_q8_0(input, output, 256);

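// Compute the dot product between q4_0 weights and q8_0 activations.
// weight_blocks: 256 / QK4_0 block_q4_0 entries prepared elsewhere.
// activation_blocks: 256 / QK8_0 block_q8_0 entries (e.g. output above).
// The strides bs, bx and by only matter when nrc > 1, so 0 is passed here.
float result;
ggml_vec_dot_q4_0_q8_0(256, &result, 0,
    weight_blocks, 0,
    activation_blocks, 0, 1);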
