Implementation:Ggml_org_Ggml_Cpu_arm_repack

Metadata

Field	Value
Page Type	Implementation (Architecture-Specific SIMD)
Knowledge Sources	GGML
Domains	ML_Infrastructure, Tensor_Computing, SIMD_Optimization
Last Updated	2025-05-15 12:00 GMT

Overview

ARM NEON-optimized matrix repacking, quantized GEMM, and GEMV kernels for interleaved block formats used in high-performance matrix multiplication on AArch64.

Description

arch/arm/repack.cpp provides ARM NEON implementations for three categories of operations that support optimized quantized matrix multiplication:

Matrix repacking functions (ggml_quantize_mat_q8_0_4x4, ggml_quantize_mat_q8_0_4x8) quantize several rows of floating-point data at once into an interleaved block format (e.g., block_q8_0x4). Interleaved layouts place short runs of elements from 4 or 8 adjacent rows into a single block, enabling more efficient SIMD processing during matrix multiplication. The quantization uses the same NEON max-finding and rounding pattern as the standard quantization functions.

GEMV (matrix-vector) kernels (ggml_gemv_q4_0_4x4_q8_0, ggml_gemv_q4_0_4x8_q8_0, ggml_gemv_q4_0_8x8_q8_0, ggml_gemv_iq4_nl_4x4_q8_0, ggml_gemv_q4_K_8x4_q8_K, ggml_gemv_q4_K_8x8_q8_K, ggml_gemv_q5_K_8x8_q8_K, ggml_gemv_q6_K_8x8_q8_K, ggml_gemv_q8_0_4x4_q8_0, ggml_gemv_q8_0_4x8_q8_0) perform matrix-vector products using NEON dotprod or I8MM instructions on interleaved weight blocks.

GEMM (matrix-matrix) kernels (ggml_gemm_q4_0_4x4_q8_0, ggml_gemm_q4_0_4x8_q8_0, ggml_gemm_q4_0_8x8_q8_0, ggml_gemm_iq4_nl_4x4_q8_0, ggml_gemm_q4_K_8x4_q8_K, ggml_gemm_q4_K_8x8_q8_K, ggml_gemm_q5_K_8x8_q8_K, ggml_gemm_q6_K_8x8_q8_K) perform matrix-matrix products with the same interleaved formats.
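As a rough illustration of the dot-product building block these kernels rely on, the sketch below models in scalar C++ what a single 128-bit NEON SDOT operation (the `vdotq_s32` intrinsic) computes. The function name `sdot_reference` is hypothetical and does not appear in repack.cpp; it is only meant to show the arithmetic, not the kernels' actual structure.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical scalar model of one 128-bit SDOT (vdotq_s32) operation:
// each of the 4 int32 lanes accumulates the dot product of the 4 adjacent
// int8 pairs that belong to that lane.
std::array<int32_t, 4> sdot_reference(std::array<int32_t, 4> acc,
                                      const std::array<int8_t, 16> & a,
                                      const std::array<int8_t, 16> & b) {
    for (int lane = 0; lane < 4; ++lane) {
        for (int j = 0; j < 4; ++j) {
            acc[lane] += int32_t(a[4 * lane + j]) * int32_t(b[4 * lane + j]);
        }
    }
    return acc;
}
```

If the 16 bytes of `a` hold 4 consecutive values from each of 4 different rows, as a 4x4 interleaved layout would suggest, each lane ends up with a partial sum for a different row.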

A helper function decode_q_Kx8_6bit_scales decodes packed 6-bit scale and minimum values from Q4_K and Q5_K block formats for use in the K-quant GEMV/GEMM kernels. The file requires __aarch64__ and __ARM_NEON, with some functions further gated on __ARM_FEATURE_MATMUL_INT8 or __ARM_FEATURE_DOTPROD.
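To make the 6-bit packing concrete, here is a minimal scalar sketch of how 8 scales and 8 minimums are stored in the 12-byte field of a standard (non-interleaved) Q4_K or Q5_K block. It is not the body of decode_q_Kx8_6bit_scales, which operates on the 8-row interleaved variant; the names `unpack_k_scale_min` and `pack_k_scales_mins` are hypothetical, and the packing function exists only so the round trip can be checked.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: decode scale and minimum number j (0..7) from the
// 12-byte packed field of a standard, non-interleaved K-quant block.
inline void unpack_k_scale_min(int j, const uint8_t * q, uint8_t * sc, uint8_t * mn) {
    assert(j >= 0 && j < 8);
    if (j < 4) {
        // Entries 0..3 sit in the low 6 bits of bytes 0..3 (scales) and 4..7 (mins).
        *sc = q[j] & 63;
        *mn = q[j + 4] & 63;
    } else {
        // Entries 4..7: low 4 bits come from bytes 8..11, the high 2 bits
        // from the top 2 bits of bytes 0..3 (scales) and 4..7 (mins).
        *sc = uint8_t((q[j + 4] & 0x0F) | ((q[j - 4] >> 6) << 4));
        *mn = uint8_t((q[j + 4] >> 4)   | ((q[j]     >> 6) << 4));
    }
}

// Hypothetical inverse of the decoder above, used only for round-trip checks.
// All inputs must be 6-bit values (0..63).
inline void pack_k_scales_mins(const uint8_t sc[8], const uint8_t mn[8], uint8_t q[12]) {
    for (int i = 0; i < 4; ++i) {
        q[i]     = uint8_t((sc[i] & 63) | ((sc[i + 4] >> 4) << 6));
        q[i + 4] = uint8_t((mn[i] & 63) | ((mn[i + 4] >> 4) << 6));
        q[i + 8] = uint8_t((sc[i + 4] & 0x0F) | ((mn[i + 4] & 0x0F) << 4));
    }
}
```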

Usage

This file is compiled when the GGML CPU backend targets AArch64 with NEON. The repacking functions are called during model loading or weight preparation, while the GEMV/GEMM kernels are invoked during inference whenever the scheduler selects the interleaved matrix multiplication path.
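The compile-time gating can be pictured with the feature macros named in the Description. The sketch below is an assumption about how such gating is typically layered, not a copy of the file's preprocessor structure; the function `repack_kernel_path` and the returned labels are hypothetical.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical helper: report which class of code path the preprocessor
// would select for the current compilation target.
inline const char * repack_kernel_path() {
#if defined(__aarch64__) && defined(__ARM_NEON) && defined(__ARM_FEATURE_MATMUL_INT8)
    return "neon+i8mm";      // int8 matrix-multiply instructions available
#elif defined(__aarch64__) && defined(__ARM_NEON) && defined(__ARM_FEATURE_DOTPROD)
    return "neon+dotprod";   // int8 dot-product instructions available
#elif defined(__aarch64__) && defined(__ARM_NEON)
    return "neon";           // baseline NEON only
#else
    return "generic";        // illustrative label for a non-AArch64 build
#endif
}
```

Because the selection happens in the preprocessor, a binary built without the relevant `-march` features would not contain the faster paths at all.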

Code Reference

Source Location

GGML repo, file: src/ggml-cpu/arch/arm/repack.cpp (3847 lines).

Key Signatures

void ggml_quantize_mat_q8_0_4x4(const float * GGML_RESTRICT x, void * GGML_RESTRICT vy, int64_t k);
void ggml_quantize_mat_q8_0_4x8(const float * GGML_RESTRICT x, void * GGML_RESTRICT vy, int64_t k);

void ggml_gemv_q4_0_4x4_q8_0(int n, float * GGML_RESTRICT s, size_t bs,
    const void * GGML_RESTRICT vx, const void * GGML_RESTRICT vy, int nr, int nc);
void ggml_gemm_q4_0_4x4_q8_0(int n, float * GGML_RESTRICT s, size_t bs,
    const void * GGML_RESTRICT vx, const void * GGML_RESTRICT vy, int nr, int nc);

Import

#include "ggml-common.h"
#include "ggml-backend-impl.h"
#include "ggml-cpu.h"
#include "../../repack.h"

I/O Contract

Inputs (Repacking)

Parameter	Type	Description
`x`	`const float *`	Source floating-point data laid out as consecutive rows of `k` elements each. The function reads several adjacent rows at once (4 for the `block_q8_0x4` output).
`k`	`int64_t`	Number of elements per row. Must be a multiple of the block size (32 for q8_0).

Outputs (Repacking)

Output	Type	Description
`vy`	`void *`	Destination buffer for interleaved quantized blocks (e.g., `block_q8_0x4`).
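The repacking contract above can be sketched in scalar code. This is a simplified model under stated assumptions: the `4x4` and `4x8` suffixes are taken to mean 4 rows interleaved in chunks of 4 or 8 values, the per-row scale is stored as `float` rather than half precision, and the names `BlockQ8x4Sketch` and `quantize_4rows_sketch` are hypothetical stand-ins, not GGML symbols.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

constexpr int kBlock = 32;  // q8_0 block size (QK8_0)

// Hypothetical stand-in for block_q8_0x4: one scale per row plus the
// interleaved int8 values of 4 rows. Scales are simplified to float here.
struct BlockQ8x4Sketch {
    float  d[4];
    int8_t qs[kBlock * 4];
};

// Scalar sketch: quantize 4 rows of k floats each and interleave them in
// chunks of `interleave` values (4 or 8) per row.
std::vector<BlockQ8x4Sketch> quantize_4rows_sketch(const float * x, int64_t k, int interleave) {
    assert(k % kBlock == 0);
    assert(interleave == 4 || interleave == 8);
    const int64_t nb = k / kBlock;
    std::vector<BlockQ8x4Sketch> out(nb);
    for (int64_t b = 0; b < nb; ++b) {
        float inv[4];
        for (int r = 0; r < 4; ++r) {
            // Per-row, per-block scale: largest magnitude maps to 127.
            const float * src = x + r * k + b * kBlock;
            float amax = 0.0f;
            for (int i = 0; i < kBlock; ++i) amax = std::fmax(amax, std::fabs(src[i]));
            const float d = amax / 127.0f;
            out[b].d[r] = d;
            inv[r] = d != 0.0f ? 1.0f / d : 0.0f;
        }
        for (int j = 0; j < kBlock * 4; ++j) {
            const int chunk = j / (4 * interleave);               // which group of 4 row-chunks
            const int row   = (j % (4 * interleave)) / interleave; // source row within the group
            const int col   = chunk * interleave + j % interleave; // column within the 32-value block
            out[b].qs[j] = (int8_t) std::lround(x[row * k + b * kBlock + col] * inv[row]);
        }
    }
    return out;
}
```

With `interleave == 4`, the first 16 output bytes hold columns 0..3 of rows 0, 1, 2, and 3 in turn, followed by columns 4..7 of each row, and so on.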

Inputs (GEMV / GEMM)

Parameter	Type	Description
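`n`	`int`	Inner dimension size (number of elements per dot product).
`bs`	`size_t`	Stride of the output buffer `s` between consecutive output rows, counted in `float` elements.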
`vx`	`const void *`	Pointer to the interleaved quantized weight matrix.
`vy`	`const void *`	Pointer to the quantized activation vector or matrix.
`nr`	`int`	Number of rows in the output.
`nc`	`int`	Number of columns in the output.

Outputs (GEMV / GEMM)

Output	Type	Description
`s`	`float *`	Destination buffer for the floating-point result matrix/vector.
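The parameter roles in this contract can be illustrated with an unquantized reference. The sketch below works on plain `float` data and assumes that `bs` counts `float` elements between consecutive output rows, that weights are stored as `nc` rows of `n` values, and that activations are stored as `nr` rows of `n` values. The real kernels instead consume interleaved quantized blocks; `gemm_reference` is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical float reference for the GEMV/GEMM contract:
// s[r * bs + c] = dot(activation row r, weight row c).
void gemm_reference(int n, float * s, std::size_t bs,
                    const float * w, const float * a, int nr, int nc) {
    for (int r = 0; r < nr; ++r) {
        for (int c = 0; c < nc; ++c) {
            float sum = 0.0f;
            for (int i = 0; i < n; ++i) {
                sum += a[r * n + i] * w[c * n + i];
            }
            s[r * bs + c] = sum;  // bs may exceed nc when the output is padded
        }
    }
}
```

A GEMV call corresponds to `nr == 1`, in which case only the first output row is written.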

Usage Examples

// Repack 4 rows of float data into interleaved q8_0x4 blocks for ARM NEON
float rows[4 * 256];  // 4 rows of 256 elements each
block_q8_0x4 repacked[256 / QK8_0];  // interleaved output

ggml_quantize_mat_q8_0_4x4(rows, repacked, 256);

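// Perform GEMV with interleaved q4_0 weights and q8_0 activations.
// interleaved_weights and quantized_activations are placeholders for buffers
// prepared elsewhere (repacked q4_0 weights and q8_0-quantized activations).
float output[4];
// The third argument (bs) is the output row stride in floats, here equal to nc.
ggml_gemv_q4_0_4x4_q8_0(256, output, 4,
    interleaved_weights, quantized_activations, 1, 4);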

Related Pages

Principle:Ggml_org_Ggml_Architecture_Specific_SIMD_Quantization
Implementation:Ggml_org_Ggml_Cpu_arm_quants -- ARM NEON quantization and dot product routines
Implementation:Ggml_org_Ggml_Cpu_x86_repack -- x86 AVX/AVX-512 equivalent
Implementation:Ggml_org_Ggml_Cpu_riscv_repack -- RISC-V RVV equivalent
