Implementation:Ggml_org_Ggml_Cpu_arm_repack
Metadata
| Field | Value |
|---|---|
| Page Type | Implementation (Architecture-Specific SIMD) |
| Knowledge Sources | GGML |
| Domains | ML_Infrastructure, Tensor_Computing, SIMD_Optimization |
| Last Updated | 2025-05-15 12:00 GMT |
Overview
ARM NEON-optimized matrix repacking, quantized GEMM, and GEMV kernels for interleaved block formats used in high-performance matrix multiplication on AArch64.
Description
arch/arm/repack.cpp provides ARM NEON implementations for three categories of operations that support optimized quantized matrix multiplication, plus a scale-decoding helper:
Matrix repacking functions (ggml_quantize_mat_q8_0_4x4, ggml_quantize_mat_q8_0_4x8) quantize 4 rows of floating-point data at a time into an interleaved block format (block_q8_0x4). Each block holds the quantized values of 4 adjacent rows, taken in runs of 4 or 8 consecutive values per row (the second number in the function suffix), which enables more efficient SIMD processing during matrix multiplication. The quantization uses the same NEON max-finding and rounding pattern as the standard quantization functions.
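The interleaving can be sketched in portable scalar C++. This is a minimal illustration, not the NEON code from the file: the struct `RepackedQ8x4`, the function `sketch_quantize_mat_q8_4rows`, and the `ilv` parameter are hypothetical names, float scales stand in for the half-precision scales of the real block type, and the index arithmetic is an assumption about how runs of values from each row are laid out.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

constexpr int kBlockLen = 32; // values per row in one block (QK8_0 in the source)
constexpr int kRows     = 4;  // rows interleaved into one block

// Hypothetical stand-in for block_q8_0x4 (assumption: float scales for simplicity).
struct RepackedQ8x4 {
    float  d[kRows];              // one scale per row
    int8_t qs[kBlockLen * kRows]; // interleaved quantized values
};

// Quantize 4 consecutive rows of length k (row-major in x) into k / 32 blocks.
// ilv is the number of consecutive values taken from one row (4 or 8).
void sketch_quantize_mat_q8_4rows(const float * x, RepackedQ8x4 * y, int64_t k, int ilv) {
    const int64_t nb = k / kBlockLen;
    for (int64_t b = 0; b < nb; ++b) {
        float inv_d[kRows];
        for (int r = 0; r < kRows; ++r) {
            // Max-finding step: the largest magnitude in this row's 32 values.
            float amax = 0.0f;
            for (int e = 0; e < kBlockLen; ++e) {
                amax = std::max(amax, std::fabs(x[r * k + b * kBlockLen + e]));
            }
            const float d = amax / 127.0f;
            y[b].d[r] = d;
            inv_d[r]  = d != 0.0f ? 1.0f / d : 0.0f;
        }
        // Rounding step, writing values in row-interleaved order.
        for (int j = 0; j < kBlockLen * kRows; ++j) {
            const int group = j / (kRows * ilv);         // which run within the block
            const int row   = (j % (kRows * ilv)) / ilv; // which source row
            const int e     = group * ilv + j % ilv;     // element within that row
            y[b].qs[j] = (int8_t) std::lround(x[row * k + b * kBlockLen + e] * inv_d[row]);
        }
    }
}
```

With `ilv = 4`, the first 16 output bytes hold elements 0..3 of rows 0, 1, 2, and 3 in turn; with `ilv = 8`, each row contributes 8 values before the next row starts.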
GEMV (matrix-vector) kernels (ggml_gemv_q4_0_4x4_q8_0, ggml_gemv_q4_0_4x8_q8_0, ggml_gemv_q4_0_8x8_q8_0, ggml_gemv_iq4_nl_4x4_q8_0, ggml_gemv_q4_K_8x4_q8_K, ggml_gemv_q4_K_8x8_q8_K, ggml_gemv_q5_K_8x8_q8_K, ggml_gemv_q6_K_8x8_q8_K, ggml_gemv_q8_0_4x4_q8_0, ggml_gemv_q8_0_4x8_q8_0) perform matrix-vector products using NEON dotprod or I8MM instructions on interleaved weight blocks.
GEMM (matrix-matrix) kernels (ggml_gemm_q4_0_4x4_q8_0, ggml_gemm_q4_0_4x8_q8_0, ggml_gemm_q4_0_8x8_q8_0, ggml_gemm_iq4_nl_4x4_q8_0, ggml_gemm_q4_K_8x4_q8_K, ggml_gemm_q4_K_8x8_q8_K, ggml_gemm_q5_K_8x8_q8_K, ggml_gemm_q6_K_8x8_q8_K) perform matrix-matrix products with the same interleaved formats.
A helper function decode_q_Kx8_6bit_scales decodes packed 6-bit scale and minimum values from Q4_K and Q5_K block formats for use in the K-quant GEMV/GEMM kernels. The file requires __aarch64__ and __ARM_NEON, with some functions further gated on __ARM_FEATURE_MATMUL_INT8 or __ARM_FEATURE_DOTPROD.
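For background on the 6-bit packing, the sketch below unpacks scale/minimum pairs using the convention of the plain (non-interleaved) Q4_K and Q5_K blocks, where 8 scales and 8 minimums share 12 bytes. The function names are hypothetical. decode_q_Kx8_6bit_scales operates on the interleaved x8 blocks, whose exact byte order is not reproduced here, so treat this only as an illustration of the bit manipulation involved.

```cpp
#include <cstdint>

// Unpack the j-th (0..7) 6-bit scale and 6-bit minimum from 12 packed bytes,
// following the non-interleaved Q4_K/Q5_K convention (assumption for this sketch).
inline void sketch_unpack_scale_min(int j, const uint8_t * q,
                                    uint8_t * scale, uint8_t * min) {
    if (j < 4) {
        // Entries 0..3 occupy the low 6 bits of bytes 0..3 (scales) and 4..7 (mins).
        *scale = q[j] & 63;
        *min   = q[j + 4] & 63;
    } else {
        // Entries 4..7: low 4 bits come from bytes 8..11, high 2 bits from the
        // top bits of bytes 0..3 (scales) and bytes 4..7 (mins).
        *scale = (uint8_t) ((q[j + 4] & 0x0F) | ((q[j - 4] >> 6) << 4));
        *min   = (uint8_t) ((q[j + 4] >> 4)   | ((q[j]     >> 6) << 4));
    }
}

// Inverse packing, included only so the sketch can be exercised end to end.
inline void sketch_pack_scales_mins(const uint8_t * scales, const uint8_t * mins,
                                    uint8_t * q) {
    for (int j = 0; j < 4; ++j) {
        q[j]     = (uint8_t) ((scales[j] & 63) | ((scales[j + 4] >> 4) << 6));
        q[j + 4] = (uint8_t) ((mins[j]   & 63) | ((mins[j + 4]   >> 4) << 6));
        q[j + 8] = (uint8_t) ((scales[j + 4] & 0x0F) | ((mins[j + 4] & 0x0F) << 4));
    }
}
```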
Usage
This file is compiled when the GGML CPU backend targets AArch64 with NEON. The ggml_quantize_mat_* functions convert rows of floating-point input into the interleaved q8_0 layout, and the GEMV/GEMM kernels are invoked during inference whenever the CPU backend selects the interleaved matrix multiplication path.
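The compile-time gating can be pictured with the standard ACLE feature macros. The function `sketch_selected_kernel_path` and the returned labels are hypothetical, and the ordering of the checks is an assumption. The real file may order or combine these checks differently, and may add run-time CPU feature checks that are not shown here.

```cpp
#include <string>

// Report which kernel family this translation unit would be able to compile.
// Labels are illustrative only.
std::string sketch_selected_kernel_path() {
#if defined(__aarch64__) && defined(__ARM_NEON) && defined(__ARM_FEATURE_MATMUL_INT8)
    return "neon+i8mm";      // int8 matrix-multiply instructions available
#elif defined(__aarch64__) && defined(__ARM_NEON) && defined(__ARM_FEATURE_DOTPROD)
    return "neon+dotprod";   // int8 dot-product instructions available
#elif defined(__aarch64__) && defined(__ARM_NEON)
    return "neon";           // baseline NEON only
#else
    return "generic";        // non-ARM build: fall back to portable code
#endif
}
```

On an x86 build this returns "generic"; on AArch64 the result depends on the compiler's target flags (for example, whether dot-product or I8MM support is enabled via `-march`).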
Code Reference
Source Location
GGML repo, file: src/ggml-cpu/arch/arm/repack.cpp (3847 lines).
Key Signatures
void ggml_quantize_mat_q8_0_4x4(const float * GGML_RESTRICT x, void * GGML_RESTRICT vy, int64_t k);
void ggml_quantize_mat_q8_0_4x8(const float * GGML_RESTRICT x, void * GGML_RESTRICT vy, int64_t k);
void ggml_gemv_q4_0_4x4_q8_0(int n, float * GGML_RESTRICT s, size_t bs,
                             const void * GGML_RESTRICT vx, const void * GGML_RESTRICT vy, int nr, int nc);
void ggml_gemm_q4_0_4x4_q8_0(int n, float * GGML_RESTRICT s, size_t bs,
                             const void * GGML_RESTRICT vx, const void * GGML_RESTRICT vy, int nr, int nc);
Import
#include "ggml-common.h"
#include "ggml-backend-impl.h"
#include "ggml-cpu.h"
#include "../../repack.h"
I/O Contract
Inputs (Repacking)
| Parameter | Type | Description |
|---|---|---|
| x | const float * | Source floating-point data laid out as consecutive rows. The function reads 4 rows at a time. |
| k | int64_t | Number of elements per row. Must be a multiple of the block size (32 for q8_0). |
Outputs (Repacking)
| Output | Type | Description |
|---|---|---|
| vy | void * | Destination buffer for interleaved quantized blocks (e.g., block_q8_0x4). |
Inputs (GEMV / GEMM)
| Parameter | Type | Description |
|---|---|---|
| n | int | Inner dimension size (number of elements per dot product). |
| vx | const void * | Pointer to the interleaved quantized weight matrix. |
| vy | const void * | Pointer to the quantized activation vector or matrix. |
| nr | int | Number of rows in the output. |
| nc | int | Number of columns in the output. |
Outputs (GEMV / GEMM)
| Output | Type | Description |
|---|---|---|
| s | float * | Destination buffer for the floating-point result matrix/vector. |
Usage Examples
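// Repack 4 rows of float data into interleaved q8_0x4 blocks for ARM NEON.
// Rows are stored back to back, so row r starts at rows + r * 256.
float rows[4 * 256];                 // 4 rows of 256 elements each (fill before use)
block_q8_0x4 repacked[256 / QK8_0];  // one interleaved block per 32 columns, covering all 4 rows
ggml_quantize_mat_q8_0_4x4(rows, repacked, 256);

// Perform GEMV with interleaved q4_0 weights and q8_0 activations.
// interleaved_weights and quantized_activations are placeholders for buffers
// prepared elsewhere; they are not defined in this snippet.
float output[4];                     // one result per output column (nc = 4)
ggml_gemv_q4_0_4x4_q8_0(256, output, sizeof(float),
                        interleaved_weights, quantized_activations, 1, 4);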
Related Pages
- Principle:Ggml_org_Ggml_Architecture_Specific_SIMD_Quantization
- Implementation:Ggml_org_Ggml_Cpu_arm_quants -- ARM NEON quantization and dot product routines
- Implementation:Ggml_org_Ggml_Cpu_x86_repack -- x86 AVX/AVX-512 equivalent
- Implementation:Ggml_org_Ggml_Cpu_riscv_repack -- RISC-V RVV equivalent