Implementation:Ggml org Ggml Cpu arm repack

From Leeroopedia


Metadata

Field | Value
Page Type | Implementation (Architecture-Specific SIMD)
Knowledge Sources | GGML
Domains | ML_Infrastructure, Tensor_Computing, SIMD_Optimization
Last Updated | 2025-05-15 12:00 GMT

Overview

ARM NEON-optimized matrix repacking, quantized GEMM, and GEMV kernels for interleaved block formats used in high-performance matrix multiplication on AArch64.

Description

arch/arm/repack.cpp provides ARM NEON implementations for three categories of operations that support optimized quantized matrix multiplication:

Matrix repacking functions (ggml_quantize_mat_q8_0_4x4, ggml_quantize_mat_q8_0_4x8) quantize multiple rows of floating-point data simultaneously into interleaved block formats (e.g., block_q8_0x4). These interleaved layouts place elements from 4 or 8 adjacent rows into a single block, enabling more efficient SIMD processing during matrix multiplication. The quantization uses the same NEON max-finding and rounding pattern as the standard quantization functions.

GEMV (matrix-vector) kernels (ggml_gemv_q4_0_4x4_q8_0, ggml_gemv_q4_0_4x8_q8_0, ggml_gemv_q4_0_8x8_q8_0, ggml_gemv_iq4_nl_4x4_q8_0, ggml_gemv_q4_K_8x4_q8_K, ggml_gemv_q4_K_8x8_q8_K, ggml_gemv_q5_K_8x8_q8_K, ggml_gemv_q6_K_8x8_q8_K, ggml_gemv_q8_0_4x4_q8_0, ggml_gemv_q8_0_4x8_q8_0) perform matrix-vector products using NEON dotprod or I8MM instructions on interleaved weight blocks.

GEMM (matrix-matrix) kernels (ggml_gemm_q4_0_4x4_q8_0, ggml_gemm_q4_0_4x8_q8_0, ggml_gemm_q4_0_8x8_q8_0, ggml_gemm_iq4_nl_4x4_q8_0, ggml_gemm_q4_K_8x4_q8_K, ggml_gemm_q4_K_8x8_q8_K, ggml_gemm_q5_K_8x8_q8_K, ggml_gemm_q6_K_8x8_q8_K) perform matrix-matrix products with the same interleaved formats.
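The interleaved layout consumed by the GEMV kernels listed above can be illustrated with a scalar reference. The following is a minimal sketch under simplifying assumptions, not the code in repack.cpp: the struct names `ActBlock` and `WeightBlock` and the function `gemv_interleaved_ref` are hypothetical, scales are stored as `float` rather than fp16, and weights are kept as plain int8 values instead of packed 4-bit nibbles. It shows only how a 4-column, 4-value interleave is indexed during the dot product.

```cpp
#include <cassert>
#include <cstdint>

constexpr int kBlock = 32;  // elements per quantization block
constexpr int kCols  = 4;   // weight rows (output columns) interleaved together
constexpr int kChunk = 4;   // consecutive values taken from one weight row at a time

// Hypothetical simplified blocks (float scales, int8 weights).
struct ActBlock    { float d; int8_t qs[kBlock]; };
struct WeightBlock { float d[kCols]; int8_t qs[kCols * kBlock]; };

// s[c] = sum over blocks of d_w[c] * d_a * dot(q_w[c], q_a), for nc output columns.
void gemv_interleaved_ref(int n, float* s, const WeightBlock* vx,
                          const ActBlock* vy, int nc) {
    assert(n % kBlock == 0 && nc % kCols == 0);
    const int nb = n / kBlock;
    for (int g = 0; g < nc / kCols; ++g) {        // one group = 4 output columns
        const WeightBlock* w = vx + g * nb;
        float acc[kCols] = {0.0f, 0.0f, 0.0f, 0.0f};
        for (int b = 0; b < nb; ++b) {
            for (int c = 0; c < kCols; ++c) {
                int32_t sum = 0;
                for (int k = 0; k < kBlock / kChunk; ++k) {
                    for (int i = 0; i < kChunk; ++i) {
                        // Chunk k of column c sits at offset (k * kCols + c) * kChunk.
                        sum += w[b].qs[(k * kCols + c) * kChunk + i] *
                               vy[b].qs[k * kChunk + i];
                    }
                }
                acc[c] += sum * w[b].d[c] * vy[b].d;
            }
        }
        for (int c = 0; c < kCols; ++c) s[g * kCols + c] = acc[c];
    }
}
```

Because the four columns of a group sit next to each other in memory, a SIMD kernel can load one activation chunk and multiply it against all four weight chunks without striding across separate rows, which is the motivation for the interleaved formats.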

A helper function decode_q_Kx8_6bit_scales decodes packed 6-bit scale and minimum values from Q4_K and Q5_K block formats for use in the K-quant GEMV/GEMM kernels. The file requires __aarch64__ and __ARM_NEON, with some functions further gated on __ARM_FEATURE_MATMUL_INT8 or __ARM_FEATURE_DOTPROD.
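To make the 6-bit scale decoding concrete, the sketch below unpacks eight 6-bit scales and eight 6-bit minimums from 12 bytes, following what is believed to be the bit layout of a single (non-interleaved) Q4_K block in GGML's scalar reference code. The function names `unpack_scale_min` and `pack_scale_min` are hypothetical, and the layout that decode_q_Kx8_6bit_scales handles across 8 interleaved blocks is not reproduced here.

```cpp
#include <cassert>
#include <cstdint>

// Decode the j-th (0..7) 6-bit scale and minimum from 12 packed bytes.
// Entries 0..3 live in the low 6 bits of bytes 0..7; entries 4..7 keep their
// low 4 bits in bytes 8..11 and their high 2 bits in the top of bytes 0..7.
inline void unpack_scale_min(int j, const uint8_t* q, uint8_t* scale, uint8_t* min) {
    assert(j >= 0 && j < 8);
    if (j < 4) {
        *scale = q[j] & 63;
        *min   = q[j + 4] & 63;
    } else {
        *scale = static_cast<uint8_t>((q[j + 4] & 0x0F) | ((q[j - 4] >> 6) << 4));
        *min   = static_cast<uint8_t>((q[j + 4] >> 4)   | ((q[j] >> 6) << 4));
    }
}

// Inverse of the above, included only to exercise the decoder.
// Assumes q starts zeroed and entries are written in increasing j.
inline void pack_scale_min(int j, uint8_t* q, uint8_t scale, uint8_t min) {
    assert(j >= 0 && j < 8 && scale < 64 && min < 64);
    if (j < 4) {
        q[j]     = scale;
        q[j + 4] = min;
    } else {
        q[j + 4]  = static_cast<uint8_t>((scale & 0x0F) | ((min & 0x0F) << 4));
        q[j - 4] |= static_cast<uint8_t>((scale >> 4) << 6);
        q[j]     |= static_cast<uint8_t>((min >> 4) << 6);
    }
}
```

Packing 16 six-bit values into 12 bytes is what makes the decode step necessary: the kernels need the scales and minimums as ordinary integers before they can be applied to the block sums.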

Usage

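This file is compiled when the GGML CPU backend targets AArch64 with NEON. The quantize_mat functions take floating-point input, so they are used to convert activations into the interleaved q8_0 layout before a matrix multiplication; weights are repacked into their interleaved formats separately, typically when the model is loaded. The GEMV/GEMM kernels are invoked during inference whenever the backend selects the interleaved matrix multiplication path.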
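The compile-time gating can be pictured with the standard ACLE feature macros named in the description. This is a sketch only: the function `repack_kernel_path` is hypothetical, the returned labels are illustrative, and the assumption that other targets fall back to a generic scalar path is not taken from the file itself.

```cpp
// Report which code path a kernel written in this style would compile to.
inline const char* repack_kernel_path() {
#if defined(__aarch64__) && defined(__ARM_NEON) && defined(__ARM_FEATURE_MATMUL_INT8)
    return "neon+i8mm";     // int8 matrix-multiply instructions available
#elif defined(__aarch64__) && defined(__ARM_NEON) && defined(__ARM_FEATURE_DOTPROD)
    return "neon+dotprod";  // int8 dot-product instructions available
#elif defined(__aarch64__) && defined(__ARM_NEON)
    return "neon";          // baseline NEON only
#else
    return "generic";       // assumed scalar fallback on other targets
#endif
}
```

Since the macros are set by the compiler from the target flags (for example an `-march` value that enables dotprod or i8mm), the choice is fixed at build time rather than detected at run time in this sketch.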

Code Reference

Source Location

GGML repo, file: src/ggml-cpu/arch/arm/repack.cpp (3847 lines).

Key Signatures

void ggml_quantize_mat_q8_0_4x4(const float * GGML_RESTRICT x, void * GGML_RESTRICT vy, int64_t k);
void ggml_quantize_mat_q8_0_4x8(const float * GGML_RESTRICT x, void * GGML_RESTRICT vy, int64_t k);

void ggml_gemv_q4_0_4x4_q8_0(int n, float * GGML_RESTRICT s, size_t bs,
    const void * GGML_RESTRICT vx, const void * GGML_RESTRICT vy, int nr, int nc);
void ggml_gemm_q4_0_4x4_q8_0(int n, float * GGML_RESTRICT s, size_t bs,
    const void * GGML_RESTRICT vx, const void * GGML_RESTRICT vy, int nr, int nc);

Import

#include "ggml-common.h"
#include "ggml-backend-impl.h"
#include "ggml-cpu.h"
#include "../../repack.h"

I/O Contract

Inputs (Repacking)

Parameter | Type | Description
x | const float * | Source floating-point data laid out as consecutive rows. Both functions read 4 rows at a time; the 4x4 and 4x8 suffixes select how many consecutive values (4 or 8) are taken from each row per interleave step.
k | int64_t | Number of elements per row. Must be a multiple of the block size (32 for q8_0).

Outputs (Repacking)

Output | Type | Description
vy | void * | Destination buffer for interleaved quantized blocks (e.g., block_q8_0x4).
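As a scalar illustration of the repacking contract, the sketch below quantizes 4 rows of `k` floats into `k / 32` interleaved blocks. It rests on several assumptions: the struct `BlockQ8x4` and the function `quantize_rows_interleaved` are hypothetical stand-ins, the scales are stored as `float` where GGML uses fp16, and plain loops replace the NEON max-finding and rounding. The `chunk` parameter plays the role of the 4 or 8 in the 4x4 and 4x8 function names.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

constexpr int kBlock = 32;  // elements per q8_0 block
constexpr int kRows  = 4;   // rows interleaved per output block

// Hypothetical stand-in for block_q8_0x4 (float scales instead of fp16).
struct BlockQ8x4 {
    float  d[kRows];            // one scale per source row
    int8_t qs[kRows * kBlock];  // quantized values, interleaved across rows
};

// Quantize 4 rows of k floats, taking `chunk` consecutive values per row in turn.
void quantize_rows_interleaved(const float* x, BlockQ8x4* y, int64_t k, int chunk) {
    assert(k % kBlock == 0 && kBlock % chunk == 0);
    const int64_t nb = k / kBlock;  // number of output blocks
    for (int64_t b = 0; b < nb; ++b) {
        float inv[kRows];
        for (int r = 0; r < kRows; ++r) {
            const float* src = x + r * k + b * kBlock;
            float amax = 0.0f;  // absolute maximum of this row's 32 values
            for (int i = 0; i < kBlock; ++i) amax = std::fmax(amax, std::fabs(src[i]));
            const float d = amax / 127.0f;
            y[b].d[r] = d;
            inv[r] = d != 0.0f ? 1.0f / d : 0.0f;
        }
        for (int j = 0; j < kRows * kBlock; ++j) {
            const int r   = (j / chunk) % kRows;                        // source row
            const int col = (j / (chunk * kRows)) * chunk + j % chunk;  // column in block
            const float v = x[r * k + b * kBlock + col] * inv[r];
            y[b].qs[j] = static_cast<int8_t>(std::lround(v));
        }
    }
}
```

With `chunk = 4`, output positions 0..3 hold columns 0..3 of row 0, positions 4..7 hold columns 0..3 of row 1, and position 16 returns to row 0 at column 4. The destination buffer therefore needs one block per 32 columns, each covering all 4 rows.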

Inputs (GEMV / GEMM)

Parameter | Type | Description
n | int | Inner dimension size (number of elements per dot product).
vx | const void * | Pointer to the interleaved quantized weight matrix.
vy | const void * | Pointer to the quantized activation vector or matrix.
nr | int | Number of rows in the output.
nc | int | Number of columns in the output.

Outputs (GEMV / GEMM)

Output | Type | Description
s | float * | Destination buffer for the floating-point result matrix/vector.

Usage Examples

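// Quantize 4 rows of float data into interleaved q8_0x4 blocks (ARM NEON path)
float rows[4 * 256];                 // 4 rows of 256 elements each, filled by the caller
block_q8_0x4 repacked[256 / QK8_0];  // one interleaved block per 32 columns, covering all 4 rows

ggml_quantize_mat_q8_0_4x4(rows, repacked, 256);

// Perform GEMV with interleaved q4_0 weights and q8_0 activations.
// interleaved_weights and quantized_activations are placeholders for buffers
// prepared elsewhere (repacked q4_0 weights and a q8_0-quantized input vector).
float output[4];
ggml_gemv_q4_0_4x4_q8_0(256, output, sizeof(float),
    interleaved_weights, quantized_activations, 1, 4);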
