Principle:Ggml org Ggml Architecture Specific SIMD Quantization
| Field | Value |
|---|---|
| sources | GGML ARM NEON Intrinsics Intel Intrinsics Guide |
| domains | SIMD, Quantization, Performance |
| last_updated | 2026-02-10 |
Overview
Architecture-Specific SIMD Quantization is the principle of providing platform-native, vectorized implementations of quantization and dequantization routines that exploit the specific SIMD instruction sets available on each target CPU architecture.
Description
GGML supports a wide variety of quantization formats (Q4_0, Q4_1, Q5_0, Q5_1, Q8_0, Q8_1, Q2_K through Q6_K, IQ formats, etc.) that reduce model weight storage from 32-bit or 16-bit floating point down to 2-8 bits per weight. The critical hot path in inference is the quantized dot product -- computing the dot product of a quantized weight vector with a floating-point activation vector -- which is invoked billions of times during matrix multiplication.
To achieve maximum throughput, GGML provides architecture-specific implementations of these dot product and dequantization routines for every major CPU SIMD instruction set:
| Architecture | SIMD ISA | Directory |
|---|---|---|
| ARM (AArch64/AArch32) | NEON, SVE, MATMUL_INT8, SME | src/ggml-cpu/arch/arm/ |
| x86_64 / x86 | SSE, AVX, AVX2, AVX-512, VNNI, AVX10/AMX | src/ggml-cpu/arch/x86/ |
| POWER (ppc64le) | VSX (Vector-Scalar Extension) | src/ggml-cpu/arch/powerpc/ |
| RISC-V | RVV (RISC-V Vector Extension) | src/ggml-cpu/arch/riscv/ |
| WebAssembly | WASM SIMD128 | src/ggml-cpu/arch/wasm/ |
| LoongArch | LSX/LASX | src/ggml-cpu/arch/loongarch/ |
| s390x | VXE (Vector Extension) | src/ggml-cpu/arch/s390/ |
Each architecture directory contains:
- quants.c -- Quantized dot product implementations (ggml_vec_dot_*) using native SIMD intrinsics
- repack.c/cpp (where applicable) -- Functions to rearrange quantized data into layouts optimal for the architecture's vector register width and multiply-accumulate patterns
- cpu-feats.cpp (where applicable) -- Runtime CPU feature detection to select the best code path
The implementations share a common algorithm but differ dramatically in their use of intrinsics. For example, the Q4_0 dot product on ARM NEON uses vld1q_u8/vmovl/vsubq to load and dequantize 4-bit values into int16 lanes, then vdotq_s32 (on ARMv8.2+) for the multiply-accumulate, while the AVX2 version uses _mm256_loadu_si256/_mm256_maddubs_epi16/_mm256_madd_epi16 for the equivalent operation.
A fallback scalar implementation (arch-fallback.h) is provided for platforms without SIMD support.
Usage
This principle applies whenever:
- A quantized model is run on a CPU backend (the most common inference scenario)
- The target CPU supports any level of SIMD instructions
- Performance-sensitive quantized dot products need to saturate available vector throughput
The architecture is selected at compile time via preprocessor defines (e.g., __ARM_NEON, __AVX2__, __wasm_simd128__), with runtime feature detection used within an architecture family to select among sub-variants (e.g., choosing VNNI-accelerated paths on supported Intel CPUs).
Theoretical Basis
The fundamental operation being optimized is the quantized dot product. Consider the Q4_0 format as a representative example:
Q4_0 Block Structure: - 32 weights packed into 16 bytes (4 bits each) - 1 float16 scale factor (d) - Each weight w_i = (nibble_i - 8) * d
Scalar Dot Product (reference):
function dot_q4_0(block[] A, float[] B, int n_blocks):
sum = 0.0
for each block b in A:
d = float(b.scale)
for i in 0..31:
nibble = extract_4bit(b.data, i)
weight = (nibble - 8) * d
sum += weight * B[block_offset + i]
return sum
SIMD-Optimized Dot Product (conceptual, 256-bit vectors): function dot_q4_0_simd(block[] A, float[] B, int n_blocks): accumulator = vector_zero() // 8x int32 or 8x float32
for each block b in A:
// Load 32 x 4-bit weights as 16 bytes
raw = vector_load_128bit(b.data)
// Unpack low/high nibbles into two 128-bit vectors of 16x uint8
lo_nibbles = raw AND 0x0F // lower 4 bits of each byte
hi_nibbles = (raw >> 4) AND 0x0F // upper 4 bits of each byte
// Interleave into 32 x int8 values, subtract bias of 8
weights_i8 = interleave(lo_nibbles, hi_nibbles) - 8
// Load 32 x float activations, convert to int8 (after scaling)
// OR: use integer multiply-accumulate, then scale at the end
acts = vector_load(B + offset)
// Multiply-accumulate using widening integer multiply
// e.g., maddubs (unsigned*signed -> int16), then madd (int16*1 -> int32)
partial = vector_multiply_accumulate(weights_i8, acts_i8)
// Scale by block scale factor and accumulate
accumulator += partial * broadcast(b.scale)
return horizontal_sum(accumulator)
The key optimizations that SIMD enables:
- Data parallelism -- Process 16-32 weight values per instruction instead of one at a time
- Widening multiply-accumulate -- Native instructions (NEON sdot/udot, AVX2 maddubs) perform multiply and add in a single cycle
- Reduced memory bandwidth -- Quantized data is smaller, so more weights fit in cache lines and vector loads
- Fused operations -- Combine nibble extraction, bias subtraction, and multiply into minimal instruction sequences
The performance difference between scalar and SIMD implementations is typically 8-16x, making this optimization essential for practical CPU inference.
Related Pages
- Implementation:Ggml_org_Ggml_Cpu_arm_quants
- Ggml_org_Ggml_Cpu_compute_engine -- The CPU backend that invokes these SIMD-optimized routines
- Ggml_org_Ggml_BLAS_Matrix_Multiplication -- Alternative path that dequantizes to float32 and uses BLAS
- Ggml_org_Ggml_CPU_Compute_Engine -- The higher-level CPU execution principle