Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Principle:Ggml org Ggml Architecture Specific SIMD Quantization

From Leeroopedia


Field Value
sources GGML ARM NEON Intrinsics Intel Intrinsics Guide
domains SIMD, Quantization, Performance
last_updated 2026-02-10

Overview

Architecture-Specific SIMD Quantization is the principle of providing platform-native, vectorized implementations of quantization and dequantization routines that exploit the specific SIMD instruction sets available on each target CPU architecture.

Description

GGML supports a wide variety of quantization formats (Q4_0, Q4_1, Q5_0, Q5_1, Q8_0, Q8_1, Q2_K through Q6_K, IQ formats, etc.) that reduce model weight storage from 32-bit or 16-bit floating point down to 2-8 bits per weight. The critical hot path in inference is the quantized dot product -- computing the dot product of a quantized weight vector with a floating-point activation vector -- which is invoked billions of times during matrix multiplication.

To achieve maximum throughput, GGML provides architecture-specific implementations of these dot product and dequantization routines for every major CPU SIMD instruction set:

Architecture SIMD ISA Directory
ARM (AArch64/AArch32) NEON, SVE, MATMUL_INT8, SME src/ggml-cpu/arch/arm/
x86_64 / x86 SSE, AVX, AVX2, AVX-512, VNNI, AVX10/AMX src/ggml-cpu/arch/x86/
POWER (ppc64le) VSX (Vector-Scalar Extension) src/ggml-cpu/arch/powerpc/
RISC-V RVV (RISC-V Vector Extension) src/ggml-cpu/arch/riscv/
WebAssembly WASM SIMD128 src/ggml-cpu/arch/wasm/
LoongArch LSX/LASX src/ggml-cpu/arch/loongarch/
s390x VXE (Vector Extension) src/ggml-cpu/arch/s390/

Each architecture directory contains:

  • quants.c -- Quantized dot product implementations (ggml_vec_dot_*) using native SIMD intrinsics
  • repack.c/cpp (where applicable) -- Functions to rearrange quantized data into layouts optimal for the architecture's vector register width and multiply-accumulate patterns
  • cpu-feats.cpp (where applicable) -- Runtime CPU feature detection to select the best code path

The implementations share a common algorithm but differ dramatically in their use of intrinsics. For example, the Q4_0 dot product on ARM NEON uses vld1q_u8/vmovl/vsubq to load and dequantize 4-bit values into int16 lanes, then vdotq_s32 (on ARMv8.2+) for the multiply-accumulate, while the AVX2 version uses _mm256_loadu_si256/_mm256_maddubs_epi16/_mm256_madd_epi16 for the equivalent operation.

A fallback scalar implementation (arch-fallback.h) is provided for platforms without SIMD support.

Usage

This principle applies whenever:

  • A quantized model is run on a CPU backend (the most common inference scenario)
  • The target CPU supports any level of SIMD instructions
  • Performance-sensitive quantized dot products need to saturate available vector throughput

The architecture is selected at compile time via preprocessor defines (e.g., __ARM_NEON, __AVX2__, __wasm_simd128__), with runtime feature detection used within an architecture family to select among sub-variants (e.g., choosing VNNI-accelerated paths on supported Intel CPUs).

Theoretical Basis

The fundamental operation being optimized is the quantized dot product. Consider the Q4_0 format as a representative example:

 Q4_0 Block Structure:
 - 32 weights packed into 16 bytes (4 bits each)
 - 1 float16 scale factor (d)
 - Each weight w_i = (nibble_i - 8) * d
 Scalar Dot Product (reference):
 function dot_q4_0(block[] A, float[] B, int n_blocks):
   sum = 0.0
   for each block b in A:
     d = float(b.scale)
     for i in 0..31:
       nibble = extract_4bit(b.data, i)
       weight = (nibble - 8) * d
       sum += weight * B[block_offset + i]
   return sum
 SIMD-Optimized Dot Product (conceptual, 256-bit vectors):
 function dot_q4_0_simd(block[] A, float[] B, int n_blocks):
   accumulator = vector_zero()     // 8x int32 or 8x float32
   for each block b in A:
     // Load 32 x 4-bit weights as 16 bytes
     raw = vector_load_128bit(b.data)
     // Unpack low/high nibbles into two 128-bit vectors of 16x uint8
     lo_nibbles = raw AND 0x0F        // lower 4 bits of each byte
     hi_nibbles = (raw >> 4) AND 0x0F // upper 4 bits of each byte
     // Interleave into 32 x int8 values, subtract bias of 8
     weights_i8 = interleave(lo_nibbles, hi_nibbles) - 8
     // Load 32 x float activations, convert to int8 (after scaling)
     // OR: use integer multiply-accumulate, then scale at the end
     acts = vector_load(B + offset)
     // Multiply-accumulate using widening integer multiply
     // e.g., maddubs (unsigned*signed -> int16), then madd (int16*1 -> int32)
     partial = vector_multiply_accumulate(weights_i8, acts_i8)
     // Scale by block scale factor and accumulate
     accumulator += partial * broadcast(b.scale)
   return horizontal_sum(accumulator)

The key optimizations that SIMD enables:

  1. Data parallelism -- Process 16-32 weight values per instruction instead of one at a time
  2. Widening multiply-accumulate -- Native instructions (NEON sdot/udot, AVX2 maddubs) perform multiply and add in a single cycle
  3. Reduced memory bandwidth -- Quantized data is smaller, so more weights fit in cache lines and vector loads
  4. Fused operations -- Combine nibble extraction, bias subtraction, and multiply into minimal instruction sequences

The performance difference between scalar and SIMD implementations is typically 8-16x, making this optimization essential for practical CPU inference.

Related Pages

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment