

Principle:Ggml org Ggml Quantized Matrix Multiplication

From Leeroopedia


Attribute          Value
Page Type          Principle
Full Name          Ggml_org_Ggml_Quantized_Matrix_Multiplication
Short Name         Quantized_Matrix_Multiplication
Domain Tags        Linear_Algebra, Quantization, Performance
Knowledge Source   GGML
Last Updated       2026-02-10

Overview

Quantized matrix multiplication performs matrix multiplication directly on quantized data, using specialized kernels that avoid full dequantization and exploit hardware-specific instructions such as Intel AMX tiles, ARM NEON/SVE dot products, and RISC-V vector extensions.

Description

Quantized Matrix Multiplication is the principle of computing matrix products directly on compressed (quantized) weight representations without first converting them back to full-precision floating point. In a naive inference pipeline, quantized weights are first dequantized to f32, then multiplied using conventional BLAS routines, with the results accumulated in floating point. Quantized matrix multiplication avoids that separate dequantization pass by using kernels that understand the quantized block format natively, performing the multiplication and accumulation with integer or mixed-precision arithmetic directly on the packed data.

GGML implements this principle through multiple specialized kernel families:

  • AMX (Advanced Matrix Extensions): Intel's AMX tile instructions perform 8-bit integer matrix multiplication directly in hardware on tile registers of up to 16 rows by 64 bytes, producing a 16x16 block of 32-bit results. The AMX MMQ kernels in GGML load quantized blocks into AMX tiles, perform tile-level multiply-accumulate operations using _tile_dpbssd (signed 8-bit dot product), and accumulate results in 32-bit integers. The kernels support Q4_0, Q4_1, Q5_0, Q5_1, Q8_0, Q2_K through Q6_K, and IQ4_NL formats. Template metaprogramming with forced unrolling (Unroll<N>) is used to generate optimized code for each quantization type.
  • tinyBLAS (sgemm): The tinyBLAS kernel library (from Mozilla's llamafile project) provides cache-optimized matrix multiplication kernels that operate on quantized types. These kernels are designed for matrices that fit in CPU cache, avoiding the overhead of workspace allocation. They support AVX-512, AVX2, ARM NEON, RISC-V Vector, and IBM VSX/VXE instruction sets, with the register file size determining tiling parameters (VECTOR_REGISTERS is 32 on NEON/AVX-512/VXE, 16 otherwise).
  • KleidiAI: Arm's KleidiAI library provides optimized quantized GEMM kernels for ARM processors with DOTPROD, I8MM, SVE, or SME extensions. The GGML KleidiAI integration detects available CPU features at runtime and selects the appropriate kernel variant. It supports Q4_0 and Q8_0 quantization formats with hardware-accelerated dot product instructions.
  • SpacemiT IME: Specialized kernels for the SpacemiT RISC-V processor's Integer Matrix Extension, targeting quantized matrix operations on RISC-V hardware with custom matrix acceleration instructions.

Usage

Quantized matrix multiplication is applied during the inference of quantized models:

  • LLM inference with quantized weights: When model weights are stored in Q4_0, Q4_K, Q8_0, or other quantized formats, QMM kernels compute the forward pass without dequantization overhead.
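  • Memory-bandwidth-limited scenarios: On CPUs where memory bandwidth is the bottleneck, QMM reduces the amount of data read from memory (for example, a Q8_0 block is roughly 4x smaller than the same weights in f32, and a Q4_0 block roughly 7x smaller), and the unpacking happens in registers as part of the multiplication rather than in a separate pass over memory.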
  • Hardware-specific acceleration: When running on processors with AMX (Intel Sapphire Rapids+), I8MM/SVE/SME (ARM), or RVV (RISC-V), QMM kernels exploit these instructions for throughput that exceeds what scalar dequantize-then-multiply could achieve.
  • Batched inference: The GEMM variants (as opposed to GEMV for single-vector inference) handle batched queries where multiple input vectors are multiplied against the weight matrix simultaneously.
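The memory saving mentioned in the list above can be checked with a little arithmetic. The sketch below assumes a block of 32 weights stored as packed integers plus a single fp16 scale, which matches the Q4_0 and Q8_0 layouts as commonly described. The helper names are hypothetical and are not part of GGML.

```python
# Storage per quantization block, assuming 32 weights per block and one
# fp16 scale per block (an assumption modeled on Q4_0 / Q8_0).
BLOCK_SIZE = 32        # weights per block
F32_BYTES = 4          # bytes per unquantized f32 weight
FP16_SCALE_BYTES = 2   # bytes for the per-block scale


def block_bytes(bits_per_weight: int) -> int:
    """Bytes for one block: packed quantized values plus one fp16 scale."""
    return BLOCK_SIZE * bits_per_weight // 8 + FP16_SCALE_BYTES


def compression_vs_f32(bits_per_weight: int) -> float:
    """How many times smaller a block is than the same weights in f32."""
    return (BLOCK_SIZE * F32_BYTES) / block_bytes(bits_per_weight)


if __name__ == "__main__":
    for name, bits in (("4-bit", 4), ("8-bit", 8)):
        print(name, block_bytes(bits), "bytes per block,",
              round(compression_vs_f32(bits), 2), "x smaller than f32")
```

Formats that carry extra per-block metadata (a minimum value, or nested scales as in the K-quants) would need additional terms in block_bytes.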

Theoretical Basis

Integer Arithmetic for Neural Networks

Neural network weights, after quantization, can be represented as integers with per-block scale factors. In the matrix product C = A * B, when both operands are stored as blocks of quantized integers with one scale per block (in GGML's CPU kernels the activations are typically quantized on the fly to an 8-bit block format for this purpose), each output element decomposes block by block: compute the integer dot product of the two blocks' quantized values, multiply it by the product of the two blocks' scale factors, and sum the results over all blocks. This decomposition allows the bulk of the computation (the dot products) to use fast integer arithmetic, with only a small number of floating-point scale multiplications per block.
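The decomposition can be illustrated with a minimal pure-Python sketch. It assumes symmetric 8-bit quantization with one scale per 32-element block. The function names quantize_blocks and block_dot are hypothetical, and the code shows only the arithmetic, not GGML's actual kernels or data layout.

```python
BLOCK = 32  # elements per quantization block (assumed)


def quantize_blocks(x):
    """Symmetric 8-bit quantization: returns a list of (scale, int values)."""
    blocks = []
    for i in range(0, len(x), BLOCK):
        chunk = x[i:i + BLOCK]
        amax = max(abs(v) for v in chunk)
        scale = amax / 127.0
        # An all-zero block gets scale 0 and all-zero integers.
        q = [round(v / scale) if scale else 0 for v in chunk]
        blocks.append((scale, q))
    return blocks


def block_dot(a_blocks, b_blocks):
    """Dot product of two block-quantized vectors.

    Per block: one integer dot product, then one floating-point multiply
    by the product of the two block scales.
    """
    total = 0.0
    for (sa, qa), (sb, qb) in zip(a_blocks, b_blocks):
        isum = sum(p * q for p, q in zip(qa, qb))  # pure integer arithmetic
        total += sa * sb * isum                    # one scale multiply per block
    return total


if __name__ == "__main__":
    x = [0.1 * i for i in range(64)]
    y = [0.05 * (i % 9) - 0.2 for i in range(64)]
    exact = sum(p * q for p, q in zip(x, y))
    approx = block_dot(quantize_blocks(x), quantize_blocks(y))
    print("exact:", exact, "block-quantized:", approx)
```

One output element of C is such a dot product between a row of A and a column of B, so the integer inner loop dominates the cost while the scale multiplications occur once per block pair.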

Tile-Based Matrix Multiplication

Modern hardware accelerators (Intel AMX, ARM SME) provide tile-level matrix multiplication instructions that compute an entire small matrix product in a single instruction or small instruction sequence (on AMX with 8-bit inputs, for example, a 16x64 tile times a 64x16 tile, accumulated into a 16x16 tile of 32-bit integers). The GGML AMX kernels pack quantized data into tile registers, issue tile multiply-accumulate instructions, and extract the results. The tiling strategy must account for the block structure of the quantized format: each quantization block (e.g., 32 elements for Q4_0) must align with tile boundaries for efficient loading.
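The control flow of a tile-based kernel can be sketched in scalar Python. The tile shape below (a 16x16 integer accumulator fed by 64 values along the reduction dimension) is an assumption loosely modeled on 8-bit AMX. The names tile_mac and tiled_int_matmul are hypothetical, and the sketch ignores the operand layouts and per-block scales that a real kernel must handle.

```python
# Assumed tile shape: 16x16 integer accumulator, 64 values along K per step.
TILE_M, TILE_N, TILE_K = 16, 16, 64


def tile_mac(acc, a_tile, b_tile):
    """Emulate one tile multiply-accumulate: acc += a_tile @ b_tile (integers)."""
    for m in range(len(a_tile)):
        for n in range(len(b_tile[0])):
            s = 0
            for k in range(len(b_tile)):
                s += a_tile[m][k] * b_tile[k][n]
            acc[m][n] += s


def tiled_int_matmul(a, b):
    """Integer matmul C = A @ B computed one output tile at a time."""
    M, K, N = len(a), len(b), len(b[0])
    c = [[0] * N for _ in range(M)]
    for m0 in range(0, M, TILE_M):
        for n0 in range(0, N, TILE_N):
            rows = min(TILE_M, M - m0)
            cols = min(TILE_N, N - n0)
            acc = [[0] * cols for _ in range(rows)]   # accumulator "tile"
            for k0 in range(0, K, TILE_K):
                # "Load" the input tiles, then issue one tile-level MAC.
                a_tile = [row[k0:k0 + TILE_K] for row in a[m0:m0 + TILE_M]]
                b_tile = [row[n0:n0 + TILE_N] for row in b[k0:k0 + TILE_K]]
                tile_mac(acc, a_tile, b_tile)
            for i, row in enumerate(acc):             # "Store" the result tile
                c[m0 + i][n0:n0 + cols] = row
    return c
```

On real hardware the body of tile_mac collapses into one instruction, so the surrounding load and store logic, and how well quantization blocks line up with TILE_K, largely determine efficiency.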

Cache-Optimized Kernel Design

The tinyBLAS kernels are specifically designed for matrices that fit in CPU cache. Traditional BLAS libraries use workspace buffers and complex tiling strategies for large matrices; tinyBLAS instead targets the common case in LLM inference where at least one matrix dimension is small (single token or small batch), and the working set fits in L1/L2 cache. This eliminates malloc overhead and cache-filling passes, reducing per-call latency.

Fused Dequantization and Accumulation

Rather than maintaining separate dequantize and multiply steps, QMM kernels fuse these operations. For each pair of quantized blocks, the kernel unpacks the quantized values using bit manipulation (shifts and masks for 4-bit values, direct loads for 8-bit values), computes the dot product in integer or narrow floating-point precision, and immediately accumulates the scaled result into the output. This fusion reduces memory traffic (intermediate dequantized values never touch main memory) and instruction count.
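A minimal sketch of the fused step for one block pair follows. It assumes a Q4_0-style layout in which each byte holds two 4-bit values stored with an offset of 8 (low nibble for element j, high nibble for element j + 16), multiplied against a block of 8-bit integers. The function names are hypothetical and the code is an illustration rather than GGML's implementation.

```python
def pack_q4(values):
    """Pack 32 integers in [-8, 7] into 16 bytes (assumed Q4_0-style layout)."""
    if len(values) != 32 or any(not -8 <= v <= 7 for v in values):
        raise ValueError("expected 32 values in [-8, 7]")
    half = 16
    return bytes(
        ((values[j] + 8) & 0x0F) | (((values[j + half] + 8) & 0x0F) << 4)
        for j in range(half)
    )


def fused_q4_q8_dot(w_scale, w_packed, x_scale, x_q8):
    """Unpack 4-bit weights and accumulate against 8-bit inputs in one pass."""
    half = len(w_packed)
    isum = 0
    for j, byte in enumerate(w_packed):
        lo = (byte & 0x0F) - 8   # mask out the low nibble, remove the offset
        hi = (byte >> 4) - 8     # shift down the high nibble, remove the offset
        isum += lo * x_q8[j] + hi * x_q8[j + half]
    # A single floating-point multiply applies both block scales.
    return w_scale * x_scale * isum


if __name__ == "__main__":
    w = [(i % 16) - 8 for i in range(32)]
    x = [i - 16 for i in range(32)]
    print(fused_q4_q8_dot(0.5, pack_q4(w), 0.25, x))
```

The unpacked values lo and hi exist only as temporaries inside the loop, which is the point of the fusion: no dequantized copy of the block is ever written out.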
