Implementation:Ggml org Ggml Cpu amx mmq

From Leeroopedia


Metadata

Field | Value
Page Type | Implementation (AMX Accelerator)
Knowledge Sources | GGML
Domains | ML_Infrastructure, Tensor_Computing, CPU_Backend, Quantized_Matrix_Multiplication
Last Updated | 2025-05-15 12:00 GMT

Overview

Implements AMX-accelerated quantized matrix multiplication kernels using Intel AMX tile instructions and AVX-512 VNNI for high-throughput quantized inference.

Description

amx/mmq.cpp provides the performance-critical kernels for the Intel AMX path, targeting several-fold speedups over scalar and generic SIMD code for supported quantization types. Key components include:

  1. Template metaprogramming: The Unroll<n> struct provides compile-time loop unrolling for inner kernels, using std::integral_constant for index dispatch.
  2. Type traits system: PackedTypes maps quantization block types to their packed element type (e.g., block_q4_0 -> int8_t). do_compensate, do_unpack, and is_type_qkk control kernel behavior based on the quantization format.
  3. Dispatch macros: GGML_DISPATCH_QTYPES instantiates kernels for q4_0, q4_1, q8_0, q4_K, q5_K, q6_K, iq4_xs. GGML_DISPATCH_FLOATING_TYPES handles f16 and bf16. GGML_DISPATCH_BOOL specializes on boolean template parameters.
  4. Micro-kernel classes: tinygemm_kernel_vnni implements matrix multiply with accumulation using VNNI (Vector Neural Network Instructions). tinygemm_kernel_avx provides an AVX-512 fallback.
  5. Tile configuration: Uses _tile_loadconfig to configure AMX tile registers (tile sizes, palette), and _tile_dpbssd/_tile_dpbusd for tile-based integer dot products.
  6. Weight conversion: Packs quantized weights into AMX-friendly contiguous formats for tile loading.
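The compile-time unrolling idea from item 1 can be sketched as follows. This is a minimal illustration of the pattern, not a copy of the code in mmq.cpp: the callback receives each index as a std::integral_constant, so the index is usable as a constant expression inside the body. The helper dot_unrolled is hypothetical and exists only to exercise the pattern.

```cpp
#include <type_traits>

// Recursive case: run indices 0..n-2 first, then index n-1.
template <int n>
struct Unroll {
    template <typename Func>
    void operator()(const Func & f) const {
        Unroll<n - 1>{}(f);
        f(std::integral_constant<int, n - 1>{});
    }
};

// Base case: a single iteration with index 0.
template <>
struct Unroll<1> {
    template <typename Func>
    void operator()(const Func & f) const {
        f(std::integral_constant<int, 0>{});
    }
};

// Hypothetical helper (not part of ggml): dot product of two fixed-size
// arrays, with the loop expanded at compile time.
template <int N>
int dot_unrolled(const int (&a)[N], const int (&b)[N]) {
    int acc = 0;
    Unroll<N>{}([&](auto i) {
        // The index is carried in the type of i, so it is a compile-time constant.
        constexpr int idx = decltype(i)::value;
        acc += a[idx] * b[idx];
    });
    return acc;
}
```

Because each index is a distinct type, the compiler instantiates one call per iteration and can keep per-iteration state (for example, accumulator registers in a real kernel) in separately named variables or array slots with constant subscripts.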

The entry point ggml_backend_amx_mul_mat handles the complete mul_mat operation including data preparation, kernel dispatch, and result accumulation.
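To make the data preparation and accumulation steps more concrete, the following scalar sketch mirrors what a signed-by-signed tile dot product such as _tile_dpbssd computes: each 32-bit output element accumulates products of groups of four adjacent bytes. Both function names are hypothetical, and the packing shown is the generic four-byte-group layout that such instructions consume; the exact packed formats used in mmq.cpp for each quantization type are not described on this page.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: repack a row-major K x N int8 matrix so that every
// group of 4 consecutive k values for one column n is contiguous.
// Result layout: (K / 4) rows, each holding N groups of 4 bytes.
// Assumes K is a multiple of 4.
std::vector<int8_t> pack_b_vnni(const std::vector<int8_t> & b, int K, int N) {
    std::vector<int8_t> packed(static_cast<std::size_t>(K) * N);
    for (int k = 0; k < K; ++k) {
        for (int n = 0; n < N; ++n) {
            packed[(k / 4) * (N * 4) + n * 4 + (k % 4)] = b[k * N + n];
        }
    }
    return packed;
}

// Scalar reference: C (M x N, int32) += A (M x K, int8) * B (packed as above).
void dpbssd_reference(std::vector<int32_t> & c,
                      const std::vector<int8_t> & a,
                      const std::vector<int8_t> & b_packed,
                      int M, int N, int K) {
    for (int m = 0; m < M; ++m) {
        for (int k = 0; k < K / 4; ++k) {
            for (int n = 0; n < N; ++n) {
                for (int i = 0; i < 4; ++i) {
                    c[m * N + n] += int32_t(a[m * K + 4 * k + i]) *
                                    int32_t(b_packed[k * (N * 4) + n * 4 + i]);
                }
            }
        }
    }
}
```

The _tile_dpbusd variant follows the same structure but treats the first operand's bytes as unsigned.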

Usage

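AMX acceleration is activated automatically on Intel CPUs with AMX-INT8 and AVX-512 VNNI support when the AMX buffer type is registered via the CPU backend interface. The kernels are only compiled in when the __AMX_INT8__ and __AVX512VNNI__ preprocessor macros are defined; GCC and Clang define them when the corresponding target features are enabled (for example with -mamx-int8 and -mavx512vnni).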
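A compile-time guard on those two macros could look like the sketch below. The function names are hypothetical and only show the shape of such a check; they are not taken from the GGML sources, and a runtime CPU capability check would still be needed before executing AMX instructions.

```cpp
#include <cstring>

// Hypothetical helper: reports whether this translation unit was built with
// both AMX-INT8 and AVX-512 VNNI enabled.
constexpr bool amx_kernels_compiled_in() {
#if defined(__AMX_INT8__) && defined(__AVX512VNNI__)
    return true;
#else
    return false;
#endif
}

// Hypothetical helper: names the path a build would be able to take.
const char * mul_mat_path_name() {
    return amx_kernels_compiled_in() ? "amx" : "generic";
}
```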

Code Reference

Source Location

GGML repo, file: src/ggml-cpu/amx/mmq.cpp (2512 lines).

Signature

// Main entry point for AMX matrix multiplication
void ggml_backend_amx_mul_mat(
    const ggml_compute_params * params,
    struct ggml_tensor * dst);

Import

#include "amx/amx.h"
#include "amx/mmq.h"

I/O Contract

Inputs

Parameter | Type | Required | Description
params | const ggml_compute_params * | Yes | Thread index, thread count for parallel tiled execution.
dst | struct ggml_tensor * | Yes | Destination tensor; dst->src[0] contains the weight matrix (quantized), dst->src[1] contains the input matrix.
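Since params carries a thread index and a thread count, each thread can derive its own share of the work from those two values. The sketch below shows one common way to do that, an even split into contiguous ranges. It is an assumption for illustration: thread_range is a hypothetical helper, and the actual partitioning scheme used inside mmq.cpp is not documented on this page.

```cpp
#include <algorithm>
#include <cassert>
#include <utility>

// Hypothetical helper: split n work items (for example, output tiles) across
// nth threads and return the half-open range [begin, end) for thread ith.
std::pair<int, int> thread_range(int n, int ith, int nth) {
    assert(nth > 0 && ith >= 0 && ith < nth);
    const int chunk = (n + nth - 1) / nth;        // ceil(n / nth)
    const int begin = std::min(ith * chunk, n);   // clamp for trailing threads
    const int end   = std::min(begin + chunk, n);
    return {begin, end};
}
```

With 10 items and 4 threads this yields [0, 3), [3, 6), [6, 9) and [9, 10). When there are fewer items than threads, the trailing threads receive empty ranges.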

Outputs

Output | Type | Description
dst->data | float * | Matrix multiplication result in f32 format.

Usage Examples

AMX Acceleration Flow (Internal)

// AMX mul_mat is triggered automatically when:
// 1. The CPU supports AMX-INT8 and AVX-512 VNNI
// 2. The weight tensor uses the AMX buffer type
// 3. The weight type is a supported quantization format

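// The compute engine checks the traits attached to the weight tensor.
// Simplified pseudocode: in ggml, tensor->extra is an opaque void *, so the
// real check goes through the CPU backend's tensor-traits interface.
if (dst->src[0]->extra != nullptr /* weight tensor carries AMX traits */) {
    // Dispatches to the AMX-optimized path.
    // params is already a const ggml_compute_params *, matching the signature.
    ggml_backend_amx_mul_mat(params, dst);
}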

Supported Quantization Types

// The following quantization types are supported:
// - GGML_TYPE_Q4_0, GGML_TYPE_Q4_1, GGML_TYPE_Q8_0
// - GGML_TYPE_Q4_K, GGML_TYPE_Q5_K, GGML_TYPE_Q6_K
// - GGML_TYPE_IQ4_XS
// - GGML_TYPE_F16, GGML_TYPE_BF16 (floating point)
