Implementation:Ggml org Ggml Cpu amx mmq

Metadata

Field	Value
Page Type	Implementation (AMX Accelerator)
Knowledge Sources	GGML
Domains	ML_Infrastructure, Tensor_Computing, CPU_Backend, Quantized_Matrix_Multiplication
Last Updated	2025-05-15 12:00 GMT

Overview

Implements AMX-accelerated quantized matrix multiplication kernels using Intel AMX tile instructions and AVX-512 VNNI for high-throughput quantized inference.

Description

amx/mmq.cpp contains the performance-critical kernels that make Intel AMX acceleration worthwhile: for supported quantization types, they are reported to deliver several-fold speedups over scalar and plain SIMD code paths. Key components include:

Template metaprogramming: The Unroll<n> struct provides compile-time loop unrolling for inner kernels, using std::integral_constant for index dispatch.
Type traits system: PackedTypes maps quantization block types to their packed element type (e.g., block_q4_0 -> int8_t). do_compensate, do_unpack, and is_type_qkk control kernel behavior based on the quantization format.
Dispatch macros: GGML_DISPATCH_QTYPES instantiates kernels for q4_0, q4_1, q8_0, q4_K, q5_K, q6_K, iq4_xs. GGML_DISPATCH_FLOATING_TYPES handles f16 and bf16. GGML_DISPATCH_BOOL specializes on boolean template parameters.
Micro-kernel classes: tinygemm_kernel_vnni implements VNNI-based (Vector Neural Network Instructions) matrix multiply with accumulation. tinygemm_kernel_avx provides an AVX-512 fallback.
Tile configuration: Uses _tile_loadconfig to configure AMX tile registers (tile sizes, palette), and _tile_dpbssd/_tile_dpbusd for tile-based integer dot products.
Weight conversion: Packs quantized weights into AMX-friendly contiguous formats for tile loading.
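
The unrolling and type-trait components listed above can be illustrated with a minimal, portable sketch. It is not the code from mmq.cpp: the demo_block_* structs are hypothetical stand-ins for ggml's block types, sum_indices is a hypothetical demo function, and which types set is_type_qkk is an assumption made for illustration.

```cpp
#include <cstdint>
#include <type_traits>

// Compile-time unrolling: calls f(integral_constant<int, 0>{}) through
// f(integral_constant<int, n - 1>{}) in order, without a runtime loop.
template <int n>
struct Unroll {
    template <typename Func>
    void operator()(const Func & f) const {
        Unroll<n - 1>{}(f);
        f(std::integral_constant<int, n - 1>{});
    }
};

// Base case: a single call with index 0.
template <>
struct Unroll<1> {
    template <typename Func>
    void operator()(const Func & f) const {
        f(std::integral_constant<int, 0>{});
    }
};

// Hypothetical stand-ins for ggml block types (not the real definitions).
struct demo_block_q4_0 {};
struct demo_block_q4_K {};

// Trait mapping a block type to its packed element type.
template <typename T> struct PackedTypes {};
template <> struct PackedTypes<demo_block_q4_0> { using type = std::int8_t; };
template <> struct PackedTypes<demo_block_q4_K> { using type = std::int8_t; };

// Boolean trait: false by default, specialized to true per type.
// Assumption: K-quant style blocks are the ones flagged here.
template <typename T> struct is_type_qkk : std::false_type {};
template <> struct is_type_qkk<demo_block_q4_K> : std::true_type {};

// Demo: sums 0 + 1 + ... + (N - 1); each index is a compile-time constant
// inside the lambda, so it could also be used as a template argument.
template <int N>
int sum_indices() {
    int total = 0;
    Unroll<N>{}([&](auto i) {
        constexpr int idx = decltype(i)::value;
        total += idx;
    });
    return total;
}
```

With these definitions, sum_indices<4>() evaluates to 6. Because the index arrives as a type rather than a runtime value, a kernel can use it to select registers or array slots at compile time, which is the usual motivation for this pattern.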

The entry point ggml_backend_amx_mul_mat handles the complete mul_mat operation including data preparation, kernel dispatch, and result accumulation.
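
The tile configuration and tile-based integer dot products named in the component list can be sketched without AMX hardware. The block below is a hedged illustration, not the ggml implementation: TileConfig mirrors the 64-byte layout that the tile-configuration load expects, make_full_tile_config and dpbssd_reference are hypothetical helpers, and the scalar loop only reproduces the arithmetic of a signed-by-signed tile dot product on operands stored in the 4-byte-grouped (VNNI-style) layout.

```cpp
#include <cstdint>

// 64-byte tile configuration block (palette, per-tile row bytes and rows).
struct TileConfig {
    std::uint8_t  palette_id = 0;   // 1 selects the standard tile palette
    std::uint8_t  start_row  = 0;
    std::uint8_t  reserved[14] = {};
    std::uint16_t colsb[16] = {};   // bytes per row, one entry per tile slot
    std::uint8_t  rows[16]  = {};   // number of rows, one entry per tile slot
};
static_assert(sizeof(TileConfig) == 64, "tile config must be 64 bytes");

// Hypothetical helper: marks the first num_tiles tiles (at most 8) as
// full-size tiles of 16 rows x 64 bytes.
inline TileConfig make_full_tile_config(int num_tiles) {
    TileConfig cfg;
    cfg.palette_id = 1;
    for (int i = 0; i < num_tiles && i < 8; ++i) {
        cfg.rows[i]  = 16;
        cfg.colsb[i] = 64;
    }
    // On AMX hardware the block would then be passed to _tile_loadconfig(&cfg).
    return cfg;
}

// Scalar reference for a signed int8 tile dot product with int32 accumulation.
// A: M x K row-major. B: (K/4) x (N*4), where the 4 consecutive K-values of
// one output column are stored adjacently. C: M x N, accumulated in place.
inline void dpbssd_reference(std::int32_t * C, const std::int8_t * A,
                             const std::int8_t * B, int M, int N, int K) {
    for (int m = 0; m < M; ++m) {
        for (int n = 0; n < N; ++n) {
            std::int32_t acc = C[m * N + n];
            for (int k = 0; k < K / 4; ++k) {
                for (int i = 0; i < 4; ++i) {
                    acc += std::int32_t(A[m * K + 4 * k + i]) *
                           std::int32_t(B[(k * N + n) * 4 + i]);
                }
            }
            C[m * N + n] = acc;
        }
    }
}
```

The grouped layout of B is why weights are repacked ahead of time: each 32-bit lane of a tile row must already hold the four 8-bit values that contribute to one output element.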

Usage

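AMX acceleration is activated automatically on Intel CPUs with AMX-INT8 and AVX-512 VNNI support when the AMX buffer type is registered via the CPU backend interface. Building the kernels requires the __AMX_INT8__ and __AVX512VNNI__ preprocessor macros to be defined; GCC and Clang define them when the corresponding target features are enabled (for example with -mamx-int8 and -mavx512vnni).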

Code Reference

Source Location

GGML repo, file: src/ggml-cpu/amx/mmq.cpp (2512 lines).

Signature

// Main entry point for AMX matrix multiplication
void ggml_backend_amx_mul_mat(
    const ggml_compute_params * params,
    struct ggml_tensor * dst);

Import

#include "amx/amx.h"
#include "amx/mmq.h"

I/O Contract

Inputs

Parameter	Type	Required	Description
`params`	`const ggml_compute_params *`	Yes	Compute parameters, including the thread index and thread count used for parallel tiled execution.
`dst`	`struct ggml_tensor *`	Yes	Destination tensor; `dst->src[0]` holds the quantized weight matrix and `dst->src[1]` holds the input matrix.

Outputs

Output	Type	Description
`dst->data`	`float *`	Matrix multiplication result in f32 format.
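
Since params carries a thread index and a thread count, each thread can work on its own slice of the output. The following is a minimal sketch of one such partitioning, assuming the index and count are available as plain integers (ggml's ggml_compute_params exposes them as ith and nth). split_range is a hypothetical helper, and the partitioning scheme actually used by the AMX kernels may differ.

```cpp
#include <algorithm>
#include <utility>

// Hypothetical helper (not a ggml function): returns the half-open range
// [begin, end) of work items assigned to thread ith out of nth threads.
// Threads beyond the available work receive an empty range.
inline std::pair<int, int> split_range(int n, int ith, int nth) {
    const int chunk = (n + nth - 1) / nth;        // ceil(n / nth)
    const int begin = std::min(n, ith * chunk);
    const int end   = std::min(n, begin + chunk);
    return {begin, end};
}
```

For 10 work items and 4 threads this yields the ranges [0, 3), [3, 6), [6, 9), and [9, 10), so every item is covered exactly once and no two threads write the same part of dst->data.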

Usage Examples

AMX Acceleration Flow (Internal)

// AMX mul_mat is triggered automatically when:
// 1. The CPU supports AMX-INT8 and AVX-512 VNNI
// 2. The weight tensor uses the AMX buffer type
// 3. The weight type is a supported quantization format

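// The compute engine checks the tensor traits attached to the tensor.
// Simplified pseudocode: `extra` is an opaque pointer in ggml_tensor, so the
// member access below is illustrative rather than literal.
if (tensor->extra && tensor->extra->work_size) {
    // Dispatches to the AMX-optimized path
    ggml_backend_amx_mul_mat(params, dst);
}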

Supported Quantization Types

// The following quantization types are supported:
// - GGML_TYPE_Q4_0, GGML_TYPE_Q4_1, GGML_TYPE_Q8_0
// - GGML_TYPE_Q4_K, GGML_TYPE_Q5_K, GGML_TYPE_Q6_K
// - GGML_TYPE_IQ4_XS
// - GGML_TYPE_F16, GGML_TYPE_BF16 (floating point)

Related Pages

Ggml_org_Ggml_Cpu_backend_interface -- Registers AMX as an extra buffer type.
Ggml_org_Ggml_Cpu_quantization -- Quantization primitives used by AMX kernels.
Ggml_org_Ggml_Cpu_sgemm -- Alternative SGEMM path for non-AMX CPUs.
Ggml_org_Ggml_Cpu_simd_mappings -- SIMD macros used alongside AMX tile operations.
