Implementation:Ggml org Ggml Cpu amx mmq
Metadata
| Field | Value |
|---|---|
| Page Type | Implementation (AMX Accelerator) |
| Knowledge Sources | GGML |
| Domains | ML_Infrastructure, Tensor_Computing, CPU_Backend, Quantized_Matrix_Multiplication |
| Last Updated | 2025-05-15 12:00 GMT |
Overview
Implements AMX-accelerated quantized matrix multiplication kernels using Intel AMX tile instructions and AVX-512 VNNI for high-throughput quantized inference.
Description
amx/mmq.cpp provides the performance-critical kernel code that makes Intel AMX acceleration worthwhile, delivering several-fold speedups over scalar/SIMD approaches for supported quantization types. Key components include:
- Template metaprogramming: The Unroll<n> struct provides compile-time loop unrolling for inner kernels, using std::integral_constant for index dispatch.
- Type traits system: PackedTypes maps quantization block types to their packed element type (e.g., block_q4_0 -> int8_t). do_compensate, do_unpack, and is_type_qkk control kernel behavior based on the quantization format.
- Dispatch macros: GGML_DISPATCH_QTYPES instantiates kernels for q4_0, q4_1, q8_0, q4_K, q5_K, q6_K, iq4_xs. GGML_DISPATCH_FLOATING_TYPES handles f16 and bf16. GGML_DISPATCH_BOOL specializes on boolean template parameters.
- Micro-kernel classes: tinygemm_kernel_vnni implements VNNI-based (Vector Neural Network Instructions) matrix multiply with accumulation. tinygemm_kernel_avx provides an AVX-512 fallback.
- Tile configuration: Uses _tile_loadconfig to configure AMX tile registers (tile sizes, palette), and _tile_dpbssd/_tile_dpbusd for tile-based integer dot products.
- Weight conversion: Packs quantized weights into AMX-friendly contiguous formats for tile loading.
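The compile-time unrolling idea from the first bullet can be illustrated with a minimal sketch. It is modeled on the description above rather than copied from amx/mmq.cpp, so details may differ from the real Unroll<n>; the dot_unrolled helper is hypothetical and exists only for this example.

```cpp
#include <cstdint>
#include <type_traits>

// Sketch of a compile-time unroller: invokes f(integral_constant<int, 0>),
// f(integral_constant<int, 1>), ..., f(integral_constant<int, n - 1>) in order.
template <int n>
struct Unroll {
    template <typename Func, typename... Args>
    inline void operator()(const Func & f, Args... args) const {
        Unroll<n - 1>{}(f, args...);                       // handle indices 0 .. n-2 first
        f(std::integral_constant<int, n - 1>{}, args...);  // then index n-1
    }
};

// Recursion terminates at a single iteration.
template <>
struct Unroll<1> {
    template <typename Func, typename... Args>
    inline void operator()(const Func & f, Args... args) const {
        f(std::integral_constant<int, 0>{}, args...);
    }
};

// Hypothetical demo (not part of ggml): int8 dot product over fixed-size arrays.
template <int N>
int dot_unrolled(const std::int8_t (&a)[N], const std::int8_t (&b)[N]) {
    int acc = 0;
    Unroll<N>{}([&](auto i) {
        // The index travels in the type of i, so it is a compile-time constant.
        constexpr int k = decltype(i)::value;
        acc += int(a[k]) * int(b[k]);
    });
    return acc;
}
```

Because each index arrives as a distinct std::integral_constant type, the lambda body can use it wherever a constant expression is required, for example as a template argument. This is presumably why the kernels prefer this pattern over a plain for loop.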
The entry point ggml_backend_amx_mul_mat handles the complete mul_mat operation including data preparation, kernel dispatch, and result accumulation.
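To make the tile dot product and the weight-packing step more concrete, here is a portable scalar sketch of the arithmetic that a signed-by-signed tile dot product such as _tile_dpbssd performs, together with the 4-byte interleaved layout it expects for the second operand. Both functions, tile_dpbssd_ref and pack_vnni, are hypothetical names for this example only. The real kernels operate on hardware tile registers and on quantized block formats that this sketch does not model.

```cpp
#include <cstdint>
#include <vector>

// Repack a row-major K x N int8 matrix so that each group of 4 consecutive
// K values of one column is stored contiguously. K is assumed to be a
// multiple of 4. Result: (K / 4) rows of N * 4 bytes.
std::vector<std::int8_t> pack_vnni(const std::vector<std::int8_t> & B, int K, int N) {
    std::vector<std::int8_t> packed(static_cast<std::size_t>(K) * N);
    for (int k = 0; k < K; ++k) {
        for (int n = 0; n < N; ++n) {
            packed[(k / 4) * (N * 4) + n * 4 + (k % 4)] = B[k * N + n];
        }
    }
    return packed;
}

// Scalar reference for C += A * B with B in the packed layout above.
// A: M rows of K4 * 4 int8 values, row-major.
// B: K4 rows of N * 4 int8 values (packed).
// C: M x N int32 accumulators, row-major.
void tile_dpbssd_ref(std::vector<std::int32_t> & C,
                     const std::vector<std::int8_t> & A,
                     const std::vector<std::int8_t> & B,
                     int M, int N, int K4) {
    for (int m = 0; m < M; ++m) {
        for (int k = 0; k < K4; ++k) {
            for (int n = 0; n < N; ++n) {
                std::int32_t sum = 0;
                for (int i = 0; i < 4; ++i) {
                    // 4-way int8 dot product accumulated into one int32 lane
                    sum += std::int32_t(A[m * (K4 * 4) + 4 * k + i]) *
                           std::int32_t(B[k * (N * 4) + 4 * n + i]);
                }
                C[m * N + n] += sum;
            }
        }
    }
}
```

The packed layout places the four bytes of each inner dot product next to each other in memory, so a single tile load can feed the instruction directly. A scalar reference like this one can be used to check a tile-based kernel on small inputs.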
Usage
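AMX acceleration is activated automatically on Intel CPUs with AMX-INT8 and AVX-512 VNNI support when the AMX buffer type is registered via the CPU backend interface. The kernels are compiled only when the __AMX_INT8__ and __AVX512VNNI__ preprocessor macros are defined, which the compiler does when the corresponding target features are enabled (for example, -mamx-int8 and -mavx512vnni with GCC or Clang).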
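As a minimal sketch of the compile-time requirement, the two macros can be tested with an ordinary preprocessor guard. The helper name amx_kernels_compiled_in is hypothetical and not part of ggml. It reports only whether the translation unit was built with both features enabled, not whether the running CPU supports them.

```cpp
// Hypothetical helper: true only when the compiler was told to target
// both AMX-INT8 and AVX-512 VNNI (e.g. -mamx-int8 -mavx512vnni).
constexpr bool amx_kernels_compiled_in() {
#if defined(__AMX_INT8__) && defined(__AVX512VNNI__)
    return true;
#else
    return false;
#endif
}
```

A build-time check like this does not replace a runtime CPU feature check, because a binary built with these flags can still be started on a machine without AMX.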
Code Reference
Source Location
GGML repo, file: src/ggml-cpu/amx/mmq.cpp (2512 lines).
Signature
// Main entry point for AMX matrix multiplication
void ggml_backend_amx_mul_mat(
const ggml_compute_params * params,
struct ggml_tensor * dst);
Import
#include "amx/amx.h"
#include "amx/mmq.h"
I/O Contract
Inputs
| Parameter | Type | Required | Description |
|---|---|---|---|
| params | const ggml_compute_params * | Yes | Thread index and thread count for parallel tiled execution. |
| dst | struct ggml_tensor * | Yes | Destination tensor; dst->src[0] contains the weight matrix (quantized), dst->src[1] contains the input matrix. |
Outputs
| Output | Type | Description |
|---|---|---|
| dst->data | float * | Matrix multiplication result in f32 format. |
Usage Examples
AMX Acceleration Flow (Internal)
// AMX mul_mat is triggered automatically when:
// 1. The CPU supports AMX-INT8 and AVX-512 VNNI
// 2. The weight tensor uses the AMX buffer type
// 3. The weight type is a supported quantization format
// The compute engine checks tensor traits:
if (tensor->extra && tensor->extra->work_size) {
// Dispatches to AMX-optimized path
ggml_backend_amx_mul_mat(&params, dst);
}
Supported Quantization Types
// The following quantization types are supported:
// - GGML_TYPE_Q4_0, GGML_TYPE_Q4_1, GGML_TYPE_Q8_0
// - GGML_TYPE_Q4_K, GGML_TYPE_Q5_K, GGML_TYPE_Q6_K
// - GGML_TYPE_IQ4_XS
// - GGML_TYPE_F16, GGML_TYPE_BF16 (floating point)
Related Pages
- Ggml_org_Ggml_Cpu_backend_interface -- Registers AMX as an extra buffer type.
- Ggml_org_Ggml_Cpu_quantization -- Quantization primitives used by AMX kernels.
- Ggml_org_Ggml_Cpu_sgemm -- Alternative SGEMM path for non-AMX CPUs.
- Ggml_org_Ggml_Cpu_simd_mappings -- SIMD macros used alongside AMX tile operations.