Implementation:Ggml org Ggml Cpu kleidiai backend
Metadata
| Field | Value |
|---|---|
| Page Type | Implementation (KleidiAI Backend Integration) |
| Knowledge Sources | GGML |
| Domains | ML_Infrastructure, Tensor_Computing, CPU_Backend, Quantized_Matrix_Multiplication |
| Last Updated | 2025-05-15 12:00 GMT |
Overview
Implements the GGML backend integration for Arm's KleidiAI library, providing optimized quantized matrix multiplication with automatic CPU feature detection and kernel selection.
Description
kleidiai/kleidiai.cpp is the main integration point that makes Arm KleidiAI's highly-optimized micro-kernels available as a GGML CPU backend. Key components include:
- Context singleton: ggml_kleidiai_context holds the detected CPU features and the selected q4/q8 kernel sets.
- CPU feature detection: init_kleidiai_context() runs once under a critical section, detecting:
  - DOTPROD (dot product instructions)
  - I8MM (int8 matrix multiply)
  - SVE (Scalable Vector Extension; used only when the SVE count matches QK8_0 = 32)
  - SME (Streaming Matrix Extensions; opt-in via the GGML_KLEIDIAI_SME environment variable)
- Kernel selection: Calls ggml_kleidiai_select_kernels_q4_0 and ggml_kleidiai_select_kernels_q8_0 to find the best kernel for the detected features.
- Tensor traits: Implements ggml::cpu::kleidiai::tensor_traits for:
  - Work size: Calculates the LHS packing buffer size per operation.
  - Compute: Performs tiled GEMM/GEMV with LHS quantization packing, RHS access from pre-packed buffers, and multi-threaded dispatch.
- Extra buffer type: Provides ggml_backend_cpu_kleidiai_buffer_type(), which intercepts GGML_OP_MUL_MAT on appropriately typed tensors, packs RHS weights into KleidiAI format, and dispatches to optimized kernels.
- Data layout helpers: transpose_f32kxn_f16nxk transposes an f16 matrix while converting it to f32 for LHS packing.
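The multi-threaded dispatch mentioned under Compute can be pictured as each thread taking a contiguous, step-aligned slice of one output dimension. The following is a minimal sketch under that assumption, not the code from kleidiai.cpp; split_range, work_range, and round_up are hypothetical names, and the step value stands in for whatever tile step the selected kernel reports.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical helper: round value up to the next multiple of step.
static size_t round_up(size_t value, size_t step) {
    return (value + step - 1) / step * step;
}

// Half-open slice [start, start + count) of one output dimension.
struct work_range {
    size_t start;
    size_t count;
};

// Slice of `extent` handled by thread `ith` out of `nth` threads.
// Slices are sized as ceil(extent / nth) rounded up to the kernel step,
// so every slice except possibly the last starts and ends on a step boundary.
static work_range split_range(size_t extent, size_t step, size_t ith, size_t nth) {
    assert(step > 0 && nth > 0 && ith < nth);
    const size_t per_thread = round_up((extent + nth - 1) / nth, step);
    const size_t start      = std::min(ith * per_thread, extent);
    const size_t count      = std::min(per_thread, extent - start);
    return { start, count };
}
```

With an extent of 100, a step of 8, and 4 threads, the first three threads each receive 32 elements and the last receives the remaining 4. A thread whose slice is empty (count of 0) simply has no work for that operation.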
Usage
KleidiAI acceleration is activated automatically on Arm CPUs that support dotprod, I8MM, SVE, or SME, provided the build defines GGML_USE_CPU_KLEIDIAI. The backend registers itself as an extra buffer type.
Code Reference
Source Location
GGML repo, file: src/ggml-cpu/kleidiai/kleidiai.cpp (798 lines).
Signature
// Backend buffer type registration
ggml_backend_buffer_type_t ggml_backend_cpu_kleidiai_buffer_type(void);
// Kernel selection (used internally)
ggml_kleidiai_kernels * ggml_kleidiai_select_kernels(
cpu_feature features, const struct ggml_tensor * op);
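One common way to implement feature-driven kernel selection is an ordered table in which each entry lists the CPU features it requires, with the first fully satisfied entry winning. The sketch below illustrates that idea only; the feature bits, kernel_entry, kernel_table, and select_kernel are hypothetical and are not the actual cpu_feature enum or ggml_kleidiai_kernels structure.

```cpp
#include <cstdint>

// Hypothetical feature bits, standing in for the real cpu_feature values.
constexpr uint32_t FEATURE_NONE    = 0;
constexpr uint32_t FEATURE_DOTPROD = 1u << 0;
constexpr uint32_t FEATURE_I8MM    = 1u << 1;
constexpr uint32_t FEATURE_SVE     = 1u << 2;
constexpr uint32_t FEATURE_SME     = 1u << 3;

// Hypothetical table entry: a kernel label plus the features it needs.
struct kernel_entry {
    const char * name;
    uint32_t     required;
};

// Ordered from most to least specialized (an assumed priority order).
static const kernel_entry kernel_table[] = {
    { "sme",          FEATURE_SME },
    { "dotprod_i8mm", FEATURE_DOTPROD | FEATURE_I8MM },
    { "dotprod",      FEATURE_DOTPROD },
};

// Return the first entry whose required features are all present,
// or nullptr when no entry matches.
static const kernel_entry * select_kernel(uint32_t features) {
    for (const kernel_entry & entry : kernel_table) {
        if ((features & entry.required) == entry.required) {
            return &entry;
        }
    }
    return nullptr;
}
```

A nullptr result in this sketch corresponds to the case where no suitable kernels are available and the caller falls back to another path.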
Import
#include "kleidiai/kleidiai.h"
I/O Contract
Inputs
| Parameter | Type | Required | Description |
|---|---|---|---|
| CPU features | Hardware detection | Automatic | Detected at init via ggml_cpu_has_dotprod(), ggml_cpu_has_matmul_int8(), ggml_cpu_has_sve(), ggml_cpu_has_sme(). |
| GGML_KLEIDIAI_SME | Environment variable | No | Set to non-zero to enable SME kernels (opt-in due to potential stability considerations). |
| op | const struct ggml_tensor * | Yes (select) | The mul_mat operation tensor for kernel selection based on weight type. |
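The inputs above combine into a single feature mask. The following sketch shows one plausible way to do that, assuming the environment variable is parsed as an integer; cpu_probe, sme_opted_in, and build_feature_mask are hypothetical names, and the probe struct stands in for the ggml_cpu_has_*() results so the logic can be exercised without Arm hardware.

```cpp
#include <cstdint>
#include <cstdlib>

// Hypothetical feature bits for illustration.
constexpr uint32_t F_DOTPROD = 1u << 0;
constexpr uint32_t F_I8MM    = 1u << 1;
constexpr uint32_t F_SVE     = 1u << 2;
constexpr uint32_t F_SME     = 1u << 3;

// Hypothetical snapshot of what the hardware probes report.
struct cpu_probe {
    bool dotprod;
    bool i8mm;
    bool sve;
    bool sme;
    int  sve_cnt; // SVE count as reported by the runtime
};

// Assumption: "non-zero" means the value parses to a non-zero integer.
static bool sme_opted_in(const char * env_value) {
    return env_value != nullptr && std::atoi(env_value) != 0;
}

static uint32_t build_feature_mask(const cpu_probe & probe, const char * sme_env) {
    const int qk8_0 = 32; // QK8_0 as stated in the description
    uint32_t mask = 0;
    if (probe.dotprod)                      mask |= F_DOTPROD;
    if (probe.i8mm)                         mask |= F_I8MM;
    if (probe.sve && probe.sve_cnt == qk8_0) mask |= F_SVE; // SVE only when the count matches
    if (probe.sme && sme_opted_in(sme_env)) mask |= F_SME;  // SME only on explicit opt-in
    return mask;
}
```

In a real process the second argument would come from std::getenv("GGML_KLEIDIAI_SME"), which returns a null pointer when the variable is unset.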
Outputs
| Output | Type | Description |
|---|---|---|
| Buffer type | ggml_backend_buffer_type_t | KleidiAI buffer type for the CPU backend, or NULL if no suitable kernels are available. |
| Matrix result | float * | Output of the optimized quantized matrix multiplication. |
Usage Examples
Automatic KleidiAI Activation
#include "ggml-cpu.h"
#include "ggml-backend.h"
// KleidiAI is automatically enabled when building with GGML_USE_CPU_KLEIDIAI
// and running on a supported ARM processor.
// Create CPU backend (KleidiAI buffer type is auto-registered)
ggml_backend_t cpu = ggml_backend_cpu_init();
// Tensors using q4_0 or q8_0 quantization will automatically
// use KleidiAI-optimized matrix multiplication when:
// 1. The CPU supports dotprod, I8MM, SVE, or SME
// 2. The weight tensor is allocated via the KleidiAI buffer type
Enabling SME Kernels
// To enable SME (Streaming Matrix Extensions) kernels:
// Set environment variable before running:
// export GGML_KLEIDIAI_SME=1
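When exporting the variable from a shell is not practical, a process can set it for itself. This sketch assumes a POSIX platform (setenv is not part of standard C++), and request_sme_kernels is a hypothetical helper name. Because feature detection is described as running once, the call presumably has to happen before the first KleidiAI use in the process.

```cpp
#include <cstdlib>
#include <stdlib.h> // setenv is declared here on POSIX systems

// Hypothetical helper: opt in to SME kernels for the current process.
// Returns true when the environment variable was set successfully.
static bool request_sme_kernels() {
    // Third argument 1 means an existing value is overwritten.
    return setenv("GGML_KLEIDIAI_SME", "1", 1) == 0;
}
```

Calling request_sme_kernels() at the top of main(), before any backend or buffer type is created, keeps the ordering unambiguous.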
Related Pages
- Ggml_org_Ggml_Cpu_kleidiai_kernels -- KleidiAI micro-kernel wrappers used by this backend.
- Ggml_org_Ggml_Cpu_backend_interface -- Registers KleidiAI as an extra buffer type.
- Ggml_org_Ggml_Cpu_amx_mmq -- Intel AMX: analogous accelerated matmul for x86.
- Ggml_org_Ggml_Cpu_weight_repack -- Generic weight repacking (alternative optimization path).