Implementation:Ggml org Ggml Cpu kleidiai backend

Metadata

Field	Value
Page Type	Implementation (KleidiAI Backend Integration)
Knowledge Sources	GGML
Domains	ML_Infrastructure, Tensor_Computing, CPU_Backend, Quantized_Matrix_Multiplication
Last Updated	2025-05-15 12:00 GMT

Overview

Implements the GGML backend integration for Arm's KleidiAI library, providing optimized quantized matrix multiplication with automatic CPU feature detection and kernel selection.

Description

kleidiai/kleidiai.cpp is the main integration point that exposes Arm KleidiAI's highly optimized micro-kernels through the GGML CPU backend. Key components include:

Context singleton: ggml_kleidiai_context holds detected CPU features and selected q4/q8 kernel sets.
CPU feature detection: init_kleidiai_context() runs once under a critical section, detecting:
- DOTPROD (dot product instructions)
- I8MM (int8 matrix multiply)
- SVE (Scalable Vector Extension; accepted only when the reported SVE count equals QK8_0 = 32)
- SME (Streaming Matrix Extensions, opt-in via GGML_KLEIDIAI_SME environment variable)
Kernel selection: Calls ggml_kleidiai_select_kernels_q4_0 and ggml_kleidiai_select_kernels_q8_0 to find the best kernel for the detected features.
Tensor traits: Implements ggml::cpu::kleidiai::tensor_traits for:
- Work size: Calculates LHS packing buffer size per operation.
- Compute: Performs tiled GEMM/GEMV with LHS quantization packing, RHS access from pre-packed buffers, and multi-threaded dispatch.
Extra buffer type: Provides ggml_backend_cpu_kleidiai_buffer_type() that intercepts GGML_OP_MUL_MAT on appropriately typed tensors, packs RHS weights into KleidiAI format, and dispatches to optimized kernels.
Data layout helpers: transpose_f32kxn_f16nxk transposes an n×k f16 matrix into a k×n f32 matrix for LHS packing.
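
The feature-detection step above can be sketched roughly as follows. This is a minimal illustration, not the code in kleidiai.cpp: the `cpu_probe` struct, the `FEAT_*` bit values, `detect_features`, and `get_context` are all hypothetical names, and a function-local static stands in for the critical section mentioned above as one simple way to get run-once initialization.

```cpp
#include <cstdint>

// Hypothetical feature bitmask; the real enum in the GGML sources may differ.
enum cpu_feature_bits : uint32_t {
    FEAT_NONE    = 0,
    FEAT_DOTPROD = 1u << 0,
    FEAT_I8MM    = 1u << 1,
    FEAT_SVE     = 1u << 2,
    FEAT_SME     = 1u << 3,
};

// Hypothetical snapshot of what the hardware probes report.
struct cpu_probe {
    bool has_dotprod;
    bool has_i8mm;
    bool has_sve;
    int  sve_cnt;     // value compared against QK8_0
    bool has_sme;
    bool sme_opt_in;  // true when the user opted in to SME
};

constexpr int kQK8_0 = 32;  // block size quoted in the list above

// Fold the probe results into a mask, applying the two gates described above:
// SVE only counts when its count equals QK8_0, SME only when opted in.
inline uint32_t detect_features(const cpu_probe & p) {
    uint32_t f = FEAT_NONE;
    if (p.has_dotprod)                    f |= FEAT_DOTPROD;
    if (p.has_i8mm)                       f |= FEAT_I8MM;
    if (p.has_sve && p.sve_cnt == kQK8_0) f |= FEAT_SVE;
    if (p.has_sme && p.sme_opt_in)        f |= FEAT_SME;
    return f;
}

// Singleton-style context holding the detected features.
struct kleidiai_like_context {
    uint32_t features;
};

// The function-local static is initialized exactly once (thread-safe since
// C++11); later calls return the same object and ignore their argument.
inline const kleidiai_like_context & get_context(const cpu_probe & p) {
    static const kleidiai_like_context ctx{ detect_features(p) };
    return ctx;
}
```

Because the context is built once, a second call with different probe values still returns the features recorded by the first call, which mirrors the "detect once, reuse everywhere" behavior described for the context singleton.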

Usage

KleidiAI acceleration is activated automatically on Arm CPUs with dotprod, I8MM, or SVE support when the build includes GGML_USE_CPU_KLEIDIAI; SME kernels are used only when the GGML_KLEIDIAI_SME environment variable is also set. The backend registers itself as an extra buffer type.

Code Reference

Source Location

GGML repo, file: src/ggml-cpu/kleidiai/kleidiai.cpp (798 lines).

Signature

// Backend buffer type registration
ggml_backend_buffer_type_t ggml_backend_cpu_kleidiai_buffer_type(void);

// Kernel selection (used internally)
ggml_kleidiai_kernels * ggml_kleidiai_select_kernels(
    cpu_feature features, const struct ggml_tensor * op);
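
To illustrate what selecting a kernel "for the detected features" can look like, here is a small sketch under stated assumptions. The table contents, the `kernel_entry` struct, the `F_*` bits, and `select_kernel` are hypothetical and do not reproduce the actual KleidiAI kernel table; the sketch only shows the general pattern of scanning a preference-ordered table for the first entry whose required features are all present.

```cpp
#include <cstdint>

// Hypothetical stand-ins for the weight type and CPU feature bits.
enum weight_type { WEIGHT_Q4_0, WEIGHT_Q8_0, WEIGHT_OTHER };

constexpr uint32_t F_DOTPROD = 1u << 0;
constexpr uint32_t F_I8MM    = 1u << 1;
constexpr uint32_t F_SVE     = 1u << 2;
constexpr uint32_t F_SME     = 1u << 3;

struct kernel_entry {
    const char * name;      // illustrative label only
    weight_type  type;      // weight format the kernel handles
    uint32_t     required;  // every bit here must be present on the CPU
};

// Illustrative table, ordered from most to least preferred.
static const kernel_entry k_kernels[] = {
    { "q4_0_sme",     WEIGHT_Q4_0, F_SME },
    { "q4_0_i8mm",    WEIGHT_Q4_0, F_DOTPROD | F_I8MM },
    { "q4_0_dotprod", WEIGHT_Q4_0, F_DOTPROD },
    { "q8_0_i8mm",    WEIGHT_Q8_0, F_DOTPROD | F_I8MM },
    { "q8_0_dotprod", WEIGHT_Q8_0, F_DOTPROD },
};

// Return the first entry matching the weight type whose required features
// are a subset of the detected ones, or nullptr when nothing fits.
inline const kernel_entry * select_kernel(uint32_t features, weight_type type) {
    for (const kernel_entry & k : k_kernels) {
        if (k.type == type && (features & k.required) == k.required) {
            return &k;
        }
    }
    return nullptr;
}
```

A null result corresponds to the "no suitable kernels" case, where the caller falls back to another code path.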

Import

#include "kleidiai/kleidiai.h"

I/O Contract

Inputs

Parameter	Type	Required	Description
CPU features	Hardware detection	Automatic	Detected at init via `ggml_cpu_has_dotprod()`, `ggml_cpu_has_matmul_int8()`, `ggml_cpu_has_sve()`, `ggml_cpu_has_sme()`.
`GGML_KLEIDIAI_SME`	Environment variable	No	Set to a non-zero value to enable SME kernels (opt-in due to potential stability considerations).
`op`	`const struct ggml_tensor *`	Yes (select)	The mul_mat operation tensor for kernel selection based on weight type.

Outputs

Output	Type	Description
Buffer type	`ggml_backend_buffer_type_t`	KleidiAI buffer type for the CPU backend, or `NULL` if no suitable kernels are available.
Matrix result	`float *`	Output of the optimized quantized matrix multiplication.
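
The matrix result is produced cooperatively by the worker threads. One plausible way to split the output among them, shown here purely as a sketch, is to give each thread a contiguous range of output columns rounded up to the kernel's tile step. The names `col_range`, `thread_slice`, and the `n_step` parameter are assumptions made for this illustration rather than the backend's actual interface.

```cpp
#include <algorithm>
#include <cstddef>

// Contiguous range of output columns assigned to one thread.
struct col_range {
    size_t start;
    size_t count;
};

// Round x up to the next multiple of m (m > 0).
inline size_t round_up(size_t x, size_t m) {
    return (x + m - 1) / m * m;
}

// Split n output columns across nth threads so that every range except
// possibly the last begins and ends on a multiple of n_step, the
// (hypothetical) tile width a micro-kernel processes per call.
inline col_range thread_slice(size_t n, size_t n_step, size_t ith, size_t nth) {
    const size_t per_thread = round_up((n + nth - 1) / nth, n_step);
    const size_t start      = std::min(ith * per_thread, n);
    const size_t count      = std::min(per_thread, n - start);
    return { start, count };
}
```

Threads whose range comes out empty (count of zero) simply have nothing to compute, which can happen when the matrix is small relative to the tile step and thread count.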

Usage Examples

Automatic KleidiAI Activation

#include "ggml-cpu.h"
#include "ggml-backend.h"

// KleidiAI is automatically enabled when building with GGML_USE_CPU_KLEIDIAI
// and running on a supported ARM processor.

int main(void) {
    // Create CPU backend (KleidiAI buffer type is auto-registered)
    ggml_backend_t cpu = ggml_backend_cpu_init();

    // Tensors using q4_0 or q8_0 quantization will automatically
    // use KleidiAI-optimized matrix multiplication when:
    // 1. The CPU supports dotprod, I8MM, SVE, or SME
    //    (SME only when GGML_KLEIDIAI_SME is set)
    // 2. The weight tensor is allocated via the KleidiAI buffer type

    // Release the backend when done
    ggml_backend_free(cpu);
    return 0;
}

Enabling SME Kernels

// To enable SME (Streaming Matrix Extensions) kernels:
// Set environment variable before running:
// export GGML_KLEIDIAI_SME=1
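
As a rough illustration of the "non-zero enables SME" rule, the check could be written along these lines. This is a sketch, not the backend's code: `sme_opted_in` and `sme_opted_in_from_env` are hypothetical helper names, and parsing with `std::atoi` is an assumption about how the value is interpreted.

```cpp
#include <cstdlib>

// Interpret a raw environment value: unset, empty, "0", or non-numeric
// text all leave SME disabled; any non-zero integer enables it.
inline bool sme_opted_in(const char * value) {
    return value != nullptr && std::atoi(value) != 0;
}

// Read the GGML_KLEIDIAI_SME variable named in this page.
inline bool sme_opted_in_from_env() {
    return sme_opted_in(std::getenv("GGML_KLEIDIAI_SME"));
}
```

Under this interpretation a value such as `yes` would not enable SME, so a numeric value like `1` is the safe choice. Since detection runs once, the variable needs to be set before the backend is first initialized.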

Related Pages

Ggml_org_Ggml_Cpu_kleidiai_kernels -- KleidiAI micro-kernel wrappers used by this backend.
Ggml_org_Ggml_Cpu_backend_interface -- Registers KleidiAI as an extra buffer type.
Ggml_org_Ggml_Cpu_amx_mmq -- Intel AMX: analogous accelerated matmul for x86.
Ggml_org_Ggml_Cpu_weight_repack -- Generic weight repacking (alternative optimization path).
