Implementation:Ggml org Ggml Cpu sgemm

Metadata

Field	Value
Page Type	Implementation (SGEMM / tinyBLAS)
Knowledge Sources	GGML
Domains	ML_Infrastructure, Tensor_Computing, CPU_Backend, Quantized_Matrix_Multiplication
Last Updated	2025-05-15 12:00 GMT

Overview

Implements high-performance multithreaded CPU matrix multiplication (SGEMM) using tinyBLAS kernels optimized for cache-resident matrices, supporting both float and quantized formats.

Description

llamafile/sgemm.cpp originated in Mozilla's llamafile (tinyBLAS) project. It is a performance-critical component that provides optimized BLAS-like matrix multiplication for LLM inference workloads without external library dependencies. Key components include:

Architecture-specific tinyBLAS classes:
- tinyBLAS -- Generic SIMD for SSE/AVX/AVX-512/NEON with f32/f16/bf16 support.
- tinyBLAS_Q0_ARM -- Quantized ARM NEON with q4_0/q8_0 dot products using vdotq_s32 and vmmlaq_s32.
- tinyBLAS_Q0_AVX -- Quantized x86 AVX2/AVX-512 with _mm256_maddubs_epi16 for quantized dot products.
- tinyBLAS_RVV -- RISC-V vector intrinsics with variable-length vector processing.
- tinyBLAS_PPC / tinyBLAS_HP16_PPC -- PowerPC MMA (Matrix-Multiply Assist) via sgemm-ppc.h.
Vectorized arithmetic: Overloaded add/sub/mul/madd functions for SSE (__m128), AVX (__m256), AVX-512 (__m512), NEON (float32x4_t), and s390x VXE (float32x4_t).
FMA specialization: Template specialization of madd to use hardware FMA when available (_mm256_fmadd_ps, _mm512_fmadd_ps, AVX-512 BF16 _mm512_dpbf16_ps).
Register blocking: Uses the compile-time constant VECTOR_REGISTERS (32 for ARM NEON and AVX-512, 16 for x86 without AVX-512) to size the inner kernel tile so that the accumulators fit in the available vector registers.
Entry point: llamafile_sgemm dispatches to the appropriate class based on data types (f32, f16, bf16, q4_0, q8_0, q4_K, q5_K, q6_K, iq4_nl) and detected architecture features.
Transpose convention: Computes C = A^T * B which matches the common contiguous layout in GGML.
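The overloaded arithmetic, the madd hook, and the register-blocked tile described above can be illustrated with a small portable sketch. It is not the code from sgemm.cpp: `Vec4`, `load`, `hsum`, and `tile_kernel` are hypothetical names, `Vec4` is a plain struct standing in for a SIMD register such as `__m128` or `float32x4_t`, and the sketch assumes `k` is a multiple of 4 and that each row of A and B holds `k` contiguous floats.

```cpp
#include <cstdint>

// Hypothetical stand-in for a 4-lane SIMD register (not a real intrinsic type).
struct Vec4 {
    float v[4];
};

inline Vec4 load(const float *p) {
    Vec4 r;
    for (int i = 0; i < 4; ++i) r.v[i] = p[i];
    return r;
}

inline Vec4 add(Vec4 a, Vec4 b) {
    Vec4 r;
    for (int i = 0; i < 4; ++i) r.v[i] = a.v[i] + b.v[i];
    return r;
}

inline Vec4 mul(Vec4 a, Vec4 b) {
    Vec4 r;
    for (int i = 0; i < 4; ++i) r.v[i] = a.v[i] * b.v[i];
    return r;
}

// Generic multiply-add: a * b + c. A build with hardware FMA could
// specialize this function to map onto a single fused instruction.
inline Vec4 madd(Vec4 a, Vec4 b, Vec4 c) {
    return add(mul(a, b), c);
}

// Horizontal sum of the four lanes.
inline float hsum(Vec4 a) {
    return (a.v[0] + a.v[1]) + (a.v[2] + a.v[3]);
}

// Computes one RM x RN tile of C = A^T * B.
// RM * RN accumulators are kept live for the whole k loop, which is why
// the tile size has to be chosen against the number of vector registers.
template <int RM, int RN>
void tile_kernel(const float *A, int64_t lda,
                 const float *B, int64_t ldb,
                 float *C, int64_t ldc, int64_t k) {
    Vec4 acc[RN][RM] = {};  // zero-initialized accumulators
    for (int64_t l = 0; l < k; l += 4) {
        for (int j = 0; j < RN; ++j) {
            for (int i = 0; i < RM; ++i) {
                acc[j][i] = madd(load(A + lda * i + l),
                                 load(B + ldb * j + l),
                                 acc[j][i]);
            }
        }
    }
    for (int j = 0; j < RN; ++j) {
        for (int i = 0; i < RM; ++i) {
            C[ldc * j + i] = hsum(acc[j][i]);
        }
    }
}
```

Calling `tile_kernel<2, 2>(A, 4, B, 4, C, 2, 4)` with two 4-element rows in each of A and B fills a 2 x 2 block of C, one dot product per (row of A, row of B) pair.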

Usage

llamafile_sgemm is called from the compute engine during matrix-multiply dispatch when GGML_USE_LLAMAFILE is enabled. The path is disabled on ARM builds with SVE or MATMUL_INT8, which prefer KleidiAI.

Code Reference

Source Location

GGML repo, file: src/ggml-cpu/llamafile/sgemm.cpp (3681 lines).

Signature

bool llamafile_sgemm(
    const struct ggml_compute_params * params,
    int64_t m, int64_t n, int64_t k,
    const void * A, int64_t lda,
    const void * B, int64_t ldb,
    float * C, int64_t ldc,
    int Atype, int Btype);

Import

#include "llamafile/sgemm.h"

I/O Contract

Inputs

Parameter	Type	Required	Description
`params`	`const ggml_compute_params *`	Yes	Thread index and count for parallel execution.
`m, n, k`	`int64_t`	Yes	Matrix dimensions: result is m x n, inner dimension k.
`A`	`const void *`	Yes	First input matrix (may be quantized).
`B`	`const void *`	Yes	Second input matrix (may be quantized).
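`lda, ldb, ldc`	`int64_t`	Yes	Row strides (leading dimensions) of A, B, and C, expressed in units of the type size rather than raw bytes; the dispatch example on this page divides byte strides by `ggml_type_size` (or `sizeof(float)` for C).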
`Atype, Btype`	`int`	Yes	GGML type enums for A and B matrices.
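Since `params` carries the thread index and thread count, each thread has to decide which output tiles it owns. The following is a minimal sketch of one static way to do that split. The names `TileRange`, `tiles_for_thread`, and `count_tiles` are hypothetical, and the scheduling actually used in sgemm.cpp may differ.

```cpp
#include <algorithm>
#include <cstdint>

// Half-open range [start, end) of tile indices owned by one thread.
struct TileRange {
    int64_t start;
    int64_t end;
};

// Number of RM x RN tiles needed to cover an m x n result (rounding up).
inline int64_t count_tiles(int64_t m, int64_t n, int64_t RM, int64_t RN) {
    return ((m + RM - 1) / RM) * ((n + RN - 1) / RN);
}

// Hypothetical helper: gives thread `ith` of `nth` a contiguous share of
// `tiles` units of work. Threads past the end receive an empty range.
inline TileRange tiles_for_thread(int64_t tiles, int ith, int nth) {
    const int64_t duty  = (tiles + nth - 1) / nth;  // tiles per thread, rounded up
    const int64_t start = std::min<int64_t>(duty * ith, tiles);
    const int64_t end   = std::min<int64_t>(start + duty, tiles);
    return {start, end};
}
```

With 10 tiles and 4 threads, the first three threads each take 3 tiles and the last takes the remaining one; with fewer tiles than threads, the surplus threads get empty ranges and simply skip the work.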

Outputs

Output	Type	Description
Return value	`bool`	`true` if the operation was handled by llamafile SGEMM, `false` if the type combination is unsupported (fallback to default path).
`C`	`float *`	Result matrix: `C = A^T * B` in f32 format.
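For checking results, the `C = A^T * B` contract can be written as an unoptimized reference loop. This is a sketch under stated assumptions, not the library's code: `reference_sgemm` is a hypothetical name, all operands are f32, strides are in elements, A is stored as `m` rows of `k` contiguous values, B as `n` rows of `k` contiguous values, and C as `n` rows of `m` values.

```cpp
#include <cstdint>

// Reference (scalar) version of C = A^T * B for f32 inputs.
// Element (i, j) of the result is the dot product of row i of A
// and row j of B, stored at C[ldc * j + i].
inline void reference_sgemm(int64_t m, int64_t n, int64_t k,
                            const float *A, int64_t lda,
                            const float *B, int64_t ldb,
                            float *C, int64_t ldc) {
    for (int64_t j = 0; j < n; ++j) {
        for (int64_t i = 0; i < m; ++i) {
            float sum = 0.0f;
            for (int64_t l = 0; l < k; ++l) {
                sum += A[lda * i + l] * B[ldb * j + l];
            }
            C[ldc * j + i] = sum;
        }
    }
}
```

Because both operands are walked along their contiguous `k` dimension, every inner product reads sequential memory, which is the practical benefit of the transposed convention.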

Usage Examples

SGEMM Dispatch in Compute Engine (Internal)

#ifdef GGML_USE_LLAMAFILE
// Try llamafile SGEMM first
bool handled = llamafile_sgemm(&params,
    m, n, k,
    src0->data, nb01 / ggml_type_size(src0->type),
    src1->data, nb11 / ggml_type_size(src1->type),
    (float *)dst->data, dst->nb[1] / sizeof(float),
    src0->type, src1->type);

if (!handled) {
    // Fall back to default matrix multiply
    ggml_compute_forward_mul_mat_default(&params, dst);
}
#endif
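The stride arguments in the call above are byte strides divided by the type size, so for quantized types the resulting unit is one quantization block rather than one scalar. The sketch below works through that arithmetic. `TypeLayout`, `row_bytes`, and `stride_in_units` are hypothetical helpers and not GGML APIs; the block sizes assume the usual q8_0 layout (a 2-byte fp16 scale plus 32 int8 values) and q4_0 layout (a 2-byte fp16 scale plus 16 bytes of packed 4-bit values).

```cpp
#include <cstddef>
#include <cstdint>

// Illustrative layout descriptor (not a GGML type).
struct TypeLayout {
    int64_t elems_per_block;  // scalars represented by one block
    size_t  bytes_per_block;  // storage size of one block
};

constexpr TypeLayout kF32  = {1, 4};
constexpr TypeLayout kQ8_0 = {32, 34};  // assumed: 2-byte scale + 32 x int8
constexpr TypeLayout kQ4_0 = {32, 18};  // assumed: 2-byte scale + 16 packed bytes

// Bytes occupied by a contiguous row of n_elems scalars
// (n_elems is assumed to be a multiple of elems_per_block).
constexpr size_t row_bytes(int64_t n_elems, TypeLayout t) {
    return static_cast<size_t>(n_elems / t.elems_per_block) * t.bytes_per_block;
}

// Converts a byte stride into the unit used for lda/ldb in the call above:
// elements for f32, blocks for quantized types.
constexpr int64_t stride_in_units(size_t nb, TypeLayout t) {
    return static_cast<int64_t>(nb / t.bytes_per_block);
}
```

For a contiguous row of 4096 values, this gives a stride of 4096 for f32 (16384 bytes / 4) and 128 for q8_0 (4352 bytes / 34), which shows why the same argument cannot be read as a plain element count for every type.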

Related Pages

Ggml_org_Ggml_Cpu_sgemm_ppc -- PowerPC MMA specialization included by this file.
Ggml_org_Ggml_Cpu_compute_engine -- The compute engine that dispatches to llamafile_sgemm.
Ggml_org_Ggml_Cpu_amx_mmq -- Intel AMX: alternative accelerated matmul path.
Ggml_org_Ggml_Cpu_kleidiai_backend -- ARM KleidiAI: alternative accelerated matmul path.
Ggml_org_Ggml_Cpu_quantization -- Quantization types supported by SGEMM kernels.
