Implementation:Ggml org Ggml Cpu sgemm
Metadata
| Field | Value |
|---|---|
| Page Type | Implementation (SGEMM / tinyBLAS) |
| Knowledge Sources | GGML |
| Domains | ML_Infrastructure, Tensor_Computing, CPU_Backend, Quantized_Matrix_Multiplication |
| Last Updated | 2025-05-15 12:00 GMT |
Overview
Implements high-performance multithreaded CPU matrix multiplication (SGEMM) using tinyBLAS kernels optimized for cache-resident matrices, supporting both float and quantized formats.
Description
`llamafile/sgemm.cpp` originated in Mozilla's llamafile/tinyBLAS project. It provides optimized BLAS-like matrix multiplication without external library dependencies and is designed specifically for LLM inference workloads, where it is a performance-critical component. Key components include:
- Architecture-specific tinyBLAS classes:
  - `tinyBLAS` -- Generic SIMD for SSE/AVX/AVX-512/NEON with f32/f16/bf16 support.
  - `tinyBLAS_Q0_ARM` -- Quantized ARM NEON with q4_0/q8_0 dot products using `vdotq_s32` and `vmmlaq_s32`.
  - `tinyBLAS_Q0_AVX` -- Quantized x86 AVX2/AVX-512 with `_mm256_maddubs_epi16` for quantized dot products.
  - `tinyBLAS_RVV` -- RISC-V vector intrinsics with variable-length vector processing.
  - `tinyBLAS_PPC` / `tinyBLAS_HP16_PPC` -- PowerPC MMA (Matrix-Multiply Assist) via `sgemm-ppc.h`.
- Vectorized arithmetic: Overloaded `add`/`sub`/`mul`/`madd` functions for SSE (`__m128`), AVX (`__m256`), AVX-512 (`__m512`), NEON (`float32x4_t`), and s390x VXE (`float32x4_t`).
- FMA specialization: Template specialization of `madd` to use hardware FMA when available (`_mm256_fmadd_ps`, `_mm512_fmadd_ps`, AVX-512 BF16 `_mm512_dpbf16_ps`).
- Register blocking: Uses compile-time `VECTOR_REGISTERS` (16 for x86, 32 for ARM/AVX-512) to size the inner kernel tile for optimal register utilization.
- Entry point: `llamafile_sgemm` dispatches to the appropriate class based on data types (f32, f16, bf16, q4_0, q8_0, q4_K, q5_K, q6_K, iq4_nl) and detected architecture features.
- Transpose convention: Computes `C = A^T * B`, which matches the common contiguous layout in GGML.
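
The transpose convention and the register-blocking idea above can be illustrated with a minimal scalar sketch. It is not the tinyBLAS code: the function names `ref_gemm_tn` and `tiled_gemm_tn` are hypothetical, and the layout (row `i` of A and row `j` of B each hold `k` contiguous floats, with element (i, j) of C stored at `C[ldc * j + i]`) is an assumption made for illustration. The tiled variant keeps an `RM x RN` block of accumulators in local variables, which is the role the SIMD registers play in the real kernels.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Scalar reference for C = A^T * B.
// Assumed layout: A[lda * i + l] and B[ldb * j + l] walk the shared
// dimension l contiguously; C(i, j) is written to C[ldc * j + i].
void ref_gemm_tn(int64_t m, int64_t n, int64_t k,
                 const float *A, int64_t lda,
                 const float *B, int64_t ldb,
                 float *C, int64_t ldc) {
    for (int64_t j = 0; j < n; ++j) {
        for (int64_t i = 0; i < m; ++i) {
            float sum = 0.0f;
            for (int64_t l = 0; l < k; ++l) {
                sum += A[lda * i + l] * B[ldb * j + l];
            }
            C[ldc * j + i] = sum;
        }
    }
}

// Hypothetical register-tiled variant: each RM x RN tile of C is accumulated
// in a small local array before being stored, so every loaded value of A and
// B is reused across the tile. Edge tiles are clamped to the matrix bounds.
template <int RM, int RN>
void tiled_gemm_tn(int64_t m, int64_t n, int64_t k,
                   const float *A, int64_t lda,
                   const float *B, int64_t ldb,
                   float *C, int64_t ldc) {
    for (int64_t j0 = 0; j0 < n; j0 += RN) {
        for (int64_t i0 = 0; i0 < m; i0 += RM) {
            const int64_t mi = std::min<int64_t>(RM, m - i0);
            const int64_t nj = std::min<int64_t>(RN, n - j0);
            float acc[RN][RM] = {};  // tile accumulators, zero-initialized
            for (int64_t l = 0; l < k; ++l) {
                for (int64_t j = 0; j < nj; ++j) {
                    for (int64_t i = 0; i < mi; ++i) {
                        acc[j][i] += A[lda * (i0 + i) + l] * B[ldb * (j0 + j) + l];
                    }
                }
            }
            for (int64_t j = 0; j < nj; ++j) {
                for (int64_t i = 0; i < mi; ++i) {
                    C[ldc * (j0 + j) + (i0 + i)] = acc[j][i];
                }
            }
        }
    }
}
```

In the real kernels the tile dimensions would be chosen so that the accumulators fit within the `VECTOR_REGISTERS` budget; the `2 x 2` or similar sizes used with this sketch are arbitrary.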
Usage
Called from the compute engine during matrix multiply dispatch when GGML_USE_LLAMAFILE is enabled (disabled on ARM with SVE or MATMUL_INT8, which prefer KleidiAI).
Code Reference
Source Location
GGML repo, file: src/ggml-cpu/llamafile/sgemm.cpp (3681 lines).
Signature
bool llamafile_sgemm(
const struct ggml_compute_params * params,
int64_t m, int64_t n, int64_t k,
const void * A, int64_t lda,
const void * B, int64_t ldb,
float * C, int64_t ldc,
int Atype, int Btype);
Import
#include "llamafile/sgemm.h"
I/O Contract
Inputs
| Parameter | Type | Required | Description |
|---|---|---|---|
| `params` | `const ggml_compute_params *` | Yes | Thread index and count for parallel execution. |
| `m`, `n`, `k` | `int64_t` | Yes | Matrix dimensions: result is m x n, inner dimension k. |
| `A` | `const void *` | Yes | First input matrix (may be quantized). |
| `B` | `const void *` | Yes | Second input matrix (may be quantized). |
| `lda`, `ldb`, `ldc` | `int64_t` | Yes | Row strides (leading dimensions) of A, B, and C, in units of the type size: elements for float types, blocks for quantized types (see the byte-stride division in the usage example). |
| `Atype`, `Btype` | `int` | Yes | GGML type enums for A and B matrices. |
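
The `params` argument supplies a thread index and a thread count, which lets each thread work on its own share of the output. The sketch below shows one simple static partitioning scheme under that assumption. The helper `thread_range` is hypothetical, and the actual scheduling inside `sgemm.cpp` may differ (for example, it could hand out work dynamically).

```cpp
#include <algorithm>
#include <cstdint>
#include <utility>

// Hypothetical helper: returns the half-open range [start, end) of work items
// (for example, output tiles) assigned to thread `ith` out of `nth` threads.
// Threads beyond the available work receive an empty range.
std::pair<int64_t, int64_t> thread_range(int64_t total, int ith, int nth) {
    const int64_t duty  = (total + nth - 1) / nth;  // ceil(total / nth)
    const int64_t start = std::min<int64_t>(duty * ith, total);
    const int64_t end   = std::min<int64_t>(start + duty, total);
    return {start, end};
}
```

With 10 tiles and 4 threads, this assigns tiles 0-2, 3-5, 6-8, and 9 to threads 0 through 3. Because the ranges are disjoint, the threads can write to C without synchronization.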
Outputs
| Output | Type | Description |
|---|---|---|
| Return value | `bool` | `true` if the operation was handled by llamafile SGEMM, `false` if the type combination is unsupported (the caller falls back to the default path). |
| `C` | `float *` | Result matrix: `C = A^T * B` in f32 format. |
Usage Examples
SGEMM Dispatch in Compute Engine (Internal)
#ifdef GGML_USE_LLAMAFILE
// Try llamafile SGEMM first
bool handled = llamafile_sgemm(&params,
    m, n, k,
    src0->data, nb01 / ggml_type_size(src0->type),
    src1->data, nb11 / ggml_type_size(src1->type),
    (float *)dst->data, dst->nb[1] / sizeof(float),
    src0->type, src1->type);
if (!handled) {
    // Fall back to default matrix multiply
    ggml_compute_forward_mul_mat_default(&params, dst);
}
#endif
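
The stride arguments in the call above are byte strides divided by the type size. The following sketch works through that arithmetic for a float row and a block-quantized row. `TypeInfo`, `row_bytes`, and `leading_dim` are hypothetical names used only here, and the sizes are illustrative: 4 bytes per f32 element, and a q8_0-style block of 32 int8 values plus a 2-byte scale (34 bytes).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical description of a storage type.
struct TypeInfo {
    size_t  type_size;   // bytes per storage unit (element or block)
    int64_t block_size;  // logical values held by one storage unit
};

// Illustrative values, stated as assumptions in the text above.
constexpr TypeInfo kF32  = {4, 1};
constexpr TypeInfo kQ8_0 = {34, 32};

// Byte stride of a densely packed row holding `ne` logical values.
inline size_t row_bytes(const TypeInfo &t, int64_t ne) {
    assert(ne % t.block_size == 0);  // rows must contain whole blocks
    return static_cast<size_t>(ne / t.block_size) * t.type_size;
}

// Leading dimension as formed in the call above: byte stride / type size.
// The result counts elements for f32 and blocks for the quantized type.
inline int64_t leading_dim(const TypeInfo &t, size_t nb1) {
    return static_cast<int64_t>(nb1 / t.type_size);
}
```

For a row of 4096 values, the f32 stride is 16384 bytes and the leading dimension is 4096 elements, while the q8_0-style stride is 4352 bytes and the leading dimension is 128 blocks.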
Related Pages
- Ggml_org_Ggml_Cpu_sgemm_ppc -- PowerPC MMA specialization included by this file.
- Ggml_org_Ggml_Cpu_compute_engine -- The compute engine that dispatches to llamafile_sgemm.
- Ggml_org_Ggml_Cpu_amx_mmq -- Intel AMX: alternative accelerated matmul path.
- Ggml_org_Ggml_Cpu_kleidiai_backend -- ARM KleidiAI: alternative accelerated matmul path.
- Ggml_org_Ggml_Cpu_quantization -- Quantization types supported by SGEMM kernels.