Implementation:Ggml org Ggml Cpu sgemm

From Leeroopedia


Metadata

Field | Value
Page Type | Implementation (SGEMM / tinyBLAS)
Knowledge Sources | GGML
Domains | ML_Infrastructure, Tensor_Computing, CPU_Backend, Quantized_Matrix_Multiplication
Last Updated | 2025-05-15 12:00 GMT

Overview

Implements high-performance multithreaded CPU matrix multiplication (SGEMM) using tinyBLAS kernels optimized for cache-resident matrices, supporting both float and quantized formats.

Description

llamafile/sgemm.cpp originated in Mozilla's llamafile/tinyBLAS project. It provides optimized BLAS-like matrix multiplication with no external library dependencies and is designed specifically for LLM inference workloads, which makes it a critical performance component. Key components include:

  1. Architecture-specific tinyBLAS classes:
    • tinyBLAS -- Generic SIMD for SSE/AVX/AVX-512/NEON with f32/f16/bf16 support.
    • tinyBLAS_Q0_ARM -- Quantized ARM NEON with q4_0/q8_0 dot products using vdotq_s32 and vmmlaq_s32.
    • tinyBLAS_Q0_AVX -- Quantized x86 AVX2/AVX-512 with _mm256_maddubs_epi16 for quantized dot products.
    • tinyBLAS_RVV -- RISC-V vector intrinsics with variable-length vector processing.
    • tinyBLAS_PPC / tinyBLAS_HP16_PPC -- PowerPC MMA (Matrix-Multiply Assist) via sgemm-ppc.h.
  2. Vectorized arithmetic: Overloaded add/sub/mul/madd functions for SSE (__m128), AVX (__m256), AVX-512 (__m512), NEON (float32x4_t), and s390x VXE (float32x4_t).
  3. FMA specialization: Template specialization of madd to use hardware FMA when available (_mm256_fmadd_ps, _mm512_fmadd_ps, AVX-512 BF16 _mm512_dpbf16_ps).
  4. Register blocking: Uses compile-time VECTOR_REGISTERS (16 for x86, 32 for ARM/AVX-512) to size the inner kernel tile for optimal register utilization.
  5. Entry point: llamafile_sgemm dispatches to the appropriate class based on data types (f32, f16, bf16, q4_0, q8_0, q4_K, q5_K, q6_K, iq4_nl) and detected architecture features.
  6. Transpose convention: Computes C = A^T * B, which matches the common contiguous layout in GGML.
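
The register blocking and transpose convention described above can be illustrated with a minimal scalar sketch. It is not the code in sgemm.cpp: the function name gemm_tiled, the tile parameters RM and RN, and the memory layout stated in the comments are assumptions made for illustration, and std::fma stands in for the overloaded SIMD madd helpers. In a real kernel the RM x RN accumulators plus the loaded operands would have to fit within the VECTOR_REGISTERS budget.

```cpp
#include <cmath>
#include <cstdint>

// Scalar stand-in for the overloaded SIMD madd helpers: returns a * b + c.
// std::fma plays the role of a hardware fused multiply-add here.
inline float madd(float a, float b, float c) { return std::fma(a, b, c); }

// Illustrative kernel computing C = A^T * B with an RM x RN register tile.
// Assumed layout (an assumption of this sketch, not taken from the page):
//   A: m rows of k values, row stride lda
//   B: n rows of k values, row stride ldb
//   C: C[j * ldc + i] = dot(row i of A, row j of B)
template <int RM, int RN>
void gemm_tiled(int64_t m, int64_t n, int64_t k,
                const float *A, int64_t lda,
                const float *B, int64_t ldb,
                float *C, int64_t ldc) {
    for (int64_t j0 = 0; j0 < n; j0 += RN) {
        for (int64_t i0 = 0; i0 < m; i0 += RM) {
            // Accumulators for one tile; small enough to live in registers.
            float acc[RN][RM] = {};
            for (int64_t l = 0; l < k; ++l) {
                for (int j = 0; j < RN && j0 + j < n; ++j) {
                    for (int i = 0; i < RM && i0 + i < m; ++i) {
                        acc[j][i] = madd(A[(i0 + i) * lda + l],
                                         B[(j0 + j) * ldb + l],
                                         acc[j][i]);
                    }
                }
            }
            // Write the finished tile back, skipping out-of-range edges.
            for (int j = 0; j < RN && j0 + j < n; ++j)
                for (int i = 0; i < RM && i0 + i < m; ++i)
                    C[(j0 + j) * ldc + (i0 + i)] = acc[j][i];
        }
    }
}
```

Because both operands are read along contiguous rows of length k, each output value is a plain dot product of two contiguous vectors, which is the access pattern the transpose convention is meant to enable.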

Usage

Called from the compute engine during matrix-multiply dispatch when GGML_USE_LLAMAFILE is defined. The llamafile path is disabled on ARM builds with SVE or MATMUL_INT8, where KleidiAI is preferred.

Code Reference

Source Location

GGML repo, file: src/ggml-cpu/llamafile/sgemm.cpp (3681 lines).

Signature

bool llamafile_sgemm(
    const struct ggml_compute_params * params,
    int64_t m, int64_t n, int64_t k,
    const void * A, int64_t lda,
    const void * B, int64_t ldb,
    float * C, int64_t ldc,
    int Atype, int Btype);

Import

#include "llamafile/sgemm.h"

I/O Contract

Inputs

Parameter | Type | Required | Description
params | const ggml_compute_params * | Yes | Thread index and count for parallel execution.
m, n, k | int64_t | Yes | Matrix dimensions: result is m x n, inner dimension k.
A | const void * | Yes | First input matrix (may be quantized).
B | const void * | Yes | Second input matrix (may be quantized).
lda, ldb, ldc | int64_t | Yes | Leading dimensions (row strides). The usage example passes byte strides divided by the type size, so for quantized types the unit is one block rather than one scalar element.
Atype, Btype | int | Yes | GGML type enums for A and B matrices.
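
To make the leading-dimension units concrete, here is a small sketch of the byte-to-unit conversion that the usage example performs with ggml_type_size. The names BlockQ8 and row_stride_units are hypothetical; BlockQ8 is only a stand-in that mirrors the q8_0 layout (a 16-bit scale followed by 32 int8 values) and is not the GGML struct itself.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for a q8_0-style block: one 16-bit scale plus 32 quants.
struct BlockQ8 {
    uint16_t d;      // scale, stored as raw 16 bits in this sketch
    int8_t   qs[32]; // quantized values
};
static_assert(sizeof(BlockQ8) == 34, "block is expected to be 34 bytes");

// Converts a row stride in bytes into a stride in units of the storage type:
// scalar elements for float data, whole blocks for quantized data.
constexpr int64_t row_stride_units(size_t row_bytes, size_t type_size) {
    return static_cast<int64_t>(row_bytes / type_size);
}
```

Under these assumptions, a float row of 4096 values has a stride of 4096 units, while a quantized row of 4096 values packed into 32-value blocks has a stride of 128 units.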

Outputs

Output | Type | Description
Return value | bool | true if the operation was handled by llamafile SGEMM, false if the type combination is unsupported (fallback to default path).
C | float * | Result matrix: C = A^T * B in f32 format.

Usage Examples

SGEMM Dispatch in Compute Engine (Internal)

// Try the llamafile SGEMM path first, when it is compiled in.
bool handled = false;
#ifdef GGML_USE_LLAMAFILE
handled = llamafile_sgemm(&params,
    m, n, k,
    // Row strides are converted from bytes to units of each tensor's type size.
    src0->data, nb01 / ggml_type_size(src0->type),
    src1->data, nb11 / ggml_type_size(src1->type),
    (float *)dst->data, dst->nb[1] / sizeof(float),
    src0->type, src1->type);
#endif

if (!handled) {
    // Unsupported type combination, or llamafile disabled: use the default matrix multiply.
    ggml_compute_forward_mul_mat_default(&params, dst);
}
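
The params argument carries the calling thread's index and the total thread count. One simple way such values could be used to divide output tiles among threads is sketched below. This is an assumption for illustration, not a description of the scheduling in sgemm.cpp; TileRange and tiles_for_thread are hypothetical names.

```cpp
#include <algorithm>
#include <cstdint>

// Half-open range [start, end) of output tiles assigned to one thread.
struct TileRange {
    int64_t start;
    int64_t end;
};

// Hypothetical helper: splits `tiles` units of work across `nth` threads in
// contiguous chunks and returns the chunk owned by thread index `ith`.
inline TileRange tiles_for_thread(int64_t tiles, int ith, int nth) {
    const int64_t duty  = (tiles + nth - 1) / nth;               // ceil(tiles / nth)
    const int64_t start = std::min<int64_t>(duty * ith, tiles);  // clamp for surplus threads
    const int64_t end   = std::min<int64_t>(start + duty, tiles);
    return {start, end};
}
```

Each thread then processes only its own range, so no locking is needed on C as long as tiles do not overlap. Threads beyond the available work receive an empty range.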
