
Implementation: GGML BLAS Backend (ggml-org/ggml)

From Leeroopedia


Metadata

Field Value
Page Type Implementation (Backend)
Knowledge Sources GGML
Domains ML_Infrastructure, Tensor_Computing, Linear_Algebra
Last Updated 2026-02-10 12:00 GMT

Overview

Implements the BLAS backend for GGML, accelerating matrix multiplication and outer product operations by delegating to vendor-optimized BLAS libraries.

Description

ggml-blas.cpp provides a complete GGML backend implementation that offloads GGML_OP_MUL_MAT and GGML_OP_OUT_PROD operations to external BLAS (Basic Linear Algebra Subprograms) libraries. The backend supports multiple BLAS vendors through compile-time selection:

  • Apple Accelerate (vecLib/BLAS)
  • Intel MKL
  • BLIS
  • NVPL (NVIDIA Performance Libraries)
  • OpenBLAS (default fallback via cblas.h)

The core computation flow for matrix multiplication works as follows:

  1. Type conversion: If src0 is not GGML_TYPE_F32, the backend dequantizes all weights into a temporary F32 work buffer using the type's to_float function. This conversion is parallelized using either OpenMP or std::async futures.
  2. BLAS dispatch: The converted (or already F32) data is passed to cblas_sgemm for single-precision general matrix multiplication. Broadcasting across batch dimensions (ne2, ne3) is handled via loops over the batch indices.
  3. Outer product: For GGML_OP_OUT_PROD, the backend calls cblas_sgemm with appropriate transpose flags, supporting both transposed and non-transposed source tensors.
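The batched dispatch in step 2 can be sketched in plain C. This is an illustrative reduction, not the backend's code: a naive loop stands in for cblas_sgemm (no-transpose times transpose, matching dst = src1 * src0^T per slice), and broadcasting is simplified to equal batch counts with the ne2/ne3 loops flattened into one index.

```c
#include <stddef.h>

// Naive stand-in for cblas_sgemm with (NoTrans, Trans) flags:
// d[m x n] = x[m x k] * y[n x k]^T, row-major (illustrative only)
static void sgemm_nt(int m, int n, int k,
                     const float *x, const float *y, float *d) {
    for (int i = 0; i < m; i++) {
        for (int j = 0; j < n; j++) {
            float sum = 0.0f;
            for (int l = 0; l < k; l++) {
                sum += x[i*k + l] * y[j*k + l];
            }
            d[i*n + j] = sum;
        }
    }
}

// Broadcast over the batch dimensions (ne2/ne3 flattened to nbatch here),
// issuing one GEMM per batch slice as the backend's loops do
static void mul_mat_batched(int nbatch, int m, int n, int k,
                            const float *x, const float *y, float *d) {
    for (int b = 0; b < nbatch; b++) {
        sgemm_nt(m, n, k,
                 x + (size_t) b * m * k,
                 y + (size_t) b * n * k,
                 d + (size_t) b * m * n);
    }
}
```

In the real backend the per-slice call goes to the vendor's cblas_sgemm, and the batch loop also handles the case where src0's batch dims are broadcast against src1's.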

The backend registers itself as an accelerator device (GGML_BACKEND_DEVICE_TYPE_ACCEL) and uses host (CPU) memory buffers. It claims support only for operations where BLAS is likely to outperform the CPU backend -- specifically when all participating matrix dimensions meet a minimum batch size of 32.
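The size gate described above can be expressed as a small predicate. The function name and argument spelling here are illustrative, not the backend's actual signature; the real check inspects the tensors' ne[] dimensions inside the device's supports_op callback.

```c
#include <stdbool.h>
#include <stdint.h>

// Hypothetical sketch of the support heuristic: claim GGML_OP_MUL_MAT
// only when every participating matrix dimension reaches the minimum
// batch size, so cblas_sgemm is likely to beat the CPU backend
static bool blas_supports_mul_mat(int64_t ne0, int64_t ne1, int64_t ne10) {
    const int64_t min_batch = 32;
    return ne0 >= min_batch && ne1 >= min_batch && ne10 >= min_batch;
}
```

Operations that fail this check fall through to another backend (typically the CPU backend) when running under the scheduler.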

Usage

Use this backend when:

  • Large matrix multiplications dominate your workload and a BLAS library is available.
  • You want to accelerate CPU-side inference by leveraging vendor-optimized SGEMM kernels.
  • Your model uses quantized weights (the backend handles automatic dequantization to F32 before BLAS calls).
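The automatic dequantization mentioned in the last point can be sketched as follows. The names here are illustrative: `to_float` stands in for the per-type conversion routine the backend looks up from the type traits, and the fixed-scale 8-bit format is a toy, not a real GGML quantization type.

```c
#include <stdint.h>
#include <stddef.h>

// Stand-in for a type's to_float routine; the fixed scale is an
// assumption for the sketch, not a real GGML quantized format
typedef void (*to_float_t)(const void *src, float *dst, int64_t n);

static void q8_to_float(const void *src, float *dst, int64_t n) {
    const int8_t *q = (const int8_t *) src;
    const float scale = 0.5f;
    for (int64_t i = 0; i < n; i++) {
        dst[i] = scale * (float) q[i];
    }
}

// Dequantize a whole weight tensor row-by-row into an F32 work buffer,
// as the backend does before calling cblas_sgemm. The row loop is the
// unit the real backend splits across OpenMP or std::async workers.
static void dequantize_rows(to_float_t to_float, const void *src,
                            float *wdata, int64_t nrows, int64_t row_len,
                            size_t row_size_bytes) {
    for (int64_t r = 0; r < nrows; r++) {
        to_float((const char *) src + r * row_size_bytes,
                 wdata + r * row_len, row_len);
    }
}
```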

Code Reference

Source Location

GGML repo, file: src/ggml-blas/ggml-blas.cpp, 518 lines.

Signature

// Backend initialization
ggml_backend_t ggml_backend_blas_init(void);

// Backend identification
bool ggml_backend_is_blas(ggml_backend_t backend);

// Thread configuration
void ggml_backend_blas_set_n_threads(ggml_backend_t backend_blas, int n_threads);

// Backend registration
ggml_backend_reg_t ggml_backend_blas_reg(void);

Import

#include "ggml-blas.h"

Dependencies

  • ggml-impl.h -- internal GGML utilities
  • ggml-blas.h -- public BLAS backend API header
  • ggml-backend-impl.h -- backend implementation interface
  • A BLAS library (Accelerate, MKL, OpenBLAS, BLIS, or NVPL)

I/O Contract

Inputs

Parameter Type Required Description
dst->src[0] ggml_tensor * Yes Weight matrix (supports F32, F16, BF16, and quantized types). Must be contiguous.
dst->src[1] ggml_tensor * Yes Input activation matrix. Must be contiguous and of type GGML_TYPE_F32.
n_threads int No Number of threads for dequantization and BLAS operations (default: GGML_DEFAULT_N_THREADS).

Outputs

Output Type Description
dst ggml_tensor * Result matrix of type GGML_TYPE_F32. For MUL_MAT: dst = src1 * src0^T. For OUT_PROD: dst = src1^T * src0.
Return value ggml_backend_t From ggml_backend_blas_init(): pointer to the initialized BLAS backend instance.
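The shape relationship implied by the MUL_MAT contract above can be checked with a tiny helper. The struct and function names are illustrative, not GGML API; only the dimension rule itself comes from the contract.

```c
#include <assert.h>
#include <stdint.h>

// Minimal stand-in for a tensor's first two dims (illustrative only)
typedef struct { int64_t ne[2]; } dims2;

// For GGML_OP_MUL_MAT, dst = src1 * src0^T: the shared reduction
// dimension is src0.ne[0] == src1.ne[0], and the result shape is
// [src0.ne[1], src1.ne[1]] in GGML's ne ordering
static dims2 mul_mat_dst_dims(dims2 src0, dims2 src1) {
    assert(src0.ne[0] == src1.ne[0]);  // inner dims must match
    dims2 dst = { { src0.ne[1], src1.ne[1] } };
    return dst;
}
```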

Usage Examples

Initializing the BLAS Backend

#include "ggml-blas.h"

// Create and configure the BLAS backend
ggml_backend_t blas = ggml_backend_blas_init();
ggml_backend_blas_set_n_threads(blas, 8);

// Use with a scheduler for automatic operation offloading;
// cpu_backend is assumed to exist, e.g. from ggml_backend_cpu_init()
ggml_backend_t backends[] = { blas, cpu_backend };
ggml_backend_sched_t sched = ggml_backend_sched_new(backends, NULL, 2, max_nodes, false);

Checking Backend Type

if (ggml_backend_is_blas(backend)) {
    ggml_backend_blas_set_n_threads(backend, 4);
}
