Implementation: GGML BLAS Backend
Metadata
| Field | Value |
|---|---|
| Page Type | Implementation (Backend) |
| Knowledge Sources | GGML |
| Domains | ML_Infrastructure, Tensor_Computing, Linear_Algebra |
| Last Updated | 2026-02-10 12:00 GMT |
Overview
Implements the BLAS backend for GGML, accelerating matrix multiplication and outer product operations by delegating to vendor-optimized BLAS libraries.
Description
ggml-blas.cpp provides a complete GGML backend implementation that offloads GGML_OP_MUL_MAT and GGML_OP_OUT_PROD operations to external BLAS (Basic Linear Algebra Subprograms) libraries. The backend supports multiple BLAS vendors through compile-time selection:
- Apple Accelerate (vecLib/BLAS)
- Intel MKL
- BLIS
- NVPL (NVIDIA Performance Libraries)
- OpenBLAS (default fallback via `cblas.h`)
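A sketch of how this compile-time selection typically looks, assuming macro names in the `GGML_BLAS_USE_*` style (the exact macro and header names should be checked against the build system and source):

```c
// Hypothetical sketch of vendor selection; header names are assumptions.
#if defined(GGML_BLAS_USE_ACCELERATE)
#   include <Accelerate/Accelerate.h>
#elif defined(GGML_BLAS_USE_MKL)
#   include <mkl.h>
#elif defined(GGML_BLAS_USE_BLIS)
#   include <blis.h>
#elif defined(GGML_BLAS_USE_NVPL)
#   include <nvpl_blas.h>
#else
#   include <cblas.h>  // OpenBLAS or any other CBLAS-conformant library
#endif
```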
The core computation flow for matrix multiplication works as follows:
1. **Type conversion**: If `src0` is not `GGML_TYPE_F32`, the backend dequantizes all weights into a temporary F32 work buffer using the type's `to_float` function. This conversion is parallelized using either OpenMP or `std::async` futures.
2. **BLAS dispatch**: The converted (or already F32) data is passed to `cblas_sgemm` for single-precision general matrix multiplication. Broadcasting across batch dimensions (`ne2`, `ne3`) is handled via loops over the batch indices.
3. **Outer product**: For `GGML_OP_OUT_PROD`, the backend calls `cblas_sgemm` with appropriate transpose flags, supporting both transposed and non-transposed source tensors.
The backend registers itself as an accelerator device (GGML_BACKEND_DEVICE_TYPE_ACCEL) and uses host (CPU) memory buffers. It only claims support for operations where BLAS is likely faster than the CPU backend -- specifically when all matrix dimensions exceed a minimum batch size of 32.
Usage
Use this backend when:
- Large matrix multiplications dominate your workload and a BLAS library is available.
- You want to accelerate CPU-side inference by leveraging vendor-optimized SGEMM kernels.
- Your model uses quantized weights (the backend handles automatic dequantization to F32 before BLAS calls).
Code Reference
Source Location
GGML repo, file: src/ggml-blas/ggml-blas.cpp, 518 lines.
Signature
```c
// Backend initialization
ggml_backend_t ggml_backend_blas_init(void);

// Backend identification
bool ggml_backend_is_blas(ggml_backend_t backend);

// Thread configuration
void ggml_backend_blas_set_n_threads(ggml_backend_t backend_blas, int n_threads);

// Backend registration
ggml_backend_reg_t ggml_backend_blas_reg(void);
```
Import
```c
#include "ggml-blas.h"
```
Dependencies
- `ggml-impl.h` -- internal GGML utilities
- `ggml-blas.h` -- public BLAS backend API header
- `ggml-backend-impl.h` -- backend implementation interface
- A BLAS library (Accelerate, MKL, OpenBLAS, BLIS, or NVPL)
I/O Contract
Inputs
| Parameter | Type | Required | Description |
|---|---|---|---|
| `dst->src[0]` | `ggml_tensor *` | Yes | Weight matrix (supports F32, F16, BF16, and quantized types). Must be contiguous. |
| `dst->src[1]` | `ggml_tensor *` | Yes | Input activation matrix. Must be contiguous and of type `GGML_TYPE_F32`. |
| `n_threads` | `int` | No | Number of threads for dequantization and BLAS operations (default: `GGML_DEFAULT_N_THREADS`). |
Outputs
| Output | Type | Description |
|---|---|---|
| `dst` | `ggml_tensor *` | Result matrix of type `GGML_TYPE_F32`. For MUL_MAT: dst = src1 * src0^T. For OUT_PROD: dst = src1^T * src0. |
| Return value | `ggml_backend_t` | From `ggml_backend_blas_init()`: pointer to the initialized BLAS backend instance. |
Usage Examples
Initializing the BLAS Backend
```c
#include "ggml-blas.h"

// Create and configure the BLAS backend
ggml_backend_t blas = ggml_backend_blas_init();
ggml_backend_blas_set_n_threads(blas, 8);

// Use with a scheduler for automatic operation offloading
// (cpu_backend and max_nodes are assumed to be set up elsewhere)
ggml_backend_t backends[] = { blas, cpu_backend };
ggml_backend_sched_t sched = ggml_backend_sched_new(backends, NULL, 2, max_nodes, false);
```
Checking Backend Type
```c
if (ggml_backend_is_blas(backend)) {
    ggml_backend_blas_set_n_threads(backend, 4);
}
```