Implementation: GGML BLAS Backend
Metadata
| Field | Value |
|---|---|
| Page Type | Implementation (Backend) |
| Knowledge Sources | GGML |
| Domains | ML_Infrastructure, Tensor_Computing, Linear_Algebra |
| Last Updated | 2026-02-10 12:00 GMT |
Overview
Implements the BLAS backend for GGML, accelerating matrix multiplication and outer product operations by delegating to vendor-optimized BLAS libraries.
Description
ggml-blas.cpp provides a complete GGML backend implementation that offloads GGML_OP_MUL_MAT and GGML_OP_OUT_PROD operations to external BLAS (Basic Linear Algebra Subprograms) libraries. The backend supports multiple BLAS vendors through compile-time selection:
- Apple Accelerate (vecLib/BLAS)
- Intel MKL
- BLIS
- NVPL (NVIDIA Performance Libraries)
- OpenBLAS (default fallback via `cblas.h`)
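A sketch of how this compile-time selection typically looks, assuming macro names in the `GGML_BLAS_USE_*` style (the exact macro and header names should be checked against the build system and source):

```c
// Hypothetical sketch of vendor selection; header names are assumptions.
#if defined(GGML_BLAS_USE_ACCELERATE)
#   include <Accelerate/Accelerate.h>
#elif defined(GGML_BLAS_USE_MKL)
#   include <mkl.h>
#elif defined(GGML_BLAS_USE_BLIS)
#   include <blis.h>
#elif defined(GGML_BLAS_USE_NVPL)
#   include <nvpl_blas.h>
#else
#   include <cblas.h>  // OpenBLAS or any other CBLAS-conformant library
#endif
```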
The core computation flow for matrix multiplication works as follows:
1. **Type conversion**: If `src0` is not `GGML_TYPE_F32`, the backend dequantizes all weights into a temporary F32 work buffer using the type's `to_float` function. This conversion is parallelized using either OpenMP or `std::async` futures.
2. **BLAS dispatch**: The converted (or already F32) data is passed to `cblas_sgemm` for single-precision general matrix multiplication. Broadcasting across batch dimensions (`ne2`, `ne3`) is handled via loops over the batch indices.
3. **Outer product**: For `GGML_OP_OUT_PROD`, the backend calls `cblas_sgemm` with appropriate transpose flags, supporting both transposed and non-transposed source tensors.
The backend registers itself as an accelerator device (GGML_BACKEND_DEVICE_TYPE_ACCEL) and uses host (CPU) memory buffers. It only claims support for operations where BLAS is likely faster than the CPU backend -- specifically when all matrix dimensions exceed a minimum batch size of 32.
Usage
Use this backend when:
- Large matrix multiplications dominate your workload and a BLAS library is available.
- You want to accelerate CPU-side inference by leveraging vendor-optimized SGEMM kernels.
- Your model uses quantized weights (the backend handles automatic dequantization to F32 before BLAS calls).
Code Reference
Source Location
GGML repo, file: src/ggml-blas/ggml-blas.cpp, 518 lines.
Signature
```c
// Backend initialization
ggml_backend_t ggml_backend_blas_init(void);

// Backend identification
bool ggml_backend_is_blas(ggml_backend_t backend);

// Thread configuration
void ggml_backend_blas_set_n_threads(ggml_backend_t backend_blas, int n_threads);

// Backend registration
ggml_backend_reg_t ggml_backend_blas_reg(void);
```
Import
```c
#include "ggml-blas.h"
```
Dependencies
- `ggml-impl.h` -- internal GGML utilities
- `ggml-blas.h` -- public BLAS backend API header
- `ggml-backend-impl.h` -- backend implementation interface
- A BLAS library (Accelerate, MKL, OpenBLAS, BLIS, or NVPL)
I/O Contract
Inputs
| Parameter | Type | Required | Description |
|---|---|---|---|
| `dst->src[0]` | `ggml_tensor *` | Yes | Weight matrix (supports F32, F16, BF16, and quantized types). Must be contiguous. |
| `dst->src[1]` | `ggml_tensor *` | Yes | Input activation matrix. Must be contiguous and of type `GGML_TYPE_F32`. |
| `n_threads` | `int` | No | Number of threads for dequantization and BLAS operations (default: `GGML_DEFAULT_N_THREADS`). |
Outputs
| Output | Type | Description |
|---|---|---|
| `dst` | `ggml_tensor *` | Result matrix of type `GGML_TYPE_F32`. For MUL_MAT: dst = src1 * src0^T. For OUT_PROD: dst = src1^T * src0. |
| Return value | `ggml_backend_t` | From `ggml_backend_blas_init()`: pointer to the initialized BLAS backend instance. |
Usage Examples
Initializing the BLAS Backend
```c
#include "ggml-blas.h"

// Create and configure the BLAS backend
ggml_backend_t blas = ggml_backend_blas_init();
ggml_backend_blas_set_n_threads(blas, 8);

// Use with a scheduler for automatic operation offloading
// (cpu_backend and max_nodes are assumed to be set up elsewhere)
ggml_backend_t backends[] = { blas, cpu_backend };
ggml_backend_sched_t sched = ggml_backend_sched_new(backends, NULL, 2, max_nodes, false);
```
Checking Backend Type
```c
if (ggml_backend_is_blas(backend)) {
    ggml_backend_blas_set_n_threads(backend, 4);
}
```