Implementation:Ggml org Ggml Sycl dmmv

Knowledge Sources	GGML
Domains	ML_Infrastructure, GPU_Compute, Quantization
Last Updated	2025-05-15 12:00 GMT

Overview

Dequantize-multiply-matrix-vector (DMMV) kernels providing an alternative matrix-vector multiplication path that dequantizes weights on-the-fly during computation.

Description

dmmv.cpp implements the DMMV approach to quantized matrix-vector multiplication for the SYCL backend. Unlike the MMVQ path, which computes dot products directly on quantized blocks, DMMV dequantizes each weight element to floating point before multiplying it with the input vector. The core kernel, dequantize_mul_mat_vec, works as follows:

1. Each row of work-items within a work-group processes one matrix row, with threads iterating over columns in strides of 2 * GGML_SYCL_DMMV_X (64 elements per iteration at the default setting).
2. In each iteration, a thread dequantizes a pair of values using the appropriate dequantize_kernel function, multiplies them by the corresponding input vector elements, and accumulates the products into a partial sum.
3. After all columns are processed, the partial sums are reduced across the sub-group (warp) using dpct::permute_sub_group_by_xor shuffle operations.
4. Thread 0 writes the final sum to the output vector.
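
The steps above can be illustrated with a CPU-side emulation of one matrix row. This is a minimal sketch rather than the kernel itself: BlockQ4 is a hypothetical simplified q4_0-style block that stores its scale as a float, the sub-group is emulated with a plain array of 32 partial sums, and the names dequantize_pair and dmmv_row are invented for illustration. The XOR butterfly loop stands in for the shuffle-based reduction.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

constexpr int QK          = 32;  // weights per quantized block
constexpr int QR          = 2;   // weights packed into one byte
constexpr int SUB_GROUP   = 32;  // emulated sub-group width (assumption)
constexpr int ITER_STRIDE = 64;  // 2 * GGML_SYCL_DMMV_X with X = 32

// Hypothetical simplified block: one float scale plus 32 packed 4-bit weights.
struct BlockQ4 {
    float   d;
    uint8_t qs[QK / 2];
};

// Dequantize the two weights stored in byte iqs:
// the low nibble is element iqs, the high nibble is element iqs + QK/2.
inline void dequantize_pair(const BlockQ4 & b, int iqs, float & lo, float & hi) {
    lo = (static_cast<int>(b.qs[iqs] & 0x0F) - 8) * b.d;
    hi = (static_cast<int>(b.qs[iqs] >> 4) - 8) * b.d;
}

// Emulates the work done by one sub-group for a single matrix row.
inline float dmmv_row(const std::vector<BlockQ4> & w, const std::vector<float> & y,
                      int row, int ncols) {
    assert(ncols % ITER_STRIDE == 0);
    constexpr int vals_per_iter = ITER_STRIDE / SUB_GROUP;  // 2 values per thread
    constexpr int y_offset      = QK / 2;                   // distance between pair members
    float partial[SUB_GROUP] = {};

    // Steps 1 and 2: each emulated thread walks the columns with a fixed stride,
    // dequantizes a pair of weights, and accumulates into its own partial sum.
    for (int tid = 0; tid < SUB_GROUP; ++tid) {
        for (int i = 0; i < ncols; i += ITER_STRIDE) {
            const int col  = i + vals_per_iter * tid;
            const int ib   = (row * ncols + col) / QK;  // block index
            const int iqs  = (col % QK) / QR;           // byte index inside the block
            const int iybs = col - col % QK;            // first y index of the block
            float lo, hi;
            dequantize_pair(w[ib], iqs, lo, hi);
            partial[tid] += lo * y[iybs + iqs];
            partial[tid] += hi * y[iybs + iqs + y_offset];
        }
    }

    // Step 3: XOR butterfly reduction, mirroring a shuffle-by-xor exchange.
    for (int mask = SUB_GROUP / 2; mask > 0; mask >>= 1) {
        float next[SUB_GROUP];
        for (int tid = 0; tid < SUB_GROUP; ++tid) {
            next[tid] = partial[tid] + partial[tid ^ mask];
        }
        for (int tid = 0; tid < SUB_GROUP; ++tid) {
            partial[tid] = next[tid];
        }
    }

    // Step 4: thread 0 holds the full sum.
    return partial[0];
}
```

With 32 threads and two values per thread, one iteration covers exactly two 32-element blocks, so every column is visited once.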

The file also includes dequantize_mul_mat_vec_reorder for the Structure-of-Arrays memory layout variant, and helper conversion functions (convert_f16, convert_f32) for non-quantized inputs.
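
To make the Structure-of-Arrays idea concrete, the sketch below repacks an array of blocks so that all packed weights come first and all scales follow. The exact byte layout used by the SYCL backend is not stated here, so this layout is an assumption; BlockQ4, reorder_to_soa, and dequantize_soa are hypothetical names, and the scale is stored as a float for simplicity.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

constexpr int QK = 32;  // weights per block

// Array-of-Structures form: scale and packed weights interleaved per block.
struct BlockQ4 {
    float   d;
    uint8_t qs[QK / 2];
};

// Assumed SoA layout: [qs of block 0][qs of block 1]...[d of block 0][d of block 1]...
inline std::vector<uint8_t> reorder_to_soa(const std::vector<BlockQ4> & blocks) {
    const size_t n        = blocks.size();
    const size_t qs_bytes = n * (QK / 2);
    std::vector<uint8_t> out(qs_bytes + n * sizeof(float));
    for (size_t ib = 0; ib < n; ++ib) {
        std::memcpy(out.data() + ib * (QK / 2), blocks[ib].qs, QK / 2);
        std::memcpy(out.data() + qs_bytes + ib * sizeof(float), &blocks[ib].d, sizeof(float));
    }
    return out;
}

// Dequantize element k of block ib straight from the SoA buffer.
inline float dequantize_soa(const std::vector<uint8_t> & buf, size_t nblocks, size_t ib, int k) {
    assert(ib < nblocks && k >= 0 && k < QK);
    const uint8_t * qs = buf.data() + ib * (QK / 2);
    float d;
    std::memcpy(&d, buf.data() + nblocks * (QK / 2) + ib * sizeof(float), sizeof(float));
    const int q = k < QK / 2 ? (qs[k] & 0x0F) : (qs[k - QK / 2] >> 4);
    return static_cast<float>(q - 8) * d;
}
```

Keeping the packed weights contiguous means neighbouring threads read neighbouring bytes, which is the usual motivation for a layout of this kind.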

The public entry point ggml_sycl_op_dequantize_mul_mat_vec dispatches to the correct template instantiation based on the quantization type (q4_0 through q6_K, plus f16/f32).
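
The dispatch pattern can be sketched as a switch that maps a runtime type tag to a compile-time template instantiation. The code below is illustrative only: WeightType, LaunchParams, make_params, and dispatch are hypothetical names, only three types are shown, and the (qk, qr) pairs and the y_offset rule are assumptions based on the usual block sizes for these formats rather than values read from dmmv.cpp.

```cpp
#include <cassert>

// Hypothetical subset of the supported weight types.
enum class WeightType { Q4_0, Q8_0, F16 };

// Parameters that would be baked into a template instantiation.
struct LaunchParams {
    int qk;        // weights per block
    int qr;        // weights packed per storage element
    int y_offset;  // distance between the two values of a dequantized pair
};

template <int qk, int qr>
LaunchParams make_params() {
    // Assumption: unpacked formats place the pair side by side,
    // packed formats place it half a block apart.
    return { qk, qr, qr == 1 ? 1 : qk / 2 };
}

// Select the instantiation from the runtime type tag.
inline LaunchParams dispatch(WeightType type) {
    switch (type) {
        case WeightType::Q4_0: return make_params<32, 2>();
        case WeightType::Q8_0: return make_params<32, 1>();
        case WeightType::F16:  return make_params<1, 1>();
    }
    assert(false && "unsupported weight type");
    return { 0, 0, 0 };
}
```

Resolving qk and qr at compile time lets the compiler fold the index arithmetic in the inner loop into constants.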

Usage

Called from the main SYCL backend during matrix-vector multiplication when DMMV is selected over MMVQ. This path can be preferred via the GGML_SYCL_PRIORITIZE_DMMV environment variable for certain hardware configurations where full dequantization is faster than quantized dot products.
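
As a rough illustration of how an environment variable can steer this choice, the sketch below reads an integer flag and uses it as a tie-breaker when both paths are available. Only the variable name GGML_SYCL_PRIORITIZE_DMMV comes from this page; read_env_int, MatVecPath, and choose_path are hypothetical, and the backend's real selection logic takes more conditions into account.

```cpp
#include <cassert>
#include <cstdlib>

// Read an integer environment variable, falling back to a default
// when it is unset, empty, or not a number.
inline int read_env_int(const char * name, int default_value) {
    const char * value = std::getenv(name);
    if (value == nullptr || *value == '\0') {
        return default_value;
    }
    char * end = nullptr;
    const long parsed = std::strtol(value, &end, 10);
    return end == value ? default_value : static_cast<int>(parsed);
}

enum class MatVecPath { DMMV, MMVQ };

// Hypothetical tie-breaker: the flag only matters when both paths can run.
inline MatVecPath choose_path(bool dmmv_ok, bool mmvq_ok, bool prioritize_dmmv) {
    assert(dmmv_ok || mmvq_ok);
    if (dmmv_ok && mmvq_ok) {
        return prioritize_dmmv ? MatVecPath::DMMV : MatVecPath::MMVQ;
    }
    return dmmv_ok ? MatVecPath::DMMV : MatVecPath::MMVQ;
}
```

A caller would typically read the flag once at start-up, for example `read_env_int("GGML_SYCL_PRIORITIZE_DMMV", 0) != 0`, and pass the result to every selection.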

Code Reference

Source Location

Repository: GGML
File: src/ggml-sycl/dmmv.cpp
Lines: 1162

Signatures

// Core DMMV kernel template
template <int qk, int qr, dequantize_kernel_t dequantize_kernel>
static void dequantize_mul_mat_vec(const void * __restrict__ vx,
    const dfloat * __restrict__ y, float * __restrict__ dst,
    const int ncols, const int nrows,
    const sycl::nd_item<3> &item_ct1);

// Reordered variant
template <int qk, int qr, dequantize_kernel_t_reorder dequantize_kernel_reorder>
static void dequantize_mul_mat_vec_reorder(const void * __restrict__ vx,
    const dfloat * __restrict__ y, float * __restrict__ dst,
    const int ncols, const int nrows,
    const sycl::nd_item<3> &item_ct1);

// Public dispatch function
void ggml_sycl_op_dequantize_mul_mat_vec(
    ggml_backend_sycl_context & ctx,
    const ggml_tensor * src0, const ggml_tensor * src1,
    ggml_tensor * dst, const char * src0_dd_i,
    const float * src1_ddf_i, const char * src1_ddq_i,
    float * dst_dd_i, const int64_t row_low, const int64_t row_high,
    const int64_t src1_ncols, const int64_t src1_padded_row_size,
    const dpct::queue_ptr &stream);

I/O Contract

Inputs

Name	Type	Required	Description
vx	const void *	Yes	Quantized weight matrix data
y	const dfloat *	Yes	Input vector (f32 or f16)
ncols	int	Yes	Number of columns in the weight matrix
nrows	int	Yes	Number of rows in the weight matrix

Outputs

Name	Type	Description
dst	float *	Output vector of size nrows with matrix-vector product results

Usage Examples

// Called internally during quantized matrix-vector multiplication:
ggml_sycl_op_dequantize_mul_mat_vec(
    ctx, weight_tensor, input_tensor, output_tensor,
    weight_data, input_data, nullptr,
    output_data, 0, nrows, 1, padded_row_size, stream);
