Implementation:Ggml org Ggml Sycl mmvq

Knowledge Sources	GGML
Domains	ML_Infrastructure, GPU_Compute, Quantization
Last Updated	2025-05-15 12:00 GMT

Overview

Quantized matrix-vector multiply (MMVQ) kernels for the SYCL backend, optimized for single-token inference with quantized weights.

Description

mmvq.cpp implements the optimized matrix-vector multiplication path for quantized weights on the SYCL backend. This is the most common inference scenario (batch=1, single token generation), making these kernels performance-critical. Two kernel variants are provided:

mul_mat_vec_q: The standard kernel where each subgroup (warp) processes one matrix row. Threads iterate over quantized blocks in the row, computing vec_dot_q dot products between quantized weight blocks and q8_1-quantized input vectors. Partial sums are reduced across the warp using dpct::permute_sub_group_by_xor shuffle operations, and thread 0 writes the final result.

mul_mat_vec_q_reorder: An optimized variant for Structure-of-Arrays (SoA) memory layouts. Instead of accessing interleaved block data, it reads separated scale and quant arrays for better memory coalescing on GPU hardware. Uses sycl::reduce_over_group for the final reduction.

The kernels are templated on quantization parameters (qk, qi, vdr) and the vec_dot function pointer, with instantiations for all supported quantization types: q4_0, q4_1, q5_0, q5_1, q8_0, q2_K, q3_K, q4_K, q5_K, q6_K, and IQ types.

The public dispatch function ggml_sycl_op_mul_mat_vec_q selects between the standard and reorder variants based on the device's optimize_feature.reorder capability and launches with appropriate work-group dimensions.

Usage

Called from the main SYCL backend during matrix-vector multiplication for single-token inference. The MMVQ path is the default for batch=1 unless GGML_SYCL_PRIORITIZE_DMMV is set, in which case DMMV is used instead. MMVQ is generally faster because it avoids full dequantization.

Code Reference

Source Location

Repository: GGML
File: src/ggml-sycl/mmvq.cpp
Lines: 1156

Signatures

// Standard quantized mat-vec kernel
template <int qk, int qi, typename block_q_t, int vdr, vec_dot_q_sycl_t vec_dot_q_sycl>
static void mul_mat_vec_q(const void * __restrict__ vx,
    const void * __restrict__ vy, float * __restrict__ dst,
    const int ncols, const int nrows, const sycl::nd_item<3> & item_ct1);

// Reordered (SoA) variant
template <typename reorder_vec_dot_q_sycl>
static void mul_mat_vec_q_reorder(const void * __restrict__ vx,
    const void * __restrict__ vy, float * __restrict__ dst,
    const int ncols, const int nrows, const sycl::nd_item<3> & nd_item);

// Public dispatch function
void ggml_sycl_op_mul_mat_vec_q(
    ggml_backend_sycl_context & ctx,
    const ggml_tensor * src0, const ggml_tensor * src1,
    ggml_tensor * dst, const char * src0_dd_i,
    const float * src1_ddf_i, const char * src1_ddq_i,
    float * dst_dd_i, const int64_t row_low, const int64_t row_high,
    const int64_t src1_ncols, const int64_t src1_padded_row_size,
    const dpct::queue_ptr & stream);

I/O Contract

Inputs

Name	Type	Required	Description
vx	const void *	Yes	Quantized weight matrix data
vy	const void *	Yes	q8_1-quantized input vector data
ncols	int	Yes	Number of columns in the weight matrix
nrows	int	Yes	Number of rows in the weight matrix

Outputs

Name	Type	Description
dst	float *	Output vector of size nrows with the matrix-vector product

Usage Examples

// Called internally during single-token inference (batch=1):
ggml_sycl_op_mul_mat_vec_q(ctx, weight_tensor, input_tensor, output_tensor,
    weight_data, nullptr, input_q8_data, output_data,
    0, nrows, 1, padded_row_size, stream);

Related Pages

Implements Principle

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment