Implementation:Ggml org Ggml Sycl mmvq
| Knowledge Sources | |
|---|---|
| Domains | ML_Infrastructure, GPU_Compute, Quantization |
| Last Updated | 2025-05-15 12:00 GMT |
Overview
Quantized matrix-vector multiply (MMVQ) kernels for the SYCL backend, optimized for single-token inference with quantized weights.
Description
mmvq.cpp implements the optimized matrix-vector multiplication path for quantized weights on the SYCL backend. This is the most common inference scenario (batch=1, single token generation), making these kernels performance-critical. Two kernel variants are provided:
- mul_mat_vec_q: The standard kernel where each subgroup (warp) processes one matrix row. Threads iterate over quantized blocks in the row, computing vec_dot_q dot products between quantized weight blocks and q8_1-quantized input vectors. Partial sums are reduced across the warp using dpct::permute_sub_group_by_xor shuffle operations, and thread 0 writes the final result.
- mul_mat_vec_q_reorder: An optimized variant for Structure-of-Arrays (SoA) memory layouts. Instead of accessing interleaved block data, it reads separated scale and quant arrays for better memory coalescing on GPU hardware. Uses sycl::reduce_over_group for the final reduction.
The kernels are templated on quantization parameters (qk, qi, vdr) and the vec_dot function pointer, with instantiations for all supported quantization types: q4_0, q4_1, q5_0, q5_1, q8_0, q2_K, q3_K, q4_K, q5_K, q6_K, and IQ types.
The public dispatch function ggml_sycl_op_mul_mat_vec_q selects between the standard and reorder variants based on the device's optimize_feature.reorder capability and launches with appropriate work-group dimensions.
Usage
Called from the main SYCL backend during matrix-vector multiplication for single-token inference. The MMVQ path is the default for batch=1 unless GGML_SYCL_PRIORITIZE_DMMV is set, in which case DMMV is used instead. MMVQ is generally faster because it avoids full dequantization.
Code Reference
Source Location
- Repository: GGML
- File: src/ggml-sycl/mmvq.cpp
- Lines: 1156
Signatures
// Standard quantized mat-vec kernel
template <int qk, int qi, typename block_q_t, int vdr, vec_dot_q_sycl_t vec_dot_q_sycl>
static void mul_mat_vec_q(const void * __restrict__ vx,
const void * __restrict__ vy, float * __restrict__ dst,
const int ncols, const int nrows, const sycl::nd_item<3> & item_ct1);
// Reordered (SoA) variant
template <typename reorder_vec_dot_q_sycl>
static void mul_mat_vec_q_reorder(const void * __restrict__ vx,
const void * __restrict__ vy, float * __restrict__ dst,
const int ncols, const int nrows, const sycl::nd_item<3> & nd_item);
// Public dispatch function
void ggml_sycl_op_mul_mat_vec_q(
ggml_backend_sycl_context & ctx,
const ggml_tensor * src0, const ggml_tensor * src1,
ggml_tensor * dst, const char * src0_dd_i,
const float * src1_ddf_i, const char * src1_ddq_i,
float * dst_dd_i, const int64_t row_low, const int64_t row_high,
const int64_t src1_ncols, const int64_t src1_padded_row_size,
const dpct::queue_ptr & stream);
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| vx | const void * | Yes | Quantized weight matrix data |
| vy | const void * | Yes | q8_1-quantized input vector data |
| ncols | int | Yes | Number of columns in the weight matrix |
| nrows | int | Yes | Number of rows in the weight matrix |
Outputs
| Name | Type | Description |
|---|---|---|
| dst | float * | Output vector of size nrows with the matrix-vector product |
Usage Examples
// Called internally during single-token inference (batch=1):
ggml_sycl_op_mul_mat_vec_q(ctx, weight_tensor, input_tensor, output_tensor,
weight_data, nullptr, input_q8_data, output_data,
0, nrows, 1, padded_row_size, stream);