Implementation:Ggml org Ggml Sycl mmq
| Knowledge Sources | |
|---|---|
| Domains | ML_Infrastructure, GPU_Compute, Quantization |
| Last Updated | 2025-05-15 12:00 GMT |
Overview
Quantized matrix-matrix multiplication (MMQ) kernels for the SYCL backend, implementing tiled multiply without full dequantization.
Description
mmq.cpp implements the primary matrix multiplication path for small batch sizes with quantized weights on the SYCL backend. At 3030 lines, it is one of the largest kernel files and follows a three-phase tiled approach for each quantization type:
- allocate_tiles: Assigns local memory tile pointers for quantized data (x_ql for quant values, x_dm for scales/deltas, x_qh for high bits, x_sc for scales). Each quant type has its own tile layout.
- load_tiles: Loads quantized blocks from global memory into shared memory tiles, with work distributed across warps. Includes bounds checking (need_check template parameter) and block-level data extraction.
- vec_dot_q_mul_mat: Computes partial dot products between the loaded quantized tiles and q8_1-quantized input, accumulating results per output element.
The main kernel mul_mat_q orchestrates these three phases in a tiled loop, processing mmq_y rows of output per iteration. The tile sizes and warp counts are tuned per quantization type through template parameters.
Supported quantization types include q4_0, q4_1, q5_0, q5_1, q8_0, q2_K, q3_K, q4_K, q5_K, and q6_K. The public dispatch function ggml_sycl_op_mul_mat_q selects the correct instantiation based on the source tensor's type.
The MMQ path is preferred over DMMV for batch sizes up to MMQ_MAX_BATCH_SIZE (32) because it avoids full dequantization, operating directly on quantized integer representations.
Usage
Called from the main SYCL backend during matrix multiplication when the batch size is small (up to 32) and quantized weights are used. The backend's matrix multiplication router selects MMQ when XMX/tensor cores are not being used for GEMM.
Code Reference
Source Location
- Repository: GGML
- File: src/ggml-sycl/mmq.cpp
- Lines: 3030
Signatures
// Function pointer types for the three-phase pattern
typedef void (*allocate_tiles_sycl_t)(int** x_ql, sycl::half2** x_dm, int** x_qh, int** x_sc);
typedef void (*load_tiles_sycl_t)(const void* __restrict__ vx, int* __restrict__ x_ql,
sycl::half2* __restrict__ x_dm, int* __restrict__ x_qh, int* __restrict__ x_sc,
const int& i_offset, const int& i_max, const int& k, const int& blocks_per_row);
typedef float (*vec_dot_q_mul_mat_sycl_t)(const int* __restrict__ x_ql,
const sycl::half2* __restrict__ x_dm, const int* __restrict__ x_qh,
const int* __restrict__ x_sc, const int* __restrict__ y_qs,
const sycl::half2* __restrict__ y_ms, const int& i, const int& j, const int& k);
// Per-type tile functions (q4_0 example)
template <int mmq_y>
static __dpct_inline__ void allocate_tiles_q4_0(int **x_ql, sycl::half2 **x_dm,
int **x_qh, int **x_sc, int *tile_x_qs_q4_0, float *tile_x_d_q4_0);
template <int mmq_y, int nwarps, bool need_check>
static __dpct_inline__ void load_tiles_q4_0(const void *__restrict__ vx,
int *__restrict__ x_ql, sycl::half2 *__restrict__ x_dm,
int *__restrict__ x_qh, int *__restrict__ x_sc,
const int &i_offset, const int &i_max, const int &k, const int &blocks_per_row);
// Public dispatch function
void ggml_sycl_op_mul_mat_q(
ggml_backend_sycl_context & ctx,
const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst,
const char * src0_dd_i, const float * src1_ddf_i,
const char * src1_ddq_i, float * dst_dd_i,
const int64_t row_low, const int64_t row_high,
const int64_t src1_ncols, const int64_t src1_padded_row_size,
const dpct::queue_ptr & stream);
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| src0_dd_i | const char * | Yes | Quantized weight matrix data on device |
| src1_ddq_i | const char * | Yes | q8_1-quantized input matrix on device |
| row_low, row_high | int64_t | Yes | Row range to process (for multi-GPU splitting) |
| src1_ncols | int64_t | Yes | Number of columns in the input matrix (batch size) |
Outputs
| Name | Type | Description |
|---|---|---|
| dst_dd_i | float * | Output matrix with multiplication results in f32 |
Usage Examples
// Called internally by the SYCL backend matrix multiplication router:
// When batch_size <= MMQ_MAX_BATCH_SIZE and weights are quantized
ggml_sycl_op_mul_mat_q(ctx, weight_tensor, input_tensor, output_tensor,
weight_data, nullptr, input_q8_data, output_data,
0, nrows, batch_size, padded_row_size, stream);