Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Implementation:Ggml org Ggml Sycl mmq

From Leeroopedia


Knowledge Sources
Domains ML_Infrastructure, GPU_Compute, Quantization
Last Updated 2025-05-15 12:00 GMT

Overview

Quantized matrix-matrix multiplication (MMQ) kernels for the SYCL backend, implementing tiled multiply without full dequantization.

Description

mmq.cpp implements the primary matrix multiplication path for small batch sizes with quantized weights on the SYCL backend. At 3030 lines, it is one of the largest kernel files and follows a three-phase tiled approach for each quantization type:

  • allocate_tiles: Assigns local memory tile pointers for quantized data (x_ql for quant values, x_dm for scales/deltas, x_qh for high bits, x_sc for scales). Each quant type has its own tile layout.
  • load_tiles: Loads quantized blocks from global memory into shared memory tiles, with work distributed across warps. Includes bounds checking (need_check template parameter) and block-level data extraction.
  • vec_dot_q_mul_mat: Computes partial dot products between the loaded quantized tiles and q8_1-quantized input, accumulating results per output element.

The main kernel mul_mat_q orchestrates these three phases in a tiled loop, processing mmq_y rows of output per iteration. The tile sizes and warp counts are tuned per quantization type through template parameters.

Supported quantization types include q4_0, q4_1, q5_0, q5_1, q8_0, q2_K, q3_K, q4_K, q5_K, and q6_K. The public dispatch function ggml_sycl_op_mul_mat_q selects the correct instantiation based on the source tensor's type.

The MMQ path is preferred over DMMV for batch sizes up to MMQ_MAX_BATCH_SIZE (32) because it avoids full dequantization, operating directly on quantized integer representations.

Usage

Called from the main SYCL backend during matrix multiplication when the batch size is small (up to 32) and quantized weights are used. The backend's matrix multiplication router selects MMQ when XMX/tensor cores are not being used for GEMM.

Code Reference

Source Location

  • Repository: GGML
  • File: src/ggml-sycl/mmq.cpp
  • Lines: 3030

Signatures

// Function pointer types for the three-phase pattern
typedef void (*allocate_tiles_sycl_t)(int** x_ql, sycl::half2** x_dm, int** x_qh, int** x_sc);
typedef void (*load_tiles_sycl_t)(const void* __restrict__ vx, int* __restrict__ x_ql,
    sycl::half2* __restrict__ x_dm, int* __restrict__ x_qh, int* __restrict__ x_sc,
    const int& i_offset, const int& i_max, const int& k, const int& blocks_per_row);
typedef float (*vec_dot_q_mul_mat_sycl_t)(const int* __restrict__ x_ql,
    const sycl::half2* __restrict__ x_dm, const int* __restrict__ x_qh,
    const int* __restrict__ x_sc, const int* __restrict__ y_qs,
    const sycl::half2* __restrict__ y_ms, const int& i, const int& j, const int& k);

// Per-type tile functions (q4_0 example)
template <int mmq_y>
static __dpct_inline__ void allocate_tiles_q4_0(int **x_ql, sycl::half2 **x_dm,
    int **x_qh, int **x_sc, int *tile_x_qs_q4_0, float *tile_x_d_q4_0);

template <int mmq_y, int nwarps, bool need_check>
static __dpct_inline__ void load_tiles_q4_0(const void *__restrict__ vx,
    int *__restrict__ x_ql, sycl::half2 *__restrict__ x_dm,
    int *__restrict__ x_qh, int *__restrict__ x_sc,
    const int &i_offset, const int &i_max, const int &k, const int &blocks_per_row);

// Public dispatch function
void ggml_sycl_op_mul_mat_q(
    ggml_backend_sycl_context & ctx,
    const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst,
    const char * src0_dd_i, const float * src1_ddf_i,
    const char * src1_ddq_i, float * dst_dd_i,
    const int64_t row_low, const int64_t row_high,
    const int64_t src1_ncols, const int64_t src1_padded_row_size,
    const dpct::queue_ptr & stream);

I/O Contract

Inputs

Name Type Required Description
src0_dd_i const char * Yes Quantized weight matrix data on device
src1_ddq_i const char * Yes q8_1-quantized input matrix on device
row_low, row_high int64_t Yes Row range to process (for multi-GPU splitting)
src1_ncols int64_t Yes Number of columns in the input matrix (batch size)

Outputs

Name Type Description
dst_dd_i float * Output matrix with multiplication results in f32

Usage Examples

// Called internally by the SYCL backend matrix multiplication router:
// When batch_size <= MMQ_MAX_BATCH_SIZE and weights are quantized
ggml_sycl_op_mul_mat_q(ctx, weight_tensor, input_tensor, output_tensor,
    weight_data, nullptr, input_q8_data, output_data,
    0, nrows, batch_size, padded_row_size, stream);

Related Pages

Implements Principle

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment