
Implementation:Ggml org Ggml Hexagon matmul ops

From Leeroopedia


Implementation Metadata
File Name: src/ggml-hexagon/htp/matmul-ops.c
Repository: ggml-org/ggml
Lines: 2476
Language: C
Domain Tags: ML_Infrastructure, DSP_Computing, Matrix_Operations
Status: Active
Last Updated: 2025-05-15 12:00 GMT
Knowledge Sources: ggml-org/ggml repository

Overview

matmul-ops.c is the DSP-side implementation of the matrix multiplication (mul_mat) and mul_mat_id operations for the Hexagon HVX vector processor. At 2476 lines, it is the largest and most performance-critical operation file in the Hexagon backend, because matrix multiplication dominates inference compute time. The file also supports quantized formats (Q4_0, Q8_0, MXFP4) for memory-efficient inference.

Description

The file defines htp_matmul_type with function pointers for vec_dot and vec_dot_rx2 (dual-row dot product) variants. It uses HVX vdelta control tables for value replication across vector lanes:

  • repl_1x_f32 -- Replicate first FP32 value across all 32 elements
  • repl_4x_f32 -- Replicate first 4 FP32 values across lanes
  • repl_interleave_8x_f32 -- Replicate and interleave 8 values
  • repl_1x_f16, repl_2x_f16 -- FP16 value replication
  • expand_x32_e8m0 -- Expand 32 e8m0 values to uint32
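As a rough scalar picture of two of the lane layouts listed above, the sketch below emulates repl_1x_f32 and expand_x32_e8m0 with plain loops. It is not the HVX code: the real file drives vdelta permutes with control tables, and every sketch_ name here is hypothetical. The final helper, which turns a widened exponent into an FP32 scale of 2^(e-127), is an assumption about how the expanded values are used, based on the usual E8M0 definition.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_HVX_BYTES 128                     /* assumed HVX vector width in bytes */
#define SKETCH_F32_LANES (SKETCH_HVX_BYTES / 4)  /* 32 FP32 lanes per vector */

/* Scalar picture of repl_1x_f32: lane 0 is copied into every FP32 lane. */
static void sketch_repl_1x_f32(const float *src, float *dst) {
    for (int i = 0; i < SKETCH_F32_LANES; i++) {
        dst[i] = src[0];
    }
}

/* Scalar picture of expand_x32_e8m0: widen 32 one-byte exponents to 32-bit lanes. */
static void sketch_expand_x32_e8m0(const uint8_t *src, uint32_t *dst) {
    for (int i = 0; i < 32; i++) {
        dst[i] = src[i];
    }
}

/* Assumed follow-up step: interpret a widened exponent e as the scale 2^(e-127). */
static float sketch_e8m0_to_f32(uint32_t e) {
    assert(e < 255);  /* 0xFF encodes NaN in E8M0 and is not handled here */
    /* e == 0 needs a subnormal bit pattern; otherwise e goes straight into the exponent field. */
    uint32_t bits = (e == 0) ? 0x00400000u : (e << 23);
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}
```

Widening the exponents first means the shift into the FP32 exponent field can be applied to all 32 lanes at once, which is presumably why the expansion is done as a single permute.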

Supported type combinations include the floating-point pairs F32xF32, F32xF16, and F16xF16, plus the quantized formats Q4_0, Q8_0, and MXFP4. Scratchpad tiling uses row counts set by compile-time macros (MM_SPAD_SRC0_NROWS=16, MM_SPAD_SRC1_NROWS=16, MM_SPAD_DST_NROWS=2).
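The tiling idea can be illustrated with a minimal FP32-only sketch: rows of src0 and src1 are staged into small local buffers in blocks of 16 before the dot products run. This is an assumption-laden illustration, not the file's code. The function sketch_tiled_matmul, the SKETCH_MAX_K cap, and the use of memcpy in place of the DSP's real data movement are all hypothetical, and MM_SPAD_DST_NROWS is left out because the page does not say how it is used.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MM_SPAD_SRC0_NROWS 16
#define MM_SPAD_SRC1_NROWS 16
#define SKETCH_MAX_K 256  /* hypothetical cap so a tile fits in a fixed buffer */

static int sketch_min(int a, int b) { return a < b ? a : b; }

/* Computes dst[n * m_rows + m] = dot(src0 row m, src1 row n), all FP32, row length k. */
static void sketch_tiled_matmul(const float *src0, int m_rows,
                                const float *src1, int n_rows,
                                int k, float *dst) {
    /* Stand-ins for the on-chip scratchpad areas. */
    static float spad0[MM_SPAD_SRC0_NROWS * SKETCH_MAX_K];
    static float spad1[MM_SPAD_SRC1_NROWS * SKETCH_MAX_K];
    assert(k > 0 && k <= SKETCH_MAX_K);

    for (int n0 = 0; n0 < n_rows; n0 += MM_SPAD_SRC1_NROWS) {
        int nn = sketch_min(MM_SPAD_SRC1_NROWS, n_rows - n0);
        /* Stage one block of src1 rows. */
        memcpy(spad1, src1 + (size_t)n0 * k, (size_t)nn * k * sizeof(float));

        for (int m0 = 0; m0 < m_rows; m0 += MM_SPAD_SRC0_NROWS) {
            int mm = sketch_min(MM_SPAD_SRC0_NROWS, m_rows - m0);
            /* Stage one block of src0 rows. */
            memcpy(spad0, src0 + (size_t)m0 * k, (size_t)mm * k * sizeof(float));

            /* Every staged src0 row is multiplied against every staged src1 row. */
            for (int n = 0; n < nn; n++) {
                for (int m = 0; m < mm; m++) {
                    float acc = 0.0f;
                    for (int i = 0; i < k; i++) {
                        acc += spad0[m * k + i] * spad1[n * k + i];
                    }
                    dst[(size_t)(n0 + n) * m_rows + (m0 + m)] = acc;
                }
            }
        }
    }
}
```

Staging rows in fixed-size blocks keeps the working set bounded regardless of matrix size, which is the usual motivation for scratchpad tiling.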

The file also implements mul_mat_id, which handles expert routing in mixture-of-experts (MoE) models using the mmid_row_mapping structure.
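One common way to prepare a mul_mat_id call is to group the input rows by the expert each one is routed to, so that each expert's weight matrix is applied to its rows in one pass. The sketch below shows that grouping under stated assumptions. The struct sketch_row_mapping and its fields are hypothetical stand-ins, since this page does not list the fields of mmid_row_mapping, and the ids layout ([n_tokens][n_used]) and SKETCH_MAX_ROWS capacity are also assumptions.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_MAX_ROWS 64  /* hypothetical per-expert capacity */

/* Hypothetical stand-in for a row mapping entry. */
struct sketch_row_mapping {
    int32_t slot;   /* which of the token's selected experts this entry is */
    int32_t token;  /* index of the token (input row) */
};

/* Groups (slot, token) pairs by expert id.
 * ids    : [n_tokens][n_used] expert ids chosen per token
 * rows   : [n_expert][SKETCH_MAX_ROWS] output buckets
 * counts : [n_expert] number of entries written per bucket */
static void sketch_group_rows_by_expert(const int32_t *ids, int n_tokens, int n_used,
                                        int n_expert,
                                        struct sketch_row_mapping *rows,
                                        int32_t *counts) {
    for (int e = 0; e < n_expert; e++) {
        counts[e] = 0;
    }
    for (int t = 0; t < n_tokens; t++) {
        for (int s = 0; s < n_used; s++) {
            int32_t e = ids[t * n_used + s];
            assert(e >= 0 && e < n_expert);
            assert(counts[e] < SKETCH_MAX_ROWS);
            struct sketch_row_mapping *r = &rows[e * SKETCH_MAX_ROWS + counts[e]];
            r->slot = s;
            r->token = t;
            counts[e]++;
        }
    }
}
```

After grouping, a caller could loop over experts, skip those with a zero count, and run an ordinary matrix multiplication over just the rows in each bucket.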

Usage

The functions in this file are dispatched from the DSP-side message loop for GGML_OP_MUL_MAT and GGML_OP_MUL_MAT_ID operations.

Code Reference

Source Location

Repository: ggml-org/ggml
File: src/ggml-hexagon/htp/matmul-ops.c
Lines: 2476

Key Signatures

#define MM_SPAD_SRC0_NROWS 16
#define MM_SPAD_SRC1_NROWS 16
#define MM_SPAD_DST_NROWS  2

struct htp_matmul_type {
    const char * type;
    void (*vec_dot)(const int n, float * restrict s, const void * restrict vx, const void * restrict vy);
    void (*vec_dot_rx2)(const int n, float * restrict s, const void * restrict vx,
        uint32_t vx_row_size, const void * restrict vy);
};

struct mmid_row_mapping {
    // Maps expert IDs to row indices for MoE dispatch
};
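To make the two function-pointer signatures above concrete, here is a minimal scalar F32xF32 sketch that fills in an htp_matmul_type entry. It is a reference-style illustration, not the HVX implementation. The sketch_ function names and the table entry are hypothetical, and the reading of vec_dot_rx2 (two src0 rows located vx_row_size bytes apart, each dotted against the same vy, with results in s[0] and s[1]) is an assumption drawn from the "dual-row" description.

```c
#include <assert.h>
#include <stdint.h>

/* Copied from the signatures above so the sketch is self-contained. */
struct htp_matmul_type {
    const char * type;
    void (*vec_dot)(const int n, float * restrict s, const void * restrict vx, const void * restrict vy);
    void (*vec_dot_rx2)(const int n, float * restrict s, const void * restrict vx,
        uint32_t vx_row_size, const void * restrict vy);
};

/* Single-row dot product: *s = sum of x[i] * y[i] over n elements. */
static void sketch_vec_dot_f32(const int n, float * restrict s,
                               const void * restrict vx, const void * restrict vy) {
    assert(n >= 0);
    const float *x = vx;
    const float *y = vy;
    float acc = 0.0f;
    for (int i = 0; i < n; i++) {
        acc += x[i] * y[i];
    }
    *s = acc;
}

/* Dual-row dot product (assumed semantics): rows at vx and vx + vx_row_size bytes. */
static void sketch_vec_dot_rx2_f32(const int n, float * restrict s,
                                   const void * restrict vx, uint32_t vx_row_size,
                                   const void * restrict vy) {
    assert(n >= 0);
    const float *x0 = vx;
    const float *x1 = (const float *)((const uint8_t *)vx + vx_row_size);
    const float *y = vy;
    float acc0 = 0.0f;
    float acc1 = 0.0f;
    for (int i = 0; i < n; i++) {
        /* y[i] is loaded once and reused for both rows. */
        acc0 += x0[i] * y[i];
        acc1 += x1[i] * y[i];
    }
    s[0] = acc0;
    s[1] = acc1;
}

/* Hypothetical table entry tying a type name to its kernels. */
static const struct htp_matmul_type sketch_f32_f32 = {
    "f32-f32-sketch", sketch_vec_dot_f32, sketch_vec_dot_rx2_f32
};
```

Processing two rows per call lets each element of vy be loaded once and used twice, which is a plausible reason for a dual-row variant to exist alongside the single-row one.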

I/O Contract

Inputs

  • src0 -- Weight matrix (F32, F16, or quantized: Q4_0, Q8_0, MXFP4)
  • src1 -- Input matrix (typically F32 or F16)
  • op_params -- Operation parameters including type combination info

Outputs

  • dst -- Result matrix (F32)

Usage Examples

Internal matmul dispatch:

// Selected based on the src0/src1 type combination
struct htp_matmul_type matmul = get_matmul_type(src0_type, src1_type);
// Single-row dot product: one src0 row against one src1 row
matmul.vec_dot(n, &result, src0_row, src1_row);
// Or the dual-row variant for higher throughput; results receives one value per row
matmul.vec_dot_rx2(n, results, src0_rows, row_size, src1_row);

Related Pages

Implements Principle

Related Implementations
