Implementation:Ggml org Ggml Metal ops

From Leeroopedia


Metadata

Field | Value
Page Type | Implementation (API Doc)
Knowledge Sources | GGML
Domains | ML_Infrastructure, Tensor_Computing, GPU_Computing
Last Updated | 2025-05-15 12:00 GMT

Overview

Implements the Metal kernel dispatch logic for all supported GGML operations, including operation fusion and concurrent encoding on Apple GPUs.

Description

ggml-metal-ops.cpp is the largest file in the Metal backend (4300+ lines) and contains the core compute dispatch for every supported operation. It provides:

  1. Operation context (ggml_metal_op): A class that manages a Metal command encoder and iterates over graph nodes. It filters empty operations, supports operation fusion (checking if consecutive ops can be merged via ggml_can_fuse_ext), and tracks memory ranges for concurrency via ggml_mem_ranges.
  2. Buffer resolution: The ggml_metal_get_buffer_id helper resolves tensor buffer pointers, accounting for view sources, to obtain Metal buffer identifiers for kernel argument binding.
  3. Per-operation dispatch: Each GGML operation has a dedicated dispatch function that:
    • Selects the appropriate compiled Metal pipeline based on tensor types and quantization formats
    • Populates the kernel argument struct (ggml_metal_kargs_*) with tensor shape and stride metadata
    • Sets buffer bindings for source and destination tensors
    • Dispatches threadgroups with appropriate dimensions
  4. Concurrency management: The ggml_metal_op_concurrency_reset function resets memory range tracking when starting a new concurrent group. Operations that do not conflict in memory can be encoded concurrently.
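
The concurrency rule in item 4 can be illustrated with a small, self-contained sketch. It is not the GGML implementation: MemRange, RangeTracker, and encode_with_tracking are hypothetical names, and the conflict rule (same buffer, overlapping bytes, at least one write) is an assumption about what "conflict in memory" means here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the bookkeeping that ggml_mem_ranges performs.
// None of these names come from the GGML sources.
struct MemRange {
    int            buffer_id; // which buffer the range lives in
    std::uintptr_t begin;     // first byte (inclusive)
    std::uintptr_t end;       // one past the last byte (exclusive)
    bool           is_write;  // true for a destination, false for a source
};

class RangeTracker {
public:
    // Assumed rule: two ranges conflict when they overlap inside the same
    // buffer and at least one of them is written.
    static bool conflicts(const MemRange & a, const MemRange & b) {
        if (a.buffer_id != b.buffer_id) return false;
        if (!a.is_write && !b.is_write) return false;
        return a.begin < b.end && b.begin < a.end;
    }

    // True if r can join the current concurrent group.
    bool can_add(const MemRange & r) const {
        for (const MemRange & seen : ranges_) {
            if (conflicts(seen, r)) return false;
        }
        return true;
    }

    void add(const MemRange & r) {
        assert(r.begin <= r.end);
        ranges_.push_back(r);
    }

    // Start a new concurrent group (conceptually where a barrier would go).
    void reset() { ranges_.clear(); }

    std::size_t size() const { return ranges_.size(); }

private:
    std::vector<MemRange> ranges_;
};

// Records one op's ranges. Returns true if the op conflicted with the
// current group, in which case the group was reset before recording.
bool encode_with_tracking(RangeTracker & t, const std::vector<MemRange> & op_ranges) {
    bool barrier = false;
    for (const MemRange & r : op_ranges) {
        if (!t.can_add(r)) { barrier = true; break; }
    }
    if (barrier) t.reset();
    for (const MemRange & r : op_ranges) t.add(r);
    return barrier;
}
```

In this sketch, an op that reads a range written by an earlier op in the same group forces a reset, while an op that touches a different buffer joins the group without one.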

Key operations dispatched include: matrix multiplication (mul_mat, mul_mv), flash attention, element-wise operations, RoPE, softmax, layer normalization, quantization/dequantization, pooling, convolution, and many more.
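
The four per-operation dispatch steps described above can be sketched without any Metal dependency. Everything in this block is illustrative: the types, the pipeline naming scheme, and the one-threadgroup-per-row grid are assumptions made for the example, not names or policies taken from ggml-metal-ops.cpp.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <string>

// All names below are hypothetical; they only mirror the four dispatch steps.
enum class DType  { F32, F16, Q4_0 };
enum class OpKind { Add, MulMat };

struct Tensor {
    DType type;
    std::array<std::int64_t, 4>  ne; // number of elements per dimension
    std::array<std::uint64_t, 4> nb; // stride in bytes per dimension
};

// Stand-in for a ggml_metal_kargs_* struct: shape and stride metadata only.
struct KArgs {
    std::array<std::int64_t, 4>  ne0;
    std::array<std::uint64_t, 4> nb0;
};

struct Dispatch {
    std::string pipeline;                     // step 1: selected kernel
    KArgs       args;                         // step 2: kernel arguments
    int         n_buffers;                    // step 3: buffers bound
    std::array<std::int64_t, 3> threadgroups; // step 4: grid size
};

static std::string type_name(DType t) {
    switch (t) {
        case DType::F32:  return "f32";
        case DType::F16:  return "f16";
        case DType::Q4_0: return "q4_0";
    }
    throw std::invalid_argument("unknown type");
}

Dispatch plan_dispatch(OpKind op, const Tensor & src0, const Tensor & dst) {
    assert(dst.ne[1] > 0 && dst.ne[2] > 0 && dst.ne[3] > 0);
    Dispatch d;
    // 1. Pick a pipeline from the op and the source type (made-up naming).
    const std::string base = (op == OpKind::Add) ? "add" : "mul_mat";
    d.pipeline = base + "_" + type_name(src0.type);
    // 2. Copy shape and stride metadata into the argument struct.
    d.args = KArgs{src0.ne, src0.nb};
    // 3. Bind src0, src1, and dst (only counted here, not actually bound).
    d.n_buffers = 3;
    // 4. One threadgroup per destination row (a simplification).
    d.threadgroups = {dst.ne[1], dst.ne[2], dst.ne[3]};
    return d;
}
```

A real dispatch function would hand these values to a Metal compute encoder; the sketch stops at computing them so that it stays runnable on any platform.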

Usage

This module is used internally by the Metal backend. It is called from the backend's graph_compute callback when executing a computation graph on Apple GPUs. User code interacts with it indirectly through the GGML backend scheduling API.

Code Reference

Source Location

GGML repo, file: src/ggml-metal/ggml-metal-ops.cpp (4303 lines).

Signatures

ggml_metal_op_t ggml_metal_op_init(
    ggml_metal_device_t dev,
    ggml_metal_cmd_buf_t cmd_buf,
    ggml_cgraph * gf,
    int idx_start, int idx_end,
    bool use_fusion, bool use_concurrency, bool use_capture,
    int debug_graph, int debug_fusion);

void ggml_metal_op_free(ggml_metal_op_t ctx);
int  ggml_metal_op_n_nodes(ggml_metal_op_t ctx);

// Internal dispatch functions (static):
// ggml_metal_op_mul_mat(...)
// ggml_metal_op_flash_attn_ext(...)
// ggml_metal_op_encode(...)

Import

#include "ggml-metal-ops.h"

I/O Contract

Inputs

Parameter | Type | Required | Description
dev | ggml_metal_device_t | Yes | Metal device handle providing access to the shader library and device capabilities.
cmd_buf | ggml_metal_cmd_buf_t | Yes | Metal command buffer into which compute commands are encoded.
gf | ggml_cgraph * | Yes | The computation graph containing the operations to dispatch.
idx_start / idx_end | int | Yes | Range of node indices within the graph to process; passing 0 and gf->n_nodes covers the whole graph.
use_fusion | bool | Yes | Whether to attempt fusing consecutive compatible operations.
use_concurrency | bool | Yes | Whether to enable concurrent kernel encoding for non-conflicting operations.
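
Because each context covers a range of node indices, a caller could in principle encode a graph in several slices. The helper below is a hypothetical illustration of computing such slices; split_node_range is not part of GGML, and the half-open [start, end) convention is an assumption inferred from a whole-graph call that passes 0 and gf->n_nodes.

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical helper: split [0, n_nodes) into at most n_parts half-open
// slices. Each pair could serve as the (idx_start, idx_end) arguments of
// one encoding context.
std::vector<std::pair<int, int>> split_node_range(int n_nodes, int n_parts) {
    assert(n_nodes >= 0 && n_parts > 0);
    std::vector<std::pair<int, int>> out;
    const int per_part = (n_nodes + n_parts - 1) / n_parts; // ceiling division
    for (int start = 0; start < n_nodes; start += per_part) {
        out.emplace_back(start, std::min(start + per_part, n_nodes));
    }
    return out;
}
```

With 10 nodes and 3 parts, this yields the slices [0, 4), [4, 8), and [8, 10); an empty graph yields no slices.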

Outputs

Output | Type | Description
Op context | ggml_metal_op_t | Opaque handle managing the encoding session. Compute commands are written into the provided command buffer as a side effect.

Usage Examples

// Internal: called from the Metal backend's graph_compute implementation
ggml_metal_op_t op = ggml_metal_op_init(
    dev, cmd_buf, gf,
    0, gf->n_nodes,
    /* use_fusion */ true,
    /* use_concurrency */ true,
    /* use_capture */ false,
    /* debug_graph */ 0,
    /* debug_fusion */ 0);

// Once the nodes in the range have been encoded into cmd_buf, release the context
ggml_metal_op_free(op);
