Implementation:Ggml org Ggml Metal ops

Metadata

Field	Value
Page Type	Implementation (API Doc)
Knowledge Sources	GGML
Domains	ML_Infrastructure, Tensor_Computing, GPU_Computing
Last Updated	2025-05-15 12:00 GMT

Overview

Implements the Metal kernel dispatch logic for all supported GGML operations, including operation fusion and concurrent encoding on Apple GPUs.

Description

ggml-metal-ops.cpp is the largest file in the Metal backend (4300+ lines) and contains the core compute dispatch for every supported operation. It provides:

Operation context (ggml_metal_op): A class that manages a Metal command encoder and iterates over graph nodes. It filters empty operations, supports operation fusion (checking if consecutive ops can be merged via ggml_can_fuse_ext), and tracks memory ranges for concurrency via ggml_mem_ranges.
Buffer resolution: The ggml_metal_get_buffer_id helper resolves tensor buffer pointers, accounting for view sources, to obtain Metal buffer identifiers for kernel argument binding.
Per-operation dispatch: Each GGML operation has a dedicated dispatch function that:
- Selects the appropriate compiled Metal pipeline based on tensor types and quantization formats
- Populates the kernel argument struct (ggml_metal_kargs_*) with tensor shape and stride metadata
- Sets buffer bindings for source and destination tensors
- Dispatches threadgroups with appropriate dimensions
Concurrency management: The ggml_metal_op_concurrency_reset function resets memory range tracking when starting a new concurrent group. Operations that do not conflict in memory can be encoded concurrently.
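The conflict check behind concurrent encoding can be pictured with a small, self-contained sketch. It assumes a simple rule: two operations conflict when their byte ranges overlap and at least one of them writes. The `MemRange` and `RangeTracker` names are hypothetical and are not the `ggml_mem_ranges` API; the real tracker may apply different or additional rules.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for one tracked memory range (not the ggml_mem_ranges API).
struct MemRange {
    uint64_t begin;    // first byte, inclusive
    uint64_t end;      // one past the last byte, exclusive
    bool     is_write; // true for a destination tensor, false for a source
};

class RangeTracker {
public:
    // True if `r` can join the current concurrent group: it must not overlap an
    // already tracked range unless both accesses are reads.
    bool compatible(const MemRange & r) const {
        for (const MemRange & seen : ranges_) {
            const bool overlap = r.begin < seen.end && seen.begin < r.end;
            if (overlap && (r.is_write || seen.is_write)) {
                return false;
            }
        }
        return true;
    }

    // Record a range used by an operation in the current group.
    void add(const MemRange & r) {
        assert(r.begin <= r.end);
        ranges_.push_back(r);
    }

    // Start a new concurrent group by forgetting all tracked ranges. In this
    // sketch's model, a real encoder would also place a barrier here.
    void reset() { ranges_.clear(); }

    std::size_t size() const { return ranges_.size(); }

private:
    std::vector<MemRange> ranges_;
};
```

In this model, an encoder would call `compatible` for every source and destination of the next operation, and call `reset` before encoding it whenever any of them conflicts.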

Dispatched operations include matrix-matrix and matrix-vector multiplication (mul_mat, mul_mv), flash attention, element-wise operations, RoPE, softmax, layer normalization, quantization/dequantization, pooling, and convolution, among others.
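The four per-operation dispatch steps described above (pipeline selection, argument struct, buffer bindings, threadgroup dimensions) can be sketched without any Metal dependency. Everything in this block is hypothetical: `TensorDesc`, `KargsRows`, `DispatchPlan`, the `kernel_<op>_<type>` naming, and the one-threadgroup-per-row layout are assumptions chosen for illustration, not the structures or kernel names used in ggml-metal-ops.cpp.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical, simplified tensor descriptor (a stand-in, not ggml_tensor).
struct TensorDesc {
    std::string type;  // e.g. "f32", "f16", "q4_0"
    int64_t     ne[4]; // number of elements per dimension
    uint64_t    nb[4]; // stride in bytes per dimension
};

// Hypothetical argument struct in the spirit of ggml_metal_kargs_*.
struct KargsRows {
    int64_t  ne0, ne1, ne2, ne3;
    uint64_t nb0, nb1, nb2, nb3;
};

struct Grid { int64_t x, y, z; };

// Everything an encoder would need to issue one dispatch.
struct DispatchPlan {
    std::string pipeline;   // step 1: selected pipeline name
    KargsRows   kargs;      // step 2: shape and stride metadata
    int         src_buffer; // step 3: buffer bindings (ids only in this sketch)
    int         dst_buffer;
    Grid        grid;       // step 4: number of threadgroups
    int64_t     threads;    //         threads per threadgroup
};

DispatchPlan plan_row_dispatch(const std::string & op, const TensorDesc & src,
                               int src_buffer, int dst_buffer, int64_t max_threads) {
    assert(max_threads > 0);
    DispatchPlan p;
    // Step 1: choose a pipeline from the operation and the source type.
    p.pipeline = "kernel_" + op + "_" + src.type;
    // Step 2: copy tensor shape and strides into the argument struct.
    p.kargs = { src.ne[0], src.ne[1], src.ne[2], src.ne[3],
                src.nb[0], src.nb[1], src.nb[2], src.nb[3] };
    // Step 3: remember which buffers get bound as source and destination.
    p.src_buffer = src_buffer;
    p.dst_buffer = dst_buffer;
    // Step 4: one threadgroup per row, threads capped by the pipeline limit.
    p.grid    = { src.ne[1], src.ne[2], src.ne[3] };
    p.threads = std::min(src.ne[0], max_threads);
    return p;
}
```

A real dispatch function would hand the equivalent of this plan to the command encoder instead of returning it; returning a plain struct here only keeps the sketch testable on any platform.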

Usage

This module is used internally by the Metal backend. It is called from the backend's graph_compute callback when executing a computation graph on Apple GPUs. User code interacts with it indirectly through the GGML backend scheduling API.

Code Reference

Source Location

GGML repo, file: src/ggml-metal/ggml-metal-ops.cpp (4303 lines).

Signatures

ggml_metal_op_t ggml_metal_op_init(
    ggml_metal_device_t dev,
    ggml_metal_cmd_buf_t cmd_buf,
    ggml_cgraph * gf,
    int idx_start, int idx_end,
    bool use_fusion, bool use_concurrency, bool use_capture,
    int debug_graph, int debug_fusion);

void ggml_metal_op_free(ggml_metal_op_t ctx);
int  ggml_metal_op_n_nodes(ggml_metal_op_t ctx);

// Node encoding and per-operation dispatch functions (parameters elided):
// ggml_metal_op_mul_mat(...)
// ggml_metal_op_flash_attn_ext(...)
// ggml_metal_op_encode(...)

Import

#include "ggml-metal-ops.h"

I/O Contract

Inputs

Parameter	Type	Required	Description
`dev`	`ggml_metal_device_t`	Yes	Metal device handle providing access to the shader library and device capabilities.
`cmd_buf`	`ggml_metal_cmd_buf_t`	Yes	Metal command buffer into which compute commands are encoded.
`gf`	`ggml_cgraph *`	Yes	The computation graph containing the operations to dispatch.
`idx_start / idx_end`	`int`	Yes	Range of node indices within the graph to process. The usage example passes 0 and gf->n_nodes to cover the whole graph.
`use_fusion`	`bool`	Yes	Whether to attempt fusing consecutive compatible operations.
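`use_concurrency`	`bool`	Yes	Whether to enable concurrent kernel encoding for non-conflicting operations.
`use_capture`	`bool`	Yes	Whether GPU capture is active for this encoding session.
`debug_graph`	`int`	Yes	Debug verbosity level for graph encoding; the usage example passes 0.
`debug_fusion`	`int`	Yes	Debug verbosity level for fusion decisions; the usage example passes 0.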

Outputs

Output	Type	Description
Op context	`ggml_metal_op_t`	Opaque handle managing the encoding session. Compute commands are written into the provided command buffer as a side effect.

Usage Examples

// Internal: called from the Metal backend's graph_compute implementation
ggml_metal_op_t op = ggml_metal_op_init(
    dev, cmd_buf, gf,
    0, gf->n_nodes,
    /* use_fusion */ true,
    /* use_concurrency */ true,
    /* use_capture */ false,
    /* debug_graph */ 0,
    /* debug_fusion */ 0);

// Nodes are then encoded into cmd_buf through this context
// (ggml_metal_op_encode, arguments omitted here); free it once encoding is done
ggml_metal_op_free(op);
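How fusion interacts with iteration over graph nodes can be illustrated with a mock. The sketch assumes that encoding a node reports how many consecutive graph nodes it consumed, so that a fused group advances the index by more than one. `MockOpContext`, `encode_all`, and the "mul followed by add" fusion rule are hypothetical and exist only for this example; they do not reproduce the signatures or fusion rules of the Metal backend.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical mock of an encoding context; it records op names instead of
// writing commands into a Metal command buffer.
struct MockOpContext {
    std::vector<std::string> nodes;   // op names in graph order
    bool                     use_fusion;
    std::vector<std::string> encoded; // what was "dispatched"

    int n_nodes() const { return static_cast<int>(nodes.size()); }

    // Encodes the node at idx and returns the number of nodes consumed (>= 1).
    // Illustrative fusion rule: "mul" immediately followed by "add".
    int encode(int idx) {
        assert(idx >= 0 && idx < n_nodes());
        if (use_fusion && idx + 1 < n_nodes() &&
            nodes[idx] == "mul" && nodes[idx + 1] == "add") {
            encoded.push_back("mul+add");
            return 2;
        }
        encoded.push_back(nodes[idx]);
        return 1;
    }
};

// Walks all nodes, skipping over those already covered by a fused dispatch.
void encode_all(MockOpContext & ctx) {
    for (int idx = 0; idx < ctx.n_nodes(); ) {
        idx += ctx.encode(idx);
    }
}
```

With fusion enabled, a four-node sequence of norm, mul, add, rope produces three dispatches in this mock; with fusion disabled it produces four.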
