Implementation:Ggml org Ggml Metal ops
Metadata
| Field | Value |
|---|---|
| Page Type | Implementation (API Doc) |
| Knowledge Sources | GGML |
| Domains | ML_Infrastructure, Tensor_Computing, GPU_Computing |
| Last Updated | 2025-05-15 12:00 GMT |
Overview
Implements the Metal kernel dispatch logic for all supported GGML operations, including operation fusion and concurrent encoding on Apple GPUs.
Description
ggml-metal-ops.cpp is the largest file in the Metal backend (4300+ lines) and contains the core compute dispatch for every supported operation. It provides:
- Operation context (`ggml_metal_op`): A class that manages a Metal command encoder and iterates over graph nodes. It filters empty operations, supports operation fusion (checking whether consecutive ops can be merged via `ggml_can_fuse_ext`), and tracks memory ranges for concurrency via `ggml_mem_ranges`.
- Buffer resolution: The `ggml_metal_get_buffer_id` helper resolves tensor buffer pointers, accounting for view sources, to obtain Metal buffer identifiers for kernel argument binding.
- Per-operation dispatch: Each GGML operation has a dedicated dispatch function that:
  - Selects the appropriate compiled Metal pipeline based on tensor types and quantization formats
  - Populates the kernel argument struct (`ggml_metal_kargs_*`) with tensor shape and stride metadata
  - Sets buffer bindings for source and destination tensors
  - Dispatches threadgroups with appropriate dimensions
- Concurrency management: The `ggml_metal_op_concurrency_reset` function resets memory range tracking when starting a new concurrent group. Operations that do not conflict in memory can be encoded concurrently.
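The memory-range tracking described above can be illustrated with a minimal sketch. It is not the `ggml_mem_ranges` implementation: `MemRange`, `RangeTracker`, and `encode_with_tracking` are hypothetical names introduced here. The sketch assumes that an operation may join the current concurrent group unless one of its byte ranges overlaps a tracked range and at least one of the two is a write.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a tracked byte range (not the ggml_mem_ranges type).
struct MemRange {
    uint64_t begin;    // first byte, inclusive
    uint64_t end;      // one past the last byte, exclusive
    bool     is_write; // true for destination ranges, false for source ranges
};

// Hypothetical tracker for the ranges touched by the current concurrent group.
class RangeTracker {
public:
    // True when `r` can join the current group without a memory hazard.
    bool compatible(const MemRange & r) const {
        for (const MemRange & p : pending_) {
            const bool overlap = r.begin < p.end && p.begin < r.end;
            // Two overlapping reads are harmless; an overlap involving a write is a conflict.
            if (overlap && (r.is_write || p.is_write)) {
                return false;
            }
        }
        return true;
    }

    void add(const MemRange & r) {
        assert(r.begin <= r.end);
        pending_.push_back(r);
    }

    // Start a new concurrent group.
    void reset() { pending_.clear(); }

    std::size_t size() const { return pending_.size(); }

private:
    std::vector<MemRange> pending_;
};

// Registers the ranges of one operation. Returns true when the operation
// conflicted with the current group, so a new group had to be started first.
inline bool encode_with_tracking(RangeTracker & tracker, const std::vector<MemRange> & op_ranges) {
    bool conflict = false;
    for (const MemRange & r : op_ranges) {
        if (!tracker.compatible(r)) {
            conflict = true;
            break;
        }
    }
    if (conflict) {
        tracker.reset(); // a real encoder would also place a memory barrier here
    }
    for (const MemRange & r : op_ranges) {
        tracker.add(r);
    }
    return conflict;
}
```

In this sketch, two operations that read the same source but write disjoint destinations share a group, while an operation that reads a destination written earlier in the group forces a reset.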
Key operations dispatched include: matrix multiplication (mul_mat, mul_mv), flash attention, element-wise operations, RoPE, softmax, layer normalization, quantization/dequantization, pooling, convolution, and many more.
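The per-operation dispatch steps listed in the Description (select a pipeline, fill an argument struct, choose threadgroup dimensions) can be sketched in plain C++ without any Metal calls. Everything below is illustrative: `TensorDesc`, `UnaryKargs`, `Dispatch`, `plan_unary`, and the `kernel_<op>_<type>` naming scheme are assumptions made for this example, not the actual `ggml_metal_kargs_*` layouts or pipeline names. The one-threadgroup-per-row policy is likewise only one plausible choice.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstdint>
#include <string>

enum class ElemType { F32, F16 };

// Hypothetical tensor description: element counts and byte strides per dimension.
struct TensorDesc {
    ElemType                type;
    std::array<int64_t, 4>  ne;
    std::array<uint64_t, 4> nb;
};

// Hypothetical analogue of a kernel argument struct carrying shape and stride metadata.
struct UnaryKargs {
    int64_t  ne0, ne1, ne2, ne3;
    uint64_t nb0, nb1, nb2, nb3;
};

// Everything an encoder would need to issue one dispatch.
struct Dispatch {
    std::string pipeline;          // name of the compiled pipeline to bind
    UnaryKargs  args;              // kernel arguments
    int64_t     tg_x, tg_y, tg_z;  // threadgroup grid
    int         threads_per_tg;    // threads in each threadgroup
};

inline Dispatch plan_unary(const std::string & op, const TensorDesc & src, int max_threads) {
    assert(max_threads > 0);

    Dispatch d;

    // Step 1: pick the pipeline from the operation name and element type.
    d.pipeline = "kernel_" + op + (src.type == ElemType::F32 ? "_f32" : "_f16");

    // Step 2: copy shape and stride metadata into the argument struct.
    d.args = UnaryKargs{src.ne[0], src.ne[1], src.ne[2], src.ne[3],
                        src.nb[0], src.nb[1], src.nb[2], src.nb[3]};

    // Step 3: one threadgroup per row; threads cover the row up to the device limit.
    d.threads_per_tg = static_cast<int>(std::min<int64_t>(src.ne[0], max_threads));
    d.tg_x = src.ne[1];
    d.tg_y = src.ne[2];
    d.tg_z = src.ne[3];

    return d;
}
```

Buffer binding is left out of the sketch because it depends on Metal objects; in a real dispatch it would sit between steps 2 and 3.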
Usage
This module is used internally by the Metal backend. It is called from the backend's graph_compute callback when executing a computation graph on Apple GPUs. User code interacts with it indirectly through the GGML backend scheduling API.
Code Reference
Source Location
GGML repo, file: src/ggml-metal/ggml-metal-ops.cpp (4303 lines).
Signatures
ggml_metal_op_t ggml_metal_op_init(
ggml_metal_device_t dev,
ggml_metal_cmd_buf_t cmd_buf,
ggml_cgraph * gf,
int idx_start, int idx_end,
bool use_fusion, bool use_concurrency, bool use_capture,
int debug_graph, int debug_fusion);
void ggml_metal_op_free(ggml_metal_op_t ctx);
int ggml_metal_op_n_nodes(ggml_metal_op_t ctx);
// Internal dispatch functions (static):
// ggml_metal_op_mul_mat(...)
// ggml_metal_op_flash_attn_ext(...)
// ggml_metal_op_encode(...)
Import
#include "ggml-metal-ops.h"
I/O Contract
Inputs
| Parameter | Type | Required | Description |
|---|---|---|---|
| `dev` | `ggml_metal_device_t` | Yes | Metal device handle providing access to the shader library and device capabilities. |
| `cmd_buf` | `ggml_metal_cmd_buf_t` | Yes | Metal command buffer into which compute commands are encoded. |
| `gf` | `ggml_cgraph *` | Yes | The computation graph containing the operations to dispatch. |
| `idx_start` / `idx_end` | `int` | Yes | Range of node indices within the graph to process. |
| `use_fusion` | `bool` | Yes | Whether to attempt fusing consecutive compatible operations. |
| `use_concurrency` | `bool` | Yes | Whether to enable concurrent kernel encoding for non-conflicting operations. |
Outputs
| Output | Type | Description |
|---|---|---|
| Op context | `ggml_metal_op_t` | Opaque handle managing the encoding session. Compute commands are written into the provided command buffer as a side effect. |
Usage Examples
// Internal: called from the Metal backend's graph_compute implementation
ggml_metal_op_t op = ggml_metal_op_init(
dev, cmd_buf, gf,
0, gf->n_nodes,
/* use_fusion */ true,
/* use_concurrency */ true,
/* use_capture */ false,
/* debug_graph */ 0,
/* debug_fusion */ 0);
// Nodes in [idx_start, idx_end) are encoded through this context; free it once encoding is finished
ggml_metal_op_free(op);
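The Signatures section lists `ggml_metal_op_n_nodes` and `ggml_metal_op_encode` without showing how a caller drives them. One plausible pattern, sketched here with mock types only, is a loop in which each encode call reports how many graph nodes it consumed, so that fused nodes are skipped. `MockOpCtx`, `mock_encode`, and `mock_encode_all` are hypothetical, the return-value convention is an assumption, and the `rms_norm` + `mul` pair is just an example of a fusable sequence.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical mock of an encoding context (not the real ggml_metal_op class).
struct MockOpCtx {
    std::vector<std::string> nodes;      // operation names, in graph order
    bool                     use_fusion; // whether adjacent ops may be merged
    std::vector<std::string> encoded;    // kernels "dispatched" so far
};

// Encodes the node at `idx`. Returns the number of graph nodes consumed:
// 1 for a plain dispatch, 2 when the node was fused with its successor.
inline int mock_encode(MockOpCtx & ctx, int idx) {
    const int n = static_cast<int>(ctx.nodes.size());
    assert(idx >= 0 && idx < n);

    // Illustrative fusion rule: rms_norm immediately followed by mul.
    if (ctx.use_fusion && idx + 1 < n &&
        ctx.nodes[idx] == "rms_norm" && ctx.nodes[idx + 1] == "mul") {
        ctx.encoded.push_back("rms_norm_mul");
        return 2;
    }

    ctx.encoded.push_back(ctx.nodes[idx]);
    return 1;
}

// Walks all nodes, advancing by however many nodes each call consumed.
inline void mock_encode_all(MockOpCtx & ctx) {
    const int n = static_cast<int>(ctx.nodes.size());
    for (int idx = 0; idx < n; ) {
        const int consumed = mock_encode(ctx, idx);
        if (consumed <= 0) {
            break; // treat a non-positive result as an encoding failure
        }
        idx += consumed;
    }
}
```

With fusion enabled, a four-node sequence `rms_norm, mul, mul_mat, add` produces three dispatches in this mock; with fusion disabled it produces four.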