Principle: ggml-org GGML Backend Scheduling
| Attribute | Value |
|---|---|
| Page Type | Principle |
| Full Name | Ggml_org_Ggml_Backend_Scheduling |
| Short Name | Backend_Scheduling |
| Domain Tags | ML_Infrastructure, Hardware_Abstraction |
| Knowledge Source | GGML |
| Last Updated | 2025-05-15 12:00 GMT |
Overview
Description
Backend Scheduling is the principle of orchestrating the execution of computation graphs across multiple heterogeneous hardware backends. In the context of GGML, this refers to the strategy and mechanisms used to distribute tensor operations across available devices such as GPUs (CUDA, Metal, Vulkan, etc.) and CPUs, transparently and efficiently.
The core challenge is to take a single computation graph -- representing an entire model inference or training step -- and determine which backend should execute each operation, how tensors should be placed and transferred between devices, and how the resulting sub-graphs should be ordered and dispatched. Backend scheduling transforms a device-agnostic computation graph into a concrete, multi-device execution plan.
Usage
Backend scheduling is employed whenever a GGML-based application utilizes more than one compute backend, or when a single backend cannot host all tensors and operations of a given graph. Typical scenarios include:
- Multi-GPU inference: Splitting large language model layers across multiple GPUs using priority-based backend assignment.
- GPU+CPU offloading: Running operations that fit in GPU memory on the GPU and falling back to the CPU for the remainder, controlled by the `op_offload` parameter.
- Hybrid accelerator setups: Mixing different accelerator types (e.g., a CUDA GPU and a Vulkan GPU) within a single computation, with the scheduler transparently managing tensor copies and synchronization.
The scheduler is designed so that applications specify the set of available backends in priority order, and the scheduling system handles all placement and data movement decisions automatically.
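As a minimal sketch of this contract (not GGML's actual API; the function and names here are hypothetical), the application's only structural obligation is to list backends in priority order with a universal CPU fallback in the last slot:

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical sketch: the application supplies backend names in priority
 * order (index 0 = highest priority). The one hard requirement assumed
 * here is that the last entry is the CPU backend, which acts as the
 * universal fallback for unsupported operations. */
static bool valid_backend_order(const char **backends, int n) {
    return n > 0 && strcmp(backends[n - 1], "CPU") == 0;
}
```

Everything else, placement, copies, and synchronization, is decided by the scheduler from this ordered list.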
Theoretical Basis
Backend scheduling in GGML draws upon several well-established concepts in systems research and heterogeneous computing:
Heterogeneous Computing
Heterogeneous computing involves coordinating workloads across processors with different architectures and capabilities. The GGML scheduler embodies this by treating each backend (GPU, CPU, specialized accelerator) as an abstract compute unit with its own supported operations, memory, and buffer types. The scheduler must decide which unit executes each operation, balancing throughput, latency, and memory constraints.
Graph Partitioning
The scheduler performs graph splitting: decomposing the full computation graph into contiguous subgraphs (called splits), each assigned to a single backend. This is a form of graph partitioning, where the objective is to minimize inter-partition data transfers (tensor copies between devices) while respecting the constraint that each operation must run on a backend that supports it. Each split boundary implies a synchronization point and potential data movement.
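A rough illustration of the splitting step (a simplified model, not the actual GGML implementation): given a linearized op sequence whose backend assignment is already known, a new split opens whenever the assigned backend changes, and each boundary marks a potential copy plus synchronization point.

```c
/* Hypothetical sketch: partition a linear op sequence into contiguous
 * "splits", one per run of ops assigned to the same backend.
 * Returns the number of splits; split_start[i] records the index of the
 * first op in split i. Every boundary between two splits implies a
 * synchronization point and possibly a tensor copy between devices. */
static int count_splits(const int *backend_of_op, int n_ops, int *split_start) {
    if (n_ops == 0) return 0;
    int n_splits = 0;
    split_start[n_splits++] = 0;          /* the first op always opens a split */
    for (int i = 1; i < n_ops; i++) {
        if (backend_of_op[i] != backend_of_op[i - 1]) {
            split_start[n_splits++] = i;  /* backend changed: open a new split */
        }
    }
    return n_splits;
}
```

Minimizing the number of these boundaries is exactly the graph-partitioning objective described above: fewer splits means fewer inter-device transfers.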
Device Placement
Device placement is the problem of assigning each tensor and operation to a specific device. In GGML, placement is determined by a priority system: backends with lower index have higher priority. The scheduler assigns each operation to the highest-priority backend that supports it. Users may also manually override placement for specific tensors. Weight tensors are preferentially placed on the same backend where their buffer resides, reducing unnecessary copies.
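The weight-placement preference can be sketched as a simple rule (hypothetical names, not the real GGML code): a tensor whose data already resides in a buffer owned by some backend keeps that backend, and only bufferless tensors take the priority-based choice.

```c
#define UNASSIGNED (-1)

/* Hypothetical sketch of the placement preference described above:
 * buffer_backend is the backend owning the tensor's buffer, or
 * UNASSIGNED for tensors with no allocated buffer. Keeping weights on
 * the backend that already holds them avoids an unnecessary copy. */
static int place_tensor(int buffer_backend, int priority_choice) {
    if (buffer_backend != UNASSIGNED) {
        return buffer_backend;  /* weight stays where its buffer lives */
    }
    return priority_choice;     /* otherwise fall back to priority order */
}
```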
Priority-Based Assignment
The GGML scheduler uses a strict priority ordering among backends. The backend array passed during initialization defines this order: index 0 has the highest priority, and the last backend (which must be a CPU backend) serves as the universal fallback. This design guarantees that every operation can be executed, since the CPU backend supports all operations, while preferring faster accelerators when available.
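The priority scan can be illustrated with a small self-contained sketch (the support table and function names are hypothetical, not GGML's internals): each op goes to the lowest-index backend that supports it, and the CPU row, which supports every op, guarantees the scan always succeeds.

```c
#include <stdbool.h>

#define N_BACKENDS 3  /* index 0 = highest priority; last = CPU fallback */
#define N_OP_TYPES 4

/* Hypothetical support table: supports[b][op] says whether backend b can
 * run op. The last row models the CPU backend, which supports all ops. */
static const bool supports[N_BACKENDS][N_OP_TYPES] = {
    {true,  true,  false, false},  /* backend 0: high-priority GPU */
    {true,  false, true,  false},  /* backend 1: second GPU        */
    {true,  true,  true,  true },  /* backend 2: CPU, universal    */
};

/* Strict priority scan: the first (lowest-index) backend that supports
 * the op wins; the all-true CPU row makes a hit inevitable. */
static int assign_backend(int op) {
    for (int b = 0; b < N_BACKENDS; b++) {
        if (supports[b][op]) return b;
    }
    return N_BACKENDS - 1;  /* unreachable while the CPU supports all ops */
}
```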
Parallel Copy Optimization
When parallel mode is enabled, the scheduler can overlap data transfers with computation by maintaining multiple copies of inter-device tensors. This is a form of double-buffering or pipelining, reducing the latency cost of tensor transfers between backends during sequential graph evaluations.
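The copy rotation behind this double-buffering can be sketched as follows (a simplified model under the assumption of two copies per inter-device tensor; not GGML's actual bookkeeping): each graph evaluation writes one slot while the slot written by the previous evaluation remains valid for concurrent reads or transfers.

```c
#define N_COPIES 2  /* two buffered copies of each inter-device tensor */

/* Hypothetical sketch: evaluation k writes the copy at slot k % N_COPIES,
 * so the slot written by evaluation k-1 stays intact and its transfer to
 * the consuming backend can overlap with the current computation. */
static int write_slot(int eval) { return eval % N_COPIES; }
static int read_slot(int eval)  { return (eval + N_COPIES - 1) % N_COPIES; }
```

Because the write and read slots differ on every evaluation, the producer never overwrites the copy the consumer is still using, which is what allows transfer and compute to overlap.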