Principle: ggml-org GGML Backend Scheduling
| Attribute | Value |
|---|---|
| Page Type | Principle |
| Full Name | Ggml_org_Ggml_Backend_Scheduling |
| Short Name | Backend_Scheduling |
| Domain Tags | ML_Infrastructure, Hardware_Abstraction |
| Knowledge Source | GGML |
| Last Updated | 2025-05-15 12:00 GMT |
Overview
Description
Backend Scheduling is the principle of orchestrating the execution of computation graphs across multiple heterogeneous hardware backends. In the context of GGML, this refers to the strategy and mechanisms used to distribute tensor operations across available devices such as GPUs (CUDA, Metal, Vulkan, etc.) and CPUs, transparently and efficiently.
The core challenge is to take a single computation graph -- representing an entire model inference or training step -- and determine which backend should execute each operation, how tensors should be placed and transferred between devices, and how the resulting sub-graphs should be ordered and dispatched. Backend scheduling transforms a device-agnostic computation graph into a concrete, multi-device execution plan.
Usage
Backend scheduling is employed whenever a GGML-based application utilizes more than one compute backend, or when a single backend cannot host all tensors and operations of a given graph. Typical scenarios include:
- Multi-GPU inference: Splitting large language model layers across multiple GPUs using priority-based backend assignment.
- GPU+CPU offloading: Running operations that fit in GPU memory on the GPU and falling back to the CPU for the remainder, controlled by the `op_offload` parameter.
- Hybrid accelerator setups: Mixing different accelerator types (e.g., a CUDA GPU and a Vulkan GPU) within a single computation, with the scheduler transparently managing tensor copies and synchronization.
The scheduler is designed so that applications specify the set of available backends in priority order, and the scheduling system handles all placement and data movement decisions automatically.
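As a minimal sketch of this contract (not GGML's actual API; the function and names here are hypothetical), the application's only structural obligation is to list backends in priority order with a universal CPU fallback in the last slot:

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical sketch: the application supplies backend names in priority
 * order (index 0 = highest priority). The one hard requirement assumed
 * here is that the last entry is the CPU backend, which acts as the
 * universal fallback for unsupported operations. */
static bool valid_backend_order(const char **backends, int n) {
    return n > 0 && strcmp(backends[n - 1], "CPU") == 0;
}
```

Everything else, placement, copies, and synchronization, is decided by the scheduler from this ordered list.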
Theoretical Basis
Backend scheduling in GGML draws upon several well-established concepts in systems research and heterogeneous computing:
Heterogeneous Computing
Heterogeneous computing involves coordinating workloads across processors with different architectures and capabilities. The GGML scheduler embodies this by treating each backend (GPU, CPU, specialized accelerator) as an abstract compute unit with its own supported operations, memory, and buffer types. The scheduler must decide which unit executes each operation, balancing throughput, latency, and memory constraints.
Graph Partitioning
The scheduler performs graph splitting: decomposing the full computation graph into contiguous subgraphs (called splits), each assigned to a single backend. This is a form of graph partitioning, where the objective is to minimize inter-partition data transfers (tensor copies between devices) while respecting the constraint that each operation must run on a backend that supports it. Each split boundary implies a synchronization point and potential data movement.
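A rough illustration of the splitting step (a simplified model, not the actual GGML implementation): given a linearized op sequence whose backend assignment is already known, a new split opens whenever the assigned backend changes, and each boundary marks a potential copy plus synchronization point.

```c
/* Hypothetical sketch: partition a linear op sequence into contiguous
 * "splits", one per run of ops assigned to the same backend.
 * Returns the number of splits; split_start[i] records the index of the
 * first op in split i. Every boundary between two splits implies a
 * synchronization point and possibly a tensor copy between devices. */
static int count_splits(const int *backend_of_op, int n_ops, int *split_start) {
    if (n_ops == 0) return 0;
    int n_splits = 0;
    split_start[n_splits++] = 0;          /* the first op always opens a split */
    for (int i = 1; i < n_ops; i++) {
        if (backend_of_op[i] != backend_of_op[i - 1]) {
            split_start[n_splits++] = i;  /* backend changed: open a new split */
        }
    }
    return n_splits;
}
```

Minimizing the number of these boundaries is exactly the graph-partitioning objective described above: fewer splits means fewer inter-device transfers.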
Device Placement
Device placement is the problem of assigning each tensor and operation to a specific device. In GGML, placement is determined by a priority system: backends with lower index have higher priority. The scheduler assigns each operation to the highest-priority backend that supports it. Users may also manually override placement for specific tensors. Weight tensors are preferentially placed on the same backend where their buffer resides, reducing unnecessary copies.
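The weight-placement preference can be sketched as a simple rule (hypothetical names, not the real GGML code): a tensor whose data already resides in a buffer owned by some backend keeps that backend, and only bufferless tensors take the priority-based choice.

```c
#define UNASSIGNED (-1)

/* Hypothetical sketch of the placement preference described above:
 * buffer_backend is the backend owning the tensor's buffer, or
 * UNASSIGNED for tensors with no allocated buffer. Keeping weights on
 * the backend that already holds them avoids an unnecessary copy. */
static int place_tensor(int buffer_backend, int priority_choice) {
    if (buffer_backend != UNASSIGNED) {
        return buffer_backend;  /* weight stays where its buffer lives */
    }
    return priority_choice;     /* otherwise fall back to priority order */
}
```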
Priority-Based Assignment
The GGML scheduler uses a strict priority ordering among backends. The backend array passed during initialization defines this order: index 0 has the highest priority, and the last backend (which must be a CPU backend) serves as the universal fallback. This design guarantees that every operation can be executed, since the CPU backend supports all operations, while preferring faster accelerators when available.
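The priority scan can be illustrated with a small self-contained sketch (the support table and function names are hypothetical, not GGML's internals): each op goes to the lowest-index backend that supports it, and the CPU row, which supports every op, guarantees the scan always succeeds.

```c
#include <stdbool.h>

#define N_BACKENDS 3  /* index 0 = highest priority; last = CPU fallback */
#define N_OP_TYPES 4

/* Hypothetical support table: supports[b][op] says whether backend b can
 * run op. The last row models the CPU backend, which supports all ops. */
static const bool supports[N_BACKENDS][N_OP_TYPES] = {
    {true,  true,  false, false},  /* backend 0: high-priority GPU */
    {true,  false, true,  false},  /* backend 1: second GPU        */
    {true,  true,  true,  true },  /* backend 2: CPU, universal    */
};

/* Strict priority scan: the first (lowest-index) backend that supports
 * the op wins; the all-true CPU row makes a hit inevitable. */
static int assign_backend(int op) {
    for (int b = 0; b < N_BACKENDS; b++) {
        if (supports[b][op]) return b;
    }
    return N_BACKENDS - 1;  /* unreachable while the CPU supports all ops */
}
```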
Parallel Copy Optimization
When parallel mode is enabled, the scheduler can overlap data transfers with computation by maintaining multiple copies of inter-device tensors. This is a form of double-buffering or pipelining, reducing the latency cost of tensor transfers between backends during sequential graph evaluations.
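The copy rotation behind this double-buffering can be sketched as follows (a simplified model under the assumption of two copies per inter-device tensor; not GGML's actual bookkeeping): each graph evaluation writes one slot while the slot written by the previous evaluation remains valid for concurrent reads or transfers.

```c
#define N_COPIES 2  /* two buffered copies of each inter-device tensor */

/* Hypothetical sketch: evaluation k writes the copy at slot k % N_COPIES,
 * so the slot written by evaluation k-1 stays intact and its transfer to
 * the consuming backend can overlap with the current computation. */
static int write_slot(int eval) { return eval % N_COPIES; }
static int read_slot(int eval)  { return (eval + N_COPIES - 1) % N_COPIES; }
```

Because the write and read slots differ on every evaluation, the producer never overwrites the copy the consumer is still using, which is what allows transfer and compute to overlap.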