
Implementation: ggml_backend_sched_graph_compute (ggml-org/ggml)

From Leeroopedia


Metadata

  • Page Type -- Implementation (API Doc)
  • Knowledge Sources -- GGML
  • Domains -- ML_Infrastructure, Hardware_Abstraction
  • Last Updated -- 2025-05-15 12:00 GMT

Overview

Concrete function for executing a computation graph across one or more hardware backends via the GGML backend scheduler. This is the primary synchronous entry point for graph execution in the GGML library.

Description

ggml_backend_sched_graph_compute is the synchronous graph execution function in GGML's backend scheduler. It takes a pre-built computation graph and executes all of its operations across the assigned hardware backends, blocking until all computation is complete.

Internally, the function performs the following steps:

  1. Delegates to the async variant: Calls ggml_backend_sched_graph_compute_async(sched, graph), which handles allocation and dispatch.
  2. Auto-allocation: If the graph has not been previously allocated (i.e., sched->is_alloc is false), the async function automatically calls ggml_backend_sched_alloc_graph() to allocate memory for all tensors in the graph. If allocation fails, GGML_STATUS_ALLOC_FAILED is returned immediately.
  3. Graph splitting and dispatch: The async function calls ggml_backend_sched_compute_splits(), which iterates over the pre-computed graph splits. Each split is a contiguous subgraph assigned to a single backend. For each split, the corresponding backend's graph_compute method is invoked.
  4. Synchronization: After the async function returns, ggml_backend_sched_synchronize(sched) is called, which iterates over all backends in the scheduler and calls ggml_backend_synchronize() on each. This ensures all asynchronous work has completed before the function returns.

The function returns a ggml_status enum indicating success or the nature of any failure.

After execution completes, output tensors remain in their backend-specific buffers. To retrieve results to host memory, the caller uses ggml_backend_tensor_get().

Code Reference

Source Location

GGML repo, file: src/ggml-backend.cpp, lines 1787-1791.

Signature

enum ggml_status ggml_backend_sched_graph_compute(
    ggml_backend_sched_t sched,
    struct ggml_cgraph * graph
);

Import

#include "ggml-backend.h"

Language

C

Dependencies

  • ggml-backend.h -- Public API header declaring the function.
  • ggml-backend-impl.h -- Internal header defining backend interface structures used by the scheduler.

I/O Contract

Inputs

  • sched (ggml_backend_sched_t, required) -- Handle to the backend scheduler. Must have been created with ggml_backend_sched_new() and configured with one or more backends.
  • graph (struct ggml_cgraph *, required) -- Pointer to the computation graph to execute. The graph must have been built using GGML tensor operations in a valid ggml_context.

Outputs

  • return value (enum ggml_status) -- Status code indicating the result of the computation: GGML_STATUS_SUCCESS on success; GGML_STATUS_ALLOC_FAILED if automatic graph allocation failed; other status codes for backend-specific errors.

Usage Examples

Basic Graph Execution

#include <stdlib.h>  // malloc/free

#include "ggml.h"
#include "ggml-backend.h"

// Assume backends and scheduler have been initialized:
//   ggml_backend_t backend_gpu = ...;
//   ggml_backend_t backend_cpu = ...;
//   ggml_backend_sched_t sched = ggml_backend_sched_new(...);

// Build the computation graph
struct ggml_context * ctx = ggml_init(params);
struct ggml_tensor * a = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 128, 64);
struct ggml_tensor * b = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 128, 64);
struct ggml_tensor * result = ggml_mul_mat(ctx, a, b);

struct ggml_cgraph * graph = ggml_new_graph(ctx);
ggml_build_forward_expand(graph, result);

// Execute the graph (allocation happens automatically on first call)
enum ggml_status status = ggml_backend_sched_graph_compute(sched, graph);
if (status != GGML_STATUS_SUCCESS) {
    // handle error
}

// Retrieve the result from the backend buffer to host memory
float * output_data = malloc(ggml_nbytes(result));
ggml_backend_tensor_get(result, output_data, 0, ggml_nbytes(result));
// ... use output_data ...
free(output_data);

Repeated Execution (Inference Loop)

#include "ggml.h"
#include "ggml-backend.h"

// Build the graph once
struct ggml_cgraph * graph = build_graph(sched);

// Execute the same graph structure repeatedly.
// On the first iteration the graph is allocated automatically.
// Subsequent iterations reuse the allocation.
for (int i = 0; i < num_iterations; ++i) {
    enum ggml_status status = ggml_backend_sched_graph_compute(sched, graph);
    if (status != GGML_STATUS_SUCCESS) {
        break;
    }
}

Execution with Explicit Allocation and Input Setting

#include <stdlib.h>  // malloc/free

#include "ggml.h"
#include "ggml-backend.h"

// Build a new graph
struct ggml_cgraph * graph = build_graph(sched);

// Reset the scheduler to clear any previous allocation
ggml_backend_sched_reset(sched);

// Explicitly allocate the graph (without executing); returns false on failure
if (!ggml_backend_sched_alloc_graph(sched, graph)) {
    // handle allocation failure
}

// Set input data on the allocated tensors
ggml_backend_tensor_set(input_tensor, input_data, 0, ggml_nbytes(input_tensor));

// Execute the pre-allocated graph
enum ggml_status status = ggml_backend_sched_graph_compute(sched, graph);

// Retrieve output
float * output = malloc(ggml_nbytes(output_tensor));
ggml_backend_tensor_get(output_tensor, output, 0, ggml_nbytes(output_tensor));
// ... use output ...
free(output);

Retrieving Results with ggml_backend_tensor_get

After graph execution, output tensors reside in backend-specific memory (e.g., GPU VRAM). The ggml_backend_tensor_get function copies tensor data from the backend buffer to a caller-provided host memory buffer.

// Signature (src/ggml-backend.cpp:L297-310):
void ggml_backend_tensor_get(
    const struct ggml_tensor * tensor,
    void * data,
    size_t offset,
    size_t size
);

// Example: retrieve the full contents of an output tensor
float * host_buffer = malloc(ggml_nbytes(result));
ggml_backend_tensor_get(result, host_buffer, 0, ggml_nbytes(result));

// Example: retrieve a partial slice (e.g., first 256 bytes)
float * partial = malloc(256);
ggml_backend_tensor_get(result, partial, 0, 256);

The function asserts that the tensor has a valid buffer, that the tensor data pointer is non-null, and that offset + size does not exceed the tensor's byte size. Internally, it delegates to the buffer interface's get_tensor method, which handles the device-to-host transfer for the specific backend (e.g., cudaMemcpy for CUDA, direct memory copy for CPU).
