Implementation:Ggml org Ggml Rpc backend

Metadata

Field	Value
Page Type	Implementation (API Doc)
Knowledge Sources	GGML
Domains	ML_Infrastructure, Tensor_Computing, Distributed_Computing
Last Updated	2025-05-15 12:00 GMT

Overview

Implements a network-transparent backend that forwards GGML operations over TCP sockets to a remote server for distributed inference.

Description

ggml-rpc.cpp implements both the client and server sides of the GGML RPC backend in approximately 2,100 lines. The key components include:

- Binary protocol: All RPC structures are packed (#pragma pack(push, 1)) for wire-compatible binary serialization. The rpc_tensor struct serializes tensor metadata (id, type, buffer pointer, dimensions, strides, op, op_params, source tensor IDs, view information, and name). Its size must be a multiple of 8 bytes.
- RPC commands (17 total): The protocol defines:
  - Buffer management: ALLOC_BUFFER, FREE_BUFFER, BUFFER_GET_BASE, BUFFER_CLEAR
  - Tensor operations: SET_TENSOR, SET_TENSOR_HASH, GET_TENSOR, COPY_TENSOR, INIT_TENSOR
  - Computation: GRAPH_COMPUTE, GRAPH_RECOMPUTE
  - Device queries: GET_ALIGNMENT, GET_MAX_SIZE, GET_DEVICE_MEMORY, GET_ALLOC_SIZE
  - Protocol: HELLO (version handshake, fixed at command index 14), DEVICE_COUNT
- Hash-based deduplication: For tensor data larger than 10 MB (HASH_THRESHOLD), the client attempts SET_TENSOR_HASH first, which sends only a hash of the data. If the server already has the data cached, no transfer is needed; otherwise the client falls back to a regular SET_TENSOR with the full payload.
- Cross-platform networking: Uses a socket abstraction supporting both Windows (Winsock2) and POSIX sockets with RAII cleanup. Large transfers are split into chunks of at most 1 GiB.
- Graph caching: A graph_cache struct on the server allows GRAPH_RECOMPUTE to re-execute a previously submitted graph without resending the topology.
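
The wire-level rules in the list above (packed layout, hash threshold, chunk cap) can be illustrated with a minimal sketch. It is not taken from ggml-rpc.cpp: the struct `demo_tensor_hdr`, the constants `DEMO_HASH_THRESHOLD` and `DEMO_MAX_CHUNK`, and both helper functions are hypothetical names, and the sketch assumes that "10 MB" means 10 × 1024 × 1024 bytes.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical header, packed so that no padding bytes reach the wire.
// Without packing, most ABIs would pad this struct to 24 bytes.
#pragma pack(push, 1)
struct demo_tensor_hdr {
    uint32_t type;    // 4 bytes
    uint64_t id;      // 8 bytes, unaligned because of the packing
    uint32_t n_dims;  // 4 bytes
};
#pragma pack(pop)

// Same kind of check the description states for rpc_tensor.
static_assert(sizeof(demo_tensor_hdr) == 16, "packed size should be 16 bytes");
static_assert(sizeof(demo_tensor_hdr) % 8 == 0, "size must be a multiple of 8 bytes");

// Assumed values, based on the description: 10 MB threshold, 1 GiB chunk cap.
constexpr size_t DEMO_HASH_THRESHOLD = 10u * 1024u * 1024u;
constexpr size_t DEMO_MAX_CHUNK      = size_t(1) << 30;

// True when the client would try the hash-only command before a full transfer.
inline bool demo_should_try_hash(size_t n_bytes) {
    return n_bytes > DEMO_HASH_THRESHOLD;
}

// Split a transfer of `total` bytes into chunk sizes no larger than `max_chunk`.
inline std::vector<size_t> demo_plan_chunks(size_t total, size_t max_chunk = DEMO_MAX_CHUNK) {
    std::vector<size_t> chunks;
    if (max_chunk == 0) {
        return chunks;  // guard against an endless loop
    }
    for (size_t sent = 0; sent < total; ) {
        const size_t n = std::min(max_chunk, total - sent);
        chunks.push_back(n);
        sent += n;
    }
    return chunks;
}
```

For instance, `demo_plan_chunks(10, 4)` yields chunk sizes 4, 4, and 2, and a payload of exactly `DEMO_HASH_THRESHOLD` bytes does not take the hash path because the comparison is strict.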

Usage

On the client, call ggml_backend_rpc_init(endpoint, device) to connect to a remote server. On the server, call ggml_backend_rpc_start_server() with the local backend devices that should be exposed over the network.

Code Reference

Source Location

GGML repo, file: src/ggml-rpc/ggml-rpc.cpp (2118 lines).

Signatures

ggml_backend_t ggml_backend_rpc_init(const char * endpoint, uint32_t device);
bool ggml_backend_is_rpc(ggml_backend_t backend);
ggml_backend_buffer_type_t ggml_backend_rpc_buffer_type(const char * endpoint, uint32_t device);
void ggml_backend_rpc_get_device_memory(const char * endpoint, uint32_t device, size_t * free, size_t * total);
void ggml_backend_rpc_start_server(const char * endpoint, const char * cache_dir,
                                   size_t n_threads, size_t n_devices, ggml_backend_dev_t * devices);
ggml_backend_reg_t ggml_backend_rpc_reg(void);
ggml_backend_reg_t ggml_backend_rpc_add_server(const char * endpoint);

Import

#include "ggml-rpc.h"

I/O Contract

Inputs

Parameter	Type	Required	Description
`endpoint`	`const char *`	Yes	Network address in `host:port` format for the RPC server connection.
`device`	`uint32_t`	Yes	Device index on the remote server (0-based).
`cache_dir`	`const char *`	No	Directory for caching tensor data on the server side (enables hash-based deduplication).
`n_threads`	`size_t`	Yes	Number of server worker threads.
`n_devices`	`size_t`	Yes	Number of local backend devices to expose via the server.
`devices`	`ggml_backend_dev_t *`	Yes	Array of local backend devices to serve.
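
Since `endpoint` is a `host:port` string, a caller may want to validate it before handing it to the API. The following is a minimal sketch of such a check, assuming C++17; `demo_parse_endpoint` is a hypothetical helper and not part of ggml-rpc.h, and it may differ from how the library parses endpoints internally.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <utility>

// Hypothetical helper: split "host:port" into (host, port).
// Returns std::nullopt when the host is empty, the port is missing,
// non-numeric, zero, or larger than 65535.
inline std::optional<std::pair<std::string, uint16_t>>
demo_parse_endpoint(const std::string & endpoint) {
    const size_t pos = endpoint.rfind(':');
    if (pos == std::string::npos || pos == 0 || pos + 1 >= endpoint.size()) {
        return std::nullopt;
    }
    unsigned long port = 0;
    for (size_t i = pos + 1; i < endpoint.size(); ++i) {
        const char ch = endpoint[i];
        if (ch < '0' || ch > '9') {
            return std::nullopt;  // digits only
        }
        port = port * 10 + static_cast<unsigned long>(ch - '0');
        if (port > 65535) {
            return std::nullopt;  // out of TCP port range
        }
    }
    if (port == 0) {
        return std::nullopt;      // port 0 is not a connectable port
    }
    return std::make_pair(endpoint.substr(0, pos), static_cast<uint16_t>(port));
}
```

A valid result can then be passed on unchanged, for example by calling `ggml_backend_rpc_init(endpoint.c_str(), 0)` only when `demo_parse_endpoint(endpoint)` holds a value.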

Outputs

Output	Type	Description
Backend handle	`ggml_backend_t`	Opaque handle to the RPC client backend that proxies all operations to the remote server.
Buffer type	`ggml_backend_buffer_type_t`	Buffer type for remote device memory allocation.
Device memory	`size_t * free, size_t * total`	Free and total memory on the remote device (via output parameters).

Usage Examples

#include "ggml-rpc.h"
#include "ggml-backend.h"

// --- Client process: connect to a remote RPC server ---
const char * endpoint = "192.168.1.100:50052";
ggml_backend_t rpc_backend = ggml_backend_rpc_init(endpoint, 0);

if (rpc_backend && ggml_backend_is_rpc(rpc_backend)) {
    // Query free and total memory on remote device 0
    size_t free_mem = 0, total_mem = 0;
    ggml_backend_rpc_get_device_memory(endpoint, 0, &free_mem, &total_mem);

    // Use with the scheduler just like any local backend.
    // Note: the parameter list of ggml_backend_sched_new() differs between
    // GGML versions; check ggml-backend.h for the one you build against.
    ggml_backend_sched_t sched = ggml_backend_sched_new(
        &rpc_backend, NULL, 1, GGML_DEFAULT_GRAPH_SIZE, false);

    // `graph` is a ggml_cgraph * assumed to have been built elsewhere
    ggml_backend_sched_graph_compute(sched, graph);

    ggml_backend_sched_free(sched);
    ggml_backend_free(rpc_backend);
}

// --- Server process: expose local GPU backends over the network ---
// cuda_dev_0 and cuda_dev_1 are ggml_backend_dev_t handles assumed to have
// been obtained elsewhere
ggml_backend_dev_t devices[2] = { cuda_dev_0, cuda_dev_1 };
ggml_backend_rpc_start_server("0.0.0.0:50052", "/tmp/rpc_cache", 4, 2, devices);
