Implementation:Ggml org Ggml Rpc backend
Metadata
| Field | Value |
|---|---|
| Page Type | Implementation (API Doc) |
| Knowledge Sources | GGML |
| Domains | ML_Infrastructure, Tensor_Computing, Distributed_Computing |
| Last Updated | 2025-05-15 12:00 GMT |
Overview
Implements a network-transparent backend that forwards GGML operations over TCP sockets to a remote server for distributed inference.
Description
ggml-rpc.cpp implements both the client and server sides of the GGML RPC backend in approximately 2,100 lines. The key components include:
- Binary protocol: All RPC structures are packed (#pragma pack(push, 1)) for wire-compatible binary serialization. The rpc_tensor struct serializes tensor metadata (id, type, buffer pointer, dimensions, strides, op, op_params, source tensor IDs, view information, and name). Its size must be a multiple of 8 bytes.
- RPC commands (17 total): The protocol defines:
  - Buffer management: ALLOC_BUFFER, FREE_BUFFER, BUFFER_GET_BASE, BUFFER_CLEAR
  - Tensor operations: SET_TENSOR, SET_TENSOR_HASH, GET_TENSOR, COPY_TENSOR, INIT_TENSOR
  - Computation: GRAPH_COMPUTE, GRAPH_RECOMPUTE
  - Device queries: GET_ALIGNMENT, GET_MAX_SIZE, GET_DEVICE_MEMORY, GET_ALLOC_SIZE
  - Protocol: HELLO (version handshake, fixed at command index 14), DEVICE_COUNT
- Hash-based deduplication: For tensor data larger than 10 MB (HASH_THRESHOLD), the client attempts SET_TENSOR_HASH first, which sends only a hash of the data. If the server already has the data cached, no transfer is needed.
- Cross-platform networking: Uses a socket abstraction supporting both Windows (Winsock2) and POSIX sockets with RAII cleanup. Large transfers are chunked at 1 GiB maximum.
- Graph caching: A graph_cache struct on the server allows GRAPH_RECOMPUTE to re-execute a previously submitted graph without resending the topology.
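The wire-format and transfer rules listed above can be illustrated with a small, standalone sketch. Everything prefixed with demo_ or DEMO_ is hypothetical and not taken from ggml-rpc.cpp: the struct is a trimmed-down stand-in for rpc_tensor, the threshold assumes that "10 MB" means 10 * 1024 * 1024 bytes, and 64-bit FNV-1a is used only as an example hash function, since the description above does not name the hash the protocol uses.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for rpc_tensor (the real struct carries more fields).
// With 1-byte packing no hidden padding is inserted, so the layout is the same
// on both ends of the connection.
#pragma pack(push, 1)
struct demo_rpc_tensor {
    uint64_t id;
    uint32_t type;
    uint64_t buffer;
    uint32_t ne[4];
    uint32_t nb[4];
    char     padding[4]; // explicit padding: 52 bytes of fields rounded up to 56
};
#pragma pack(pop)

static_assert(sizeof(demo_rpc_tensor) % 8 == 0,
              "wire struct size must be a multiple of 8 bytes");

// Assumed values mirroring the limits described in the text.
const size_t DEMO_HASH_THRESHOLD = 10u * 1024 * 1024;   // "10 MB"
const size_t DEMO_MAX_CHUNK      = 1024u * 1024 * 1024; // 1 GiB

// 64-bit FNV-1a over a byte range (example hash, chosen for illustration).
uint64_t demo_fnv1a(const uint8_t * data, size_t len) {
    uint64_t hash = 0xcbf29ce484222325ULL; // FNV offset basis
    for (size_t i = 0; i < len; ++i) {
        hash ^= data[i];
        hash *= 0x100000001b3ULL;          // FNV prime
    }
    return hash;
}

// True when the client should offer a hash before sending the payload.
bool demo_should_try_hash(size_t size) {
    return size > DEMO_HASH_THRESHOLD;
}

// Splits a transfer of `total` bytes into chunk sizes of at most `max_chunk`.
std::vector<size_t> demo_split_chunks(size_t total, size_t max_chunk) {
    assert(max_chunk > 0);
    std::vector<size_t> chunks;
    while (total > 0) {
        const size_t n = std::min(total, max_chunk);
        chunks.push_back(n);
        total -= n;
    }
    return chunks;
}
```

In this sketch a client would call demo_should_try_hash(size) before uploading; only if the server reports a cache miss would it fall back to sending the payload, split with demo_split_chunks(size, DEMO_MAX_CHUNK).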
Usage
Client-side usage involves calling ggml_backend_rpc_init(endpoint, device) to connect to a remote server. Server-side usage involves calling ggml_backend_rpc_start_server() with local backend devices to expose them over the network.
Code Reference
Source Location
GGML repo, file: src/ggml-rpc/ggml-rpc.cpp (2118 lines).
Signatures
ggml_backend_t ggml_backend_rpc_init(const char * endpoint, uint32_t device);
bool ggml_backend_is_rpc(ggml_backend_t backend);
ggml_backend_buffer_type_t ggml_backend_rpc_buffer_type(const char * endpoint, uint32_t device);
void ggml_backend_rpc_get_device_memory(const char * endpoint, uint32_t device, size_t * free, size_t * total);
void ggml_backend_rpc_start_server(const char * endpoint, const char * cache_dir,
                                   size_t n_threads, size_t n_devices, ggml_backend_dev_t * devices);
ggml_backend_reg_t ggml_backend_rpc_reg(void);
ggml_backend_reg_t ggml_backend_rpc_add_server(const char * endpoint);
Import
#include "ggml-rpc.h"
I/O Contract
Inputs
| Parameter | Type | Required | Description |
|---|---|---|---|
| endpoint | const char * | Yes | Network address in host:port format for the RPC server connection. |
| device | uint32_t | Yes | Device index on the remote server (0-based). |
| cache_dir | const char * | No | Directory for caching tensor data on the server side (enables hash-based deduplication). |
| n_threads | size_t | Yes | Number of server worker threads. |
| n_devices | size_t | Yes | Number of local backend devices to expose via the server. |
| devices | ggml_backend_dev_t * | Yes | Array of local backend devices to serve. |
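As an illustration of the host:port format expected for endpoint, the following sketch splits such a string into its parts. demo_parse_endpoint is a hypothetical helper written for this page, not a function exported by ggml-rpc.h, and its validation rules (non-empty host, decimal port in 1..65535) are assumptions rather than the library's documented behavior.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: splits "host:port" at the last ':'.
// Returns false if the separator is missing, either part is empty,
// or the port is not a decimal number in the range 1..65535.
bool demo_parse_endpoint(const std::string & endpoint, std::string & host, int & port) {
    const std::size_t pos = endpoint.rfind(':');
    if (pos == std::string::npos || pos == 0 || pos + 1 == endpoint.size()) {
        return false;
    }
    long value = 0;
    for (std::size_t i = pos + 1; i < endpoint.size(); ++i) {
        const char c = endpoint[i];
        if (c < '0' || c > '9') {
            return false;           // non-digit in the port part
        }
        value = value * 10 + (c - '0');
        if (value > 65535) {
            return false;           // port out of range
        }
    }
    if (value == 0) {
        return false;               // port 0 is rejected in this sketch
    }
    host = endpoint.substr(0, pos);
    port = static_cast<int>(value);
    return true;
}
```

Validating the endpoint string before passing it to ggml_backend_rpc_init can make a malformed address easier to diagnose than a failed connection attempt.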
Outputs
| Output | Type | Description |
|---|---|---|
| Backend handle | ggml_backend_t | Opaque handle to the RPC client backend that proxies all operations to the remote server. |
| Buffer type | ggml_backend_buffer_type_t | Buffer type for remote device memory allocation. |
| Device memory | size_t * free, size_t * total | Free and total memory on the remote device (via output parameters). |
Usage Examples
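#include "ggml-rpc.h"
#include "ggml-backend.h"

// Client: connect to device 0 of a remote RPC server
const char * endpoint = "192.168.1.100:50052";
ggml_backend_t rpc_backend = ggml_backend_rpc_init(endpoint, 0);
if (rpc_backend && ggml_backend_is_rpc(rpc_backend)) {
    // Query remote device memory (results are returned via output parameters)
    size_t free_mem = 0, total_mem = 0;
    ggml_backend_rpc_get_device_memory(endpoint, 0, &free_mem, &total_mem);

    // Use with the scheduler like any local backend. `graph` is assumed to be
    // a ggml_cgraph * built elsewhere. Depending on the ggml revision,
    // ggml_backend_sched_new may take additional arguments or expect a CPU
    // backend as the last entry; check ggml-backend.h for the exact signature.
    ggml_backend_sched_t sched = ggml_backend_sched_new(
        &rpc_backend, NULL, 1, GGML_DEFAULT_GRAPH_SIZE, false);
    ggml_backend_sched_graph_compute(sched, graph);

    // Release the scheduler before the backend it references
    ggml_backend_sched_free(sched);
    ggml_backend_free(rpc_backend);
}

// Server (run in a separate process): expose local GPU backends over the network.
// cuda_dev_0 and cuda_dev_1 are assumed to be ggml_backend_dev_t handles obtained elsewhere.
ggml_backend_dev_t devices[2] = { cuda_dev_0, cuda_dev_1 };
ggml_backend_rpc_start_server("0.0.0.0:50052", "/tmp/rpc_cache", 4, 2, devices);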