Implementation:Ggml org Ggml Rpc backend

From Leeroopedia


Metadata

| Field | Value |
|---|---|
| Page Type | Implementation (API Doc) |
| Knowledge Sources | GGML |
| Domains | ML_Infrastructure, Tensor_Computing, Distributed_Computing |
| Last Updated | 2025-05-15 12:00 GMT |

Overview

Implements a network-transparent backend that forwards GGML operations over TCP sockets to a remote server for distributed inference.

Description

ggml-rpc.cpp implements both the client and server sides of the GGML RPC backend in approximately 2,100 lines. The key components include:

  1. Binary protocol: All RPC structures are packed (#pragma pack(push, 1)) for wire-compatible binary serialization. The rpc_tensor struct serializes tensor metadata (id, type, buffer pointer, dimensions, strides, op, op_params, source tensor IDs, view information, and name). Its size must be a multiple of 8 bytes.
  2. RPC commands (17 total): The protocol defines:
    • Buffer management: ALLOC_BUFFER, FREE_BUFFER, BUFFER_GET_BASE, BUFFER_CLEAR
    • Tensor operations: SET_TENSOR, SET_TENSOR_HASH, GET_TENSOR, COPY_TENSOR, INIT_TENSOR
    • Computation: GRAPH_COMPUTE, GRAPH_RECOMPUTE
    • Device queries: GET_ALIGNMENT, GET_MAX_SIZE, GET_DEVICE_MEMORY, GET_ALLOC_SIZE
    • Protocol: HELLO (version handshake, fixed at command index 14), DEVICE_COUNT
  3. Hash-based deduplication: For tensor data larger than 10 MiB (HASH_THRESHOLD), the client first attempts SET_TENSOR_HASH, which sends only a hash of the data. If the server already has the data cached, the payload is not transferred; otherwise the client falls back to a regular SET_TENSOR.
  4. Cross-platform networking: Uses a socket abstraction supporting both Windows (Winsock2) and POSIX sockets with RAII cleanup. Large transfers are chunked at 1 GiB maximum.
  5. Graph caching: A graph_cache struct on the server allows GRAPH_RECOMPUTE to re-execute a previously submitted graph without resending the topology.
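
The wire-format and transfer rules listed above can be sketched in a few standalone lines. This is an illustration under stated assumptions, not the code in ggml-rpc.cpp: `demo_tensor_header`, `kHashThreshold`, `kMaxChunk`, `fnv1a_64`, `should_try_hash` and `chunk_sizes` are hypothetical names, the header holds only a subset of the fields that rpc_tensor carries, the threshold is taken to be 10 MiB, and 64-bit FNV-1a stands in for whatever hash the backend actually uses.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical, simplified wire header. Packing removes compiler padding so
// both ends of the connection see the same byte layout.
#pragma pack(push, 1)
struct demo_tensor_header {
    uint64_t id;      // tensor identifier
    uint32_t type;    // element type tag
    uint32_t n_dims;  // hypothetical field, keeps the size a multiple of 8
    uint64_t buffer;  // remote buffer pointer, sent as an integer
    uint32_t ne[4];   // dimensions
    uint32_t nb[4];   // strides in bytes
};
#pragma pack(pop)

// Mirrors the rule that the serialized struct size is a multiple of 8 bytes.
static_assert(sizeof(demo_tensor_header) % 8 == 0,
              "wire struct size must be a multiple of 8 bytes");

constexpr size_t kHashThreshold = 10u * 1024 * 1024;  // assumed: 10 MiB
constexpr size_t kMaxChunk      = size_t(1) << 30;    // 1 GiB per send/recv

// 64-bit FNV-1a, used here only as a stand-in content hash.
inline uint64_t fnv1a_64(const uint8_t * data, size_t len) {
    uint64_t hash = 0xcbf29ce484222325ULL;  // FNV offset basis
    for (size_t i = 0; i < len; ++i) {
        hash ^= data[i];
        hash *= 0x100000001b3ULL;           // FNV prime
    }
    return hash;
}

// The hash-first path is only worth trying for payloads above the threshold.
inline bool should_try_hash(size_t n_bytes) {
    return n_bytes > kHashThreshold;
}

// Split a transfer of `total` bytes into pieces of at most `max_chunk` bytes.
inline std::vector<size_t> chunk_sizes(size_t total, size_t max_chunk = kMaxChunk) {
    assert(max_chunk > 0);
    std::vector<size_t> sizes;
    while (total > 0) {
        const size_t n = total < max_chunk ? total : max_chunk;
        sizes.push_back(n);
        total -= n;
    }
    return sizes;
}
```

A sender following this sketch would call `should_try_hash(size)` before uploading, send `fnv1a_64(data, size)` when it returns true, and loop over `chunk_sizes(size)` for the actual socket writes if the server reports a cache miss.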

Usage

On the client, call ggml_backend_rpc_init(endpoint, device) to connect to a remote server and obtain a backend handle. On the server, call ggml_backend_rpc_start_server() with the local backend devices that should be exposed over the network.

Code Reference

Source Location

GGML repo, file: src/ggml-rpc/ggml-rpc.cpp (2118 lines).

Signatures

ggml_backend_t ggml_backend_rpc_init(const char * endpoint, uint32_t device);
bool ggml_backend_is_rpc(ggml_backend_t backend);
ggml_backend_buffer_type_t ggml_backend_rpc_buffer_type(const char * endpoint, uint32_t device);
void ggml_backend_rpc_get_device_memory(const char * endpoint, uint32_t device, size_t * free, size_t * total);
void ggml_backend_rpc_start_server(const char * endpoint, const char * cache_dir,
                                   size_t n_threads, size_t n_devices, ggml_backend_dev_t * devices);
ggml_backend_reg_t ggml_backend_rpc_reg(void);
ggml_backend_reg_t ggml_backend_rpc_add_server(const char * endpoint);

Import

#include "ggml-rpc.h"

I/O Contract

Inputs

| Parameter | Type | Required | Description |
|---|---|---|---|
| endpoint | const char * | Yes | Network address in host:port format for the RPC server connection. |
| device | uint32_t | Yes | Device index on the remote server (0-based). |
| cache_dir | const char * | No | Directory for caching tensor data on the server side (enables hash-based deduplication). |
| n_threads | size_t | Yes | Number of server worker threads. |
| n_devices | size_t | Yes | Number of local backend devices to expose via the server. |
| devices | ggml_backend_dev_t * | Yes | Array of local backend devices to serve. |
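
Since the endpoint is passed as a single host:port string, it can help to validate it before handing it to the RPC functions. The helper below is a minimal sketch of one way to do that; `parse_endpoint` is a hypothetical name and is not part of ggml-rpc.h, and treating port 0 as invalid is an assumption of this sketch.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <string>

// Hypothetical helper: split "host:port" into its parts.
// Returns false if the separator is missing, the host is empty, or the port
// is not a decimal number in the range 1..65535.
inline bool parse_endpoint(const std::string & endpoint, std::string & host, uint16_t & port) {
    const size_t pos = endpoint.rfind(':');
    if (pos == std::string::npos || pos == 0 || pos + 1 >= endpoint.size()) {
        return false;
    }
    const std::string port_str = endpoint.substr(pos + 1);
    if (port_str.size() > 5) {
        return false;  // more digits than any valid port has
    }
    for (char c : port_str) {
        if (c < '0' || c > '9') {
            return false;
        }
    }
    const unsigned long value = std::strtoul(port_str.c_str(), nullptr, 10);
    if (value == 0 || value > 65535) {
        return false;  // port 0 is rejected in this sketch
    }
    host = endpoint.substr(0, pos);
    port = static_cast<uint16_t>(value);
    return true;
}
```

For example, `parse_endpoint("192.168.1.100:50052", host, port)` fills `host` with `192.168.1.100` and `port` with `50052`, while a string without a colon is rejected.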

Outputs

| Output | Type | Description |
|---|---|---|
| Backend handle | ggml_backend_t | Opaque handle to the RPC client backend that proxies all operations to the remote server. |
| Buffer type | ggml_backend_buffer_type_t | Buffer type for remote device memory allocation. |
| Device memory | size_t * free, size_t * total | Free and total memory on the remote device (via output parameters). |

Usage Examples

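#include "ggml-rpc.h"
#include "ggml-backend.h"

// Client: connect to device 0 of a remote RPC server.
// `graph` is assumed to be a ggml_cgraph * built elsewhere.
const char * endpoint = "192.168.1.100:50052";
ggml_backend_t rpc_backend = ggml_backend_rpc_init(endpoint, 0);

if (rpc_backend && ggml_backend_is_rpc(rpc_backend)) {
    // Query free and total memory on the remote device
    size_t free_mem = 0, total_mem = 0;
    ggml_backend_rpc_get_device_memory(endpoint, 0, &free_mem, &total_mem);

    // Use with the scheduler just like any local backend.
    // Check the ggml_backend_sched_new signature in the ggml-backend.h you
    // build against; its argument list has changed across ggml revisions.
    ggml_backend_sched_t sched = ggml_backend_sched_new(
        &rpc_backend, NULL, 1, GGML_DEFAULT_GRAPH_SIZE, false);

    ggml_backend_sched_graph_compute(sched, graph);

    ggml_backend_sched_free(sched);
    ggml_backend_free(rpc_backend);
}

// Server (separate process): expose local GPU backends over the network.
// `cuda_dev_0` and `cuda_dev_1` are assumed to be ggml_backend_dev_t handles
// obtained elsewhere. Arguments: endpoint, cache_dir, n_threads, n_devices, devices.
ggml_backend_dev_t devices[2] = { cuda_dev_0, cuda_dev_1 };
ggml_backend_rpc_start_server("0.0.0.0:50052", "/tmp/rpc_cache", 4, 2, devices);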
