Implementation:Ggml org Llama cpp RPC Server
| Knowledge Sources | |
|---|---|
| Domains | Distributed, Networking |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
An RPC server that exposes local ggml compute devices (GPUs, CPUs) over TCP so that inference can be distributed across networked machines.
Description
The program parses command-line arguments for host, port, memory limit, and device selection. It detects the available ggml backend devices and, when the user names specific devices, filters them by device name or index. It then creates a cache directory for RPC data and starts the `ggml_rpc_server` on the configured endpoint. The file also includes cross-platform utilities for directory creation and UTF-8 path handling on Windows. It deliberately avoids linking against `libcommon` and duplicates a few utility functions locally to keep dependencies minimal.
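The cache-directory step can be illustrated with a minimal sketch. It assumes C++17 and substitutes `std::filesystem` for the hand-written directory utilities the description mentions, so it is not the code in `rpc-server.cpp`. The helper name `ensure_dir` is hypothetical.

```cpp
#include <filesystem>
#include <system_error>

// Hypothetical helper, not taken from rpc-server.cpp.
// Creates `dir` and any missing parents. Returns true if `dir` exists as a
// directory afterwards, including the case where it was already present.
static bool ensure_dir(const std::filesystem::path & dir) {
    std::error_code ec;
    // The error_code overload does not throw; it reports failure through `ec`.
    std::filesystem::create_directories(dir, ec);
    if (ec) {
        return false;
    }
    // Guards against the path existing as a regular file.
    return std::filesystem::is_directory(dir, ec) && !ec;
}
```

Calling `ensure_dir` twice on the same path succeeds both times, which is convenient for a server that may be restarted against an existing cache. On Windows, a `std::filesystem::path` built from a narrow string uses the native narrow encoding rather than UTF-8, which is one reason a dedicated UTF-8 conversion step may still be needed there.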
Usage
Use this server to enable distributed LLM inference by allowing remote machines to contribute their compute resources (CUDA, Metal, CPU) to a central llama.cpp instance over the network. Run one RPC server per machine with available hardware, then connect from the client using the `--rpc` flag.
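When several RPC servers are running, the client is given a comma-separated list of `host:port` endpoints through `--rpc`. The sketch below shows one way such a list could be split and validated. It is an illustration only: `rpc_endpoint` and `parse_endpoints` are hypothetical names, and the validation rules (non-empty host, port in 1..65535) are assumptions rather than the client's documented behavior.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical type for illustration; not part of ggml or llama.cpp.
struct rpc_endpoint {
    std::string host;
    int         port;
};

// Splits "host:port,host:port,..." into endpoints.
// Returns false on an empty item, a missing host or port, or an invalid port.
static bool parse_endpoints(const std::string & list, std::vector<rpc_endpoint> & out) {
    out.clear();
    size_t start = 0;
    while (start <= list.size()) {
        size_t end = list.find(',', start);
        if (end == std::string::npos) {
            end = list.size();
        }
        const std::string item = list.substr(start, end - start);

        // Use the last ':' so that the port is always the final component.
        const size_t colon = item.rfind(':');
        if (colon == std::string::npos || colon == 0 || colon + 1 == item.size()) {
            return false;
        }
        const std::string port_str = item.substr(colon + 1);
        if (port_str.size() > 5 ||
            port_str.find_first_not_of("0123456789") != std::string::npos) {
            return false;
        }
        const int port = std::stoi(port_str);
        if (port == 0 || port > 65535) {
            return false;
        }
        out.push_back({item.substr(0, colon), port});
        start = end + 1;
    }
    return !out.empty();
}
```

For example, `parse_endpoints("192.168.1.100:50052,192.168.1.101:50052", eps)` fills `eps` with two entries, one per machine running an RPC server.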
Code Reference
Source Location
- Repository: Ggml_org_Llama_cpp
- File: tools/rpc/rpc-server.cpp
- Lines: 1-337
Signature
// Main entry point
int main(int argc, char ** argv);

// Server parameters
struct rpc_server_params {
    std::string              host        = "0.0.0.0";
    int                      port        = 50052;
    size_t                   backend_mem = 0;
    std::vector<std::string> devices;
};
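To show how the parameter struct might be filled from the command line, here is a minimal sketch. The struct repeats the fields listed above so the example compiles on its own. `parse_rpc_args` is a hypothetical function, not the parser in `rpc-server.cpp`, and it stores the `--mem` value as given because the unit is not stated on this page.

```cpp
#include <cstddef>
#include <exception>
#include <string>
#include <vector>

// Same fields and defaults as listed under Signature.
struct rpc_server_params {
    std::string              host        = "0.0.0.0";
    int                      port        = 50052;
    size_t                   backend_mem = 0;
    std::vector<std::string> devices;
};

// Hypothetical parser for illustration only.
// Returns false on an unknown flag, a flag without a value,
// a non-numeric number, or a port outside 1..65535.
static bool parse_rpc_args(int argc, const char * const * argv, rpc_server_params & params) {
    for (int i = 1; i < argc; i++) {
        const std::string arg = argv[i];
        if (i + 1 >= argc) {
            return false; // every flag handled here expects a value
        }
        const std::string value = argv[++i];
        try {
            if (arg == "-H" || arg == "--host") {
                params.host = value;
            } else if (arg == "-p" || arg == "--port") {
                const int port = std::stoi(value);
                if (port <= 0 || port > 65535) {
                    return false;
                }
                params.port = port;
            } else if (arg == "-m" || arg == "--mem") {
                // Assumption: stored as given; no unit conversion is applied.
                params.backend_mem = static_cast<size_t>(std::stoull(value));
            } else if (arg == "-d" || arg == "--dev") {
                params.devices.push_back(value); // may be repeated
            } else {
                return false;
            }
        } catch (const std::exception &) {
            return false; // std::stoi / std::stoull rejected the value
        }
    }
    return true;
}
```

Fields that are not mentioned on the command line keep their defaults, so running with no arguments leaves `devices` empty, which corresponds to exposing all devices.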
Import
#include "ggml-rpc.h"
#include <string>
#include <vector>
#include <thread>
#include <regex>
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| -H, --host | string | No | Host address to bind to (default: 0.0.0.0) |
| -p, --port | int | No | TCP port to listen on (default: 50052) |
| -m, --mem | size_t | No | Maximum backend memory to expose (0 = unlimited) |
| -d, --dev | string | No | Device name or index to expose (can be repeated) |
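The device option accepts either a name or an index, which can be sketched as a small filter over the list of detected devices. This is an illustration under stated assumptions: `select_devices` is a hypothetical helper, `available` stands in for the names reported by the ggml backend registry, names are matched exactly, and indices are taken to be zero-based.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper for illustration only.
// Resolves each selector against `available`, trying an exact name match
// first and a zero-based index second. Returns false if a selector matches
// nothing. An empty selector list selects every device.
static bool select_devices(const std::vector<std::string> & available,
                           const std::vector<std::string> & selectors,
                           std::vector<std::string> &       selected) {
    selected.clear();
    if (selectors.empty()) {
        selected = available;
        return true;
    }
    for (const std::string & sel : selectors) {
        bool found = false;
        for (const std::string & name : available) {
            if (name == sel) {
                selected.push_back(name);
                found = true;
                break;
            }
        }
        if (found) {
            continue;
        }
        // Not a known name: accept it only if it is a short run of digits.
        bool numeric = !sel.empty() && sel.size() <= 9;
        for (unsigned char c : sel) {
            if (!std::isdigit(c)) {
                numeric = false;
                break;
            }
        }
        if (!numeric) {
            return false;
        }
        const size_t idx = std::stoul(sel);
        if (idx >= available.size()) {
            return false;
        }
        selected.push_back(available[idx]);
    }
    return true;
}
```

With `available = {"CUDA0", "CUDA1", "CPU"}`, the selectors `{"CUDA1", "2"}` resolve to `CUDA1` and `CPU`, while an unknown name or an out-of-range index is rejected rather than silently ignored.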
Outputs
| Name | Type | Description |
|---|---|---|
| TCP server | network | Listening TCP server accepting RPC compute requests from llama.cpp clients |
| return code | int | 0 on clean shutdown, non-zero on error |
Usage Examples
# Start RPC server on default port, exposing all devices
./rpc-server
# Start on specific host and port, limit to CUDA device
./rpc-server -H 192.168.1.100 -p 50052 -d CUDA0
# Client-side usage: connect to remote RPC server
./llama-cli -m model.gguf --rpc 192.168.1.100:50052 -ngl 99