Implementation:Ggml org Llama cpp RPC Server

Knowledge Sources	Ggml_org_Llama_cpp
Domains	Distributed, Networking
Last Updated	2026-02-15 00:00 GMT

Overview

RPC server that exposes local ggml compute devices (GPUs, CPUs) over TCP for distributed inference across networked machines.

Description

Parses command-line arguments for host, port, memory limit, and device selection. Detects available ggml backend devices and optionally filters by user-specified device names or indices. Creates a cache directory for RPC data, then starts the `ggml_rpc_server` on the configured endpoint. Includes cross-platform utilities for directory creation and UTF-8 path handling on Windows. Deliberately avoids linking against `libcommon`, duplicating some utility functions locally to minimize dependencies.

Usage

Use this server to enable distributed LLM inference by allowing remote machines to contribute their compute resources (CUDA, Metal, CPU) to a central llama.cpp instance over the network. Run one RPC server per machine with available hardware, then connect from the client using the `--rpc` flag.

Code Reference

Source Location

Repository: Ggml_org_Llama_cpp
File: tools/rpc/rpc-server.cpp
Lines: 1-337

Signature

// Main entry point
int main(int argc, char ** argv);

// Server parameters
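struct rpc_server_params {
    std::string host = "0.0.0.0";        // bind address (-H, --host)
    int port = 50052;                    // TCP listen port (-p, --port)
    size_t backend_mem = 0;              // memory to expose, 0 = unlimited (-m, --mem)
    std::vector<std::string> devices;    // device names or indices (-d, --dev), empty = all
};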

Import

#include "ggml-rpc.h"
#include <string>
#include <vector>
#include <thread>
#include <regex>

I/O Contract

Inputs

Name	Type	Required	Description
-H, --host	string	No	Host address to bind to (default: 0.0.0.0)
-p, --port	int	No	TCP port to listen on (default: 50052)
-m, --mem	size_t	No	Maximum backend memory to expose (0 = unlimited)
-d, --dev	string	No	Device name or index to expose (can be repeated)

Outputs

Name	Type	Description
TCP server	network	Listening TCP server accepting RPC compute requests from llama.cpp clients
return code	int	0 on clean shutdown, non-zero on error

Usage Examples

# Start the RPC server on the default host and port, exposing all devices
./rpc-server

# Bind to a specific host and port and expose only the CUDA0 device
./rpc-server -H 192.168.1.100 -p 50052 -d CUDA0

# Client side: connect to the remote RPC server
./llama-cli -m model.gguf --rpc 192.168.1.100:50052 -ngl 99

Related Pages

Principle:Ggml_org_Llama_cpp_Distributed_Inference
