Principle:LMCache LMCache Disaggregated Proxy Routing

Knowledge Sources	LMCache Splitwise
Domains	Distributed_Systems, Serving
Last Updated	2026-02-09 00:00 GMT

Overview

Disaggregated proxy routing is a request-routing pattern in which a proxy coordinates the prefill-decode pipeline: it forwards each request to a prefiller, waits for a notification that the KV cache transfer has finished, and then streams the decode response back to the client.

Description

The Disaggregated Proxy sits between the client and the prefill/decode instances. It exposes an OpenAI-compatible API and handles each request in five steps: (1) it tokenizes the input, (2) it sends the prefill request to the prefiller together with the decode endpoint information (disagg_spec), (3) it waits for a ZMQ notification that the KV transfer is complete, (4) it appends the first token produced by prefill to the prompt, and (5) it sends the augmented prompt to the decoder and streams the decode response back to the client.
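Steps (2) and (4) above can be illustrated with two small helpers. This is only a sketch: the field names inside disagg_spec (`req_id`, `receiver_host`, `receiver_port`), the `max_tokens` setting, and both function names are assumptions made for illustration, not the actual LMCache request schema.

```python
from typing import Any


def build_prefill_request(token_ids: list[int], request_id: str,
                          decode_host: str, decode_port: int) -> dict[str, Any]:
    """Step 2: build a prefill request that carries the decode endpoint."""
    return {
        "prompt": list(token_ids),
        # Assumption: the prefiller only needs to produce the first token.
        "max_tokens": 1,
        # Hypothetical layout; the real disagg_spec fields may differ.
        "disagg_spec": {
            "req_id": request_id,
            "receiver_host": decode_host,
            "receiver_port": decode_port,
        },
    }


def augment_prompt(token_ids: list[int], first_token: int) -> list[int]:
    """Step 4: append the first token from prefill to the prompt."""
    return [*token_ids, first_token]
```

Both helpers return new objects rather than mutating the caller's token list, so the original prompt stays available if the request has to be retried.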

Usage

Deploy the proxy server before launching the prefill and decode instances. The proxy must be configured with the endpoints of both the prefiller and the decoder instances.
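One way to make the endpoint requirement explicit is a small configuration object that is validated at startup. The class names, hosts, and ports below are hypothetical and are not taken from the LMCache proxy's own options.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoint:
    host: str
    port: int

    def url(self, path: str = "") -> str:
        """Return the HTTP base URL, optionally with a path appended."""
        return f"http://{self.host}:{self.port}{path}"


@dataclass(frozen=True)
class ProxyConfig:
    prefiller: Endpoint
    decoder: Endpoint

    def __post_init__(self) -> None:
        # The two roles must be served by distinct instances.
        if self.prefiller == self.decoder:
            raise ValueError("prefiller and decoder endpoints must differ")


# Example values only; use the addresses of your own instances.
config = ProxyConfig(
    prefiller=Endpoint("localhost", 8100),
    decoder=Endpoint("localhost", 8200),
)
```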

Theoretical Basis

The proxy implements a two-phase pipeline:

Prefill phase: The proxy sends the request to the prefiller with the decode endpoint in disagg_spec. The prefiller computes attention over the prompt and writes the resulting KV cache to the decoder via NIXL.
Decode phase: After receiving the ZMQ notification that the transfer is complete, the proxy sends the augmented prompt (the original prompt plus the first token) to the decoder for autoregressive generation.
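The ordering constraint between the two phases can be sketched with asyncio. In this simulation an `asyncio.Future` per request stands in for the ZMQ notification, and `fake_prefill` and `fake_decode` are placeholders for the real instances. Every name here is hypothetical; the sketch only shows that decoding starts after the transfer notification arrives.

```python
import asyncio


class KVNotifier:
    """Stand-in for the ZMQ channel: one future per request id."""

    def __init__(self) -> None:
        self._pending: dict[str, asyncio.Future] = {}

    def expect(self, req_id: str) -> asyncio.Future:
        fut = asyncio.get_running_loop().create_future()
        self._pending[req_id] = fut
        return fut

    def notify(self, req_id: str) -> None:
        fut = self._pending.pop(req_id, None)
        if fut is not None and not fut.done():
            fut.set_result(True)


async def fake_prefill(req_id: str, prompt: list[int], notifier: KVNotifier) -> int:
    """Pretend prefiller: returns a first token, then signals transfer done."""
    first_token = sum(prompt) % 100
    # Simulate the KV transfer completing shortly after the prefill response.
    asyncio.get_running_loop().call_later(0.01, notifier.notify, req_id)
    return first_token


async def fake_decode(prompt: list[int], max_tokens: int):
    """Pretend decoder: yields tokens one at a time, like a streamed response."""
    for i in range(max_tokens):
        await asyncio.sleep(0)
        yield prompt[-1] + i + 1


async def handle_request(req_id: str, prompt: list[int], notifier: KVNotifier,
                         max_tokens: int = 3, timeout: float = 1.0) -> list[int]:
    # Register interest before sending prefill so the notification cannot be missed.
    transfer_done = notifier.expect(req_id)
    first_token = await fake_prefill(req_id, prompt, notifier)   # prefill phase
    await asyncio.wait_for(transfer_done, timeout)               # wait for KV transfer
    augmented = [*prompt, first_token]                           # prompt + first token
    return [tok async for tok in fake_decode(augmented, max_tokens)]  # decode phase
```

Registering the future before the prefill request is sent avoids a race in which the notification arrives before the proxy has started waiting for it. The timeout keeps a lost notification from blocking the request forever.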

Related Pages

Implemented By

Implementation:LMCache_LMCache_Disagg_Proxy_Server
