Principle:LMCache LMCache Disaggregated Proxy Routing
| Knowledge Sources | |
|---|---|
| Domains | Distributed_Systems, Serving |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
A proxy-based request routing pattern that coordinates the prefill-decode pipeline by forwarding requests to prefillers, receiving KV transfer notifications, and streaming decode responses.
Description
The Disaggregated Proxy sits between the client and the prefill/decode instances. It exposes an OpenAI-compatible API and, for each request, (1) tokenizes the input, (2) sends the prefill request to the prefiller together with the decode endpoint information (disagg_spec), (3) waits for a ZMQ notification that the KV transfer to the decoder is complete, (4) appends the first token produced by prefill to the prompt, and (5) sends the augmented prompt to the decoder and streams the decode response back to the client.
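The five steps above can be sketched as a single routing function. This is a minimal illustration rather than LMCache's actual code: the `DisaggSpec` fields, the `route_request` name, and the four injected callables (`tokenize`, `prefill`, `wait_for_kv`, `decode`) are all hypothetical stand-ins for the tokenizer, the HTTP calls to the prefiller and decoder, and the ZMQ wait.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, List


@dataclass
class DisaggSpec:
    """Hypothetical shape of the decode endpoint info sent with a prefill request."""
    request_id: str
    decoder_host: str
    decoder_port: int


def route_request(
    prompt: str,
    spec: DisaggSpec,
    tokenize: Callable[[str], List[int]],
    prefill: Callable[[List[int], DisaggSpec], int],
    wait_for_kv: Callable[[str], None],
    decode: Callable[[List[int]], Iterable[int]],
) -> Iterator[int]:
    """Run one request through the prefill-decode pipeline (all callables are stubs)."""
    tokens = tokenize(prompt)              # (1) tokenize the input
    first_token = prefill(tokens, spec)    # (2) prefill, telling the prefiller where the decoder is
    wait_for_kv(spec.request_id)           # (3) block until the KV transfer is reported complete
    augmented = tokens + [first_token]     # (4) append the first token to the prompt
    yield from decode(augmented)           # (5) stream the decode output to the caller
```

Passing the backends in as callables keeps the ordering of the steps visible and lets the flow be exercised with stubs; a real proxy would replace them with network clients and would typically be asynchronous.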
Usage
Deploy the proxy server before launching the prefiller and decoder instances. The proxy must be configured with the endpoints of both.
Theoretical Basis
The proxy implements a two-phase pipeline:
- Prefill phase: The proxy sends the request to the prefiller with the decode endpoint in disagg_spec. The prefiller computes attention and writes the KV cache to the decoder via NIXL.
- Decode phase: After receiving the ZMQ notification, the proxy sends the augmented prompt (the original prompt plus the first token) to the decoder for autoregressive generation.
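The hand-off between the two phases is a per-request gate: the decode phase for a request must not start until its transfer notification has arrived. The sketch below illustrates that gate with `asyncio.Event` objects keyed by request id. It is an assumption-laden stand-in, not the proxy's real mechanism: the `KVReadyTracker` class and its methods are hypothetical, and `notify` is called directly where a real proxy would call it from its ZMQ receive loop.

```python
import asyncio
from typing import Dict, List


class KVReadyTracker:
    """Hypothetical gate: one event per request id, set when KV transfer is reported done."""

    def __init__(self) -> None:
        self._events: Dict[str, asyncio.Event] = {}

    def _event(self, request_id: str) -> asyncio.Event:
        # Create the event on first use so a notification may arrive before the waiter.
        return self._events.setdefault(request_id, asyncio.Event())

    def notify(self, request_id: str) -> None:
        """Record a transfer-complete notification (stand-in for a ZMQ message handler)."""
        self._event(request_id).set()

    async def wait(self, request_id: str, timeout: float) -> None:
        """Wait for the notification; raises asyncio.TimeoutError if it never arrives."""
        try:
            await asyncio.wait_for(self._event(request_id).wait(), timeout)
        finally:
            self._events.pop(request_id, None)  # drop per-request state either way


async def demo() -> List[str]:
    tracker = KVReadyTracker()
    log: List[str] = []

    async def decode_phase(request_id: str) -> None:
        await tracker.wait(request_id, timeout=1.0)
        log.append(f"decode {request_id}")

    task = asyncio.create_task(decode_phase("req-1"))
    await asyncio.sleep(0)          # let the decode phase start waiting
    log.append("notify req-1")
    tracker.notify("req-1")         # prefill phase finished, KV is on the decoder
    await task
    return log
```

Running `asyncio.run(demo())` returns the log with the notification entry ahead of the decode entry. The timeout matters in practice: without it, a lost notification would leave the client request hanging indefinitely.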