Principle:LMCache LMCache Disaggregated Proxy Routing

From Leeroopedia


Knowledge Sources
Domains Distributed_Systems, Serving
Last Updated 2026-02-09 00:00 GMT

Overview

A proxy-based request routing pattern that coordinates the prefill-decode pipeline by forwarding requests to prefillers, receiving KV transfer notifications, and streaming decode responses.

Description

The Disaggregated Proxy sits between the client and the prefill and decode instances. It exposes an OpenAI-compatible API and handles each request in five steps: (1) it tokenizes the input, (2) it sends the prefill request to a prefiller, including the decoder's endpoint information (disagg_spec), (3) it waits for a ZMQ notification that the KV transfer is complete, (4) it appends the first token produced by prefill to the prompt, and (5) it sends the augmented prompt to the decoder and streams the decode response back to the client.
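The two request bodies involved in steps (2), (4), and (5) can be sketched as below. This is only an illustration under stated assumptions: the `DisaggSpec` fields, the payload keys, and the choice to cap the prefill request at one token are hypothetical and are not taken from the LMCache source; only the name `disagg_spec` comes from the description above.

```python
from dataclasses import dataclass, asdict
from typing import Dict, List


@dataclass
class DisaggSpec:
    """Hypothetical disagg_spec layout; the real field names may differ."""
    req_id: str          # lets the proxy match the later notification to this request
    decoder_host: str    # where the prefiller should push the KV cache
    decoder_port: int


def build_prefill_request(prompt_ids: List[int], spec: DisaggSpec) -> Dict:
    """Step 2: prefill request carrying the decoder endpoint information."""
    return {
        "prompt": list(prompt_ids),
        "max_tokens": 1,              # assumption: prefill only needs to yield the first token
        "stream": False,
        "disagg_spec": asdict(spec),  # assumption: spec travels inside the request body
    }


def build_decode_request(prompt_ids: List[int], first_token: int, max_tokens: int) -> Dict:
    """Steps 4 and 5: append the prefill's first token, then request a streamed decode."""
    return {
        "prompt": list(prompt_ids) + [first_token],
        "max_tokens": max_tokens,
        "stream": True,
    }
```

Keeping both builders as pure functions makes the augmentation in step (4) easy to check in isolation, independent of any HTTP or ZMQ plumbing.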

Usage

Deploy the proxy server before launching the prefill and decode instances. The proxy must be configured with the endpoints of both the prefiller and the decoder instances.

Theoretical Basis

The proxy implements a two-phase pipeline:

  1. Prefill phase: The proxy sends the request to the prefiller with the decoder's endpoint in disagg_spec. The prefiller computes attention over the prompt and writes the resulting KV cache to the decoder via NIXL.
  2. Decode phase: After receiving the ZMQ notification, the proxy sends the augmented prompt (the original prompt plus the first token) to the decoder for autoregressive generation.
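The ordering constraint between the two phases can be illustrated with a small, self-contained simulation. It is a sketch, not the proxy's actual code: an `asyncio.Event` stands in for the ZMQ notification, and `fake_prefiller`, `fake_decoder`, `TransferNotifier`, and the toy token arithmetic are all hypothetical stand-ins for the real instances and the NIXL transfer.

```python
import asyncio


class TransferNotifier:
    """Tracks one event per request ID; stands in for the ZMQ notification listener."""

    def __init__(self):
        self._events = {}

    def expect(self, req_id):
        self._events[req_id] = asyncio.Event()

    def notify(self, req_id):
        event = self._events.get(req_id)
        if event is not None:  # ignore notifications for unknown or expired requests
            event.set()

    async def wait(self, req_id, timeout):
        try:
            await asyncio.wait_for(self._events[req_id].wait(), timeout)
        finally:
            self._events.pop(req_id, None)


async def fake_prefiller(req_id, prompt_ids, notifier):
    # Simulated prefill: the "KV transfer" finishes shortly after the response returns.
    asyncio.get_running_loop().call_later(0.01, notifier.notify, req_id)
    return sum(prompt_ids) % 100  # toy stand-in for the first generated token


async def fake_decoder(prompt_ids, max_tokens):
    # Simulated autoregressive decode: each token depends on the last prompt token.
    last = prompt_ids[-1]
    for i in range(max_tokens):
        await asyncio.sleep(0)
        yield last + i + 1


async def handle_request(req_id, prompt_ids, max_tokens, notifier, timeout=1.0):
    # Register interest before prefill starts so the notification cannot be missed.
    notifier.expect(req_id)
    first_token = await fake_prefiller(req_id, prompt_ids, notifier)   # prefill phase
    await notifier.wait(req_id, timeout)        # block until the KV transfer is reported
    augmented = list(prompt_ids) + [first_token]
    async for token in fake_decoder(augmented, max_tokens):            # decode phase
        yield token


async def run_once(req_id, prompt_ids, max_tokens):
    notifier = TransferNotifier()
    return [tok async for tok in handle_request(req_id, prompt_ids, max_tokens, notifier)]
```

Calling `asyncio.run(run_once("r1", [1, 2, 3], 3))` drives one request through both phases. The point of the sketch is the gate between them: the decode request is not issued until the transfer notification arrives, and a missing notification surfaces as a timeout rather than a decode against an incomplete KV cache.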

Related Pages

Implemented By
