
Principle:LMCache Decoder Instance Launch

From Leeroopedia


Knowledge Sources
Domains Serving, Distributed_Systems
Last Updated 2026-02-09 00:00 GMT

Overview

A deployment pattern for launching vLLM decoder instances configured as KV cache consumers in a disaggregated prefill-decode architecture.

Description

In disaggregated inference, decoder instances run the autoregressive decode phase. They are launched as standard vLLM serving instances with kv_role="kv_consumer" and with the LMCache connector configured to receive KV caches from prefillers via NIXL. The decoder's PDBackend operates in "receiver" mode, listening on its init and alloc ports for incoming NIXL connections and memory-allocation requests, respectively.
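As a sketch of what such a launch looks like, the snippet below assembles a decoder launch command. The kv_role="kv_consumer" setting and LMCACHE_CONFIG_FILE variable come from this page; the model name, port, and config filename are illustrative assumptions.

```python
import json
import os

def build_decoder_command(model: str, port: int) -> list[str]:
    # kv_role="kv_consumer" marks this vLLM instance as the KV-cache receiver
    # in the disaggregated prefill-decode setup.
    kv_transfer_config = {
        "kv_connector": "LMCacheConnectorV1",
        "kv_role": "kv_consumer",
    }
    return [
        "vllm", "serve", model,
        "--port", str(port),
        "--kv-transfer-config", json.dumps(kv_transfer_config),
    ]

# LMCACHE_CONFIG_FILE must point at the decoder-specific config (see Usage).
env = dict(os.environ, LMCACHE_CONFIG_FILE="decoder-config.yaml")
cmd = build_decoder_command("meta-llama/Llama-3.1-8B-Instruct", 8200)
print(" ".join(cmd))
```

The command would then be started with this environment, e.g. via `subprocess.Popen(cmd, env=env)`.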

Usage

Launch decoder instances after the proxy server is running and before the prefillers. Each decoder's LMCACHE_CONFIG_FILE environment variable must point to a decoder-specific config (pd_role="receiver").
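A minimal sketch of writing such a decoder-specific config file follows. Only pd_role="receiver" is stated on this page; the port and buffer-size key names are assumed for illustration and should be checked against the LMCache configuration reference.

```python
# Hypothetical decoder-side LMCache config. Only pd_role is taken from this
# page; the remaining keys are assumed names, not a confirmed schema.
decoder_config = {
    "pd_role": "receiver",       # from the page: decoder receives KV caches
    "pd_peer_init_port": 7300,   # assumed: port for the NIXL handshake
    "pd_peer_alloc_port": 7400,  # assumed: port for AllocRequest messages
    "pd_buffer_size": 2**30,     # assumed: staging buffer for incoming KV
}

# Emit simple key: value YAML without external dependencies.
with open("decoder-config.yaml", "w") as f:
    for key, value in decoder_config.items():
        f.write(f"{key}: {value}\n")
```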

Theoretical Basis

The decoder receives KV cache via a three-step protocol:

  1. NIXL handshake: Prefiller connects to decoder's init port, exchanges buffer metadata
  2. Allocation request: Prefiller sends AllocRequest via ZMQ to decoder's alloc port; decoder allocates buffer space and responds with remote memory indices
  3. NIXL write: Prefiller writes KV data directly into decoder's GPU/CPU buffer via RDMA
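The decoder side of step 2 can be modeled in miniature as below. This is a toy in-process allocator, not the LMCache wire format: the AllocRequest field names, block granularity, and DecoderAllocator class are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class AllocRequest:
    num_blocks: int  # blocks of KV cache the prefiller wants to write

class DecoderAllocator:
    """Toy model of the decoder's alloc-port handler: reserve buffer blocks
    and return their indices so the prefiller knows where to RDMA-write."""

    def __init__(self, total_blocks: int):
        self.free = list(range(total_blocks))

    def handle(self, req: AllocRequest) -> list[int]:
        if req.num_blocks > len(self.free):
            raise MemoryError("decoder buffer exhausted")
        # Reserve the first num_blocks free slots and return their indices,
        # standing in for the remote memory indices sent back over ZMQ.
        granted, self.free = (self.free[:req.num_blocks],
                              self.free[req.num_blocks:])
        return granted

alloc = DecoderAllocator(total_blocks=8)
print(alloc.handle(AllocRequest(num_blocks=3)))  # -> [0, 1, 2]
```

In the real system the returned indices describe regions of the decoder's registered GPU/CPU buffer, into which the prefiller then performs the step-3 NIXL write.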

Related Pages

Implemented By
