Principle:LMCache LMCache Decoder Instance Launch

Knowledge Sources	LMCache vLLM
Domains	Serving, Distributed_Systems
Last Updated	2026-02-09 00:00 GMT

Overview

A deployment pattern for launching vLLM decoder instances configured as KV cache consumers in a disaggregated prefill-decode architecture.

Description

In disaggregated inference, decoder instances run the autoregressive decode phase. They are launched as standard vLLM serving instances with kv_role="kv_consumer" and the LMCache connector configured to receive KV caches from prefillers via NIXL. The decoder's PDBackend operates in "receiver" mode, listening on init and alloc ports for incoming NIXL connections and memory allocation requests.
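
The launch described above can be sketched as a small Python helper that assembles the environment and command line for a decoder process. This is a minimal sketch, not the project's launch script: the helper name build_decoder_launch, the model name, the port, and the config file name are hypothetical, and the connector name "LMCacheConnectorV1" is an assumption that should be checked against the installed vLLM and LMCache versions.

```python
import json
import os
import shlex


def build_decoder_launch(model, port, lmcache_config_path):
    """Return (env, argv) for a decoder-side `vllm serve` process."""
    # KV transfer settings: this instance only consumes KV caches.
    kv_transfer_config = {
        "kv_connector": "LMCacheConnectorV1",  # assumed connector name
        "kv_role": "kv_consumer",
    }
    env = dict(os.environ)
    # Decoder-specific LMCache config, expected to set pd_role to "receiver".
    env["LMCACHE_CONFIG_FILE"] = lmcache_config_path
    argv = [
        "vllm", "serve", model,
        "--port", str(port),
        "--kv-transfer-config", json.dumps(kv_transfer_config),
    ]
    return env, argv


if __name__ == "__main__":
    env, argv = build_decoder_launch(
        model="example-org/example-model",                  # hypothetical
        port=8200,                                          # hypothetical
        lmcache_config_path="lmcache-decoder-config.yaml",  # hypothetical
    )
    # Print the equivalent shell command instead of starting the server.
    print("LMCACHE_CONFIG_FILE=" + env["LMCACHE_CONFIG_FILE"], shlex.join(argv))
```

The helper only builds the command; passing env and argv to subprocess.Popen would start the server on a machine where vLLM and LMCache are installed.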

Usage

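Launch decoder instances after the proxy server is running and before the prefillers, so that each decoder's init and alloc ports are already listening when a prefiller tries to connect. The decoder's LMCACHE_CONFIG_FILE environment variable must point to a decoder-specific config file that sets pd_role="receiver".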
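
One way to enforce this ordering in a launch script is to wait until the decoder's ports accept connections before starting any prefiller. The sketch below assumes the init and alloc ports are reachable as plain TCP listeners; the function names and the port numbers in the usage note are hypothetical. An open port only shows that the listener exists, not that the model has finished loading.

```python
import socket
import time


def wait_for_port(host, port, timeout_s=60.0, interval_s=0.5):
    """Poll until a TCP connection to (host, port) succeeds or time runs out."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            time.sleep(interval_s)
    return False


def decoder_ready(host, ports, timeout_s=60.0):
    """Return True only if every listed decoder port becomes reachable."""
    return all(wait_for_port(host, p, timeout_s) for p in ports)
```

A launch script could call decoder_ready("localhost", [7300, 7400]) with the init and alloc ports from its own decoder config, and start the prefillers only when it returns True.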

Theoretical Basis

The decoder receives KV cache via a three-step protocol:

1. NIXL handshake: The prefiller connects to the decoder's init port and exchanges buffer metadata.
2. Allocation request: The prefiller sends an AllocRequest via ZMQ to the decoder's alloc port; the decoder allocates buffer space and responds with remote memory indices.
3. NIXL write: The prefiller writes KV data directly into the decoder's GPU/CPU buffer via RDMA.
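
The allocation and write steps can be illustrated with a toy in-process model. This is a sketch under stated assumptions, not LMCache's implementation: the fields of AllocRequest, the AllocResponse and ToyDecoderBuffer classes, and the slot-based allocator are all hypothetical, the handshake step is omitted, and a plain Python list stands in for the decoder's RDMA-registered buffer.

```python
from dataclasses import dataclass, field


@dataclass
class AllocRequest:
    # Hypothetical fields; the real message layout is not given in this page.
    request_id: str
    num_chunks: int


@dataclass
class AllocResponse:
    request_id: str
    remote_indices: list


@dataclass
class ToyDecoderBuffer:
    """Stand-in for the decoder's receive buffer, split into fixed slots."""
    num_slots: int
    slots: list = field(init=False)
    free: list = field(init=False)

    def __post_init__(self):
        self.slots = [None] * self.num_slots
        self.free = list(range(self.num_slots))

    def handle_alloc(self, req):
        # Step 2: reserve slots and tell the sender which indices to use.
        if req.num_chunks > len(self.free):
            raise MemoryError("not enough free slots")
        indices = [self.free.pop(0) for _ in range(req.num_chunks)]
        return AllocResponse(req.request_id, indices)

    def remote_write(self, index, payload):
        # Step 3: stand-in for a one-sided NIXL/RDMA write into a slot.
        self.slots[index] = payload


def transfer(decoder, request_id, chunks):
    """Prefiller side: request space, then write each chunk to its slot."""
    resp = decoder.handle_alloc(AllocRequest(request_id, len(chunks)))
    for idx, chunk in zip(resp.remote_indices, chunks):
        decoder.remote_write(idx, chunk)
    return resp.remote_indices
```

The point of the split is that the decoder decides where data lands (allocation) while the prefiller moves the bytes (write), so the decoder does not need to copy incoming data after it arrives.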

Related Pages

Implemented By

Implementation:LMCache_LMCache_VLLM_Serve_Decoder
