Principle:LMCache LMCache Decoder Instance Launch
| Knowledge Sources | |
|---|---|
| Domains | Serving, Distributed_Systems |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
A deployment pattern for launching vLLM decoder instances configured as KV cache consumers in a disaggregated prefill-decode architecture.
Description
In disaggregated inference, decoder instances run the autoregressive decode phase. They are launched as standard vLLM serving instances with kv_role="kv_consumer" and the LMCache connector configured to receive KV caches from prefillers via NIXL. The decoder's PDBackend operates in "receiver" mode, listening on init and alloc ports for incoming NIXL connections and memory allocation requests.
Usage
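Launch decoder instances after the proxy server is running and before the prefillers, so that each decoder is already listening on its init and alloc ports when a prefiller first tries to connect. Set the decoder's LMCACHE_CONFIG_FILE environment variable to a decoder-specific config file with pd_role="receiver".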
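As a rough illustration of the launch step, the sketch below assembles the command line and environment for one decoder without starting it. The kv_role value and the LMCACHE_CONFIG_FILE variable come from the text above. The connector name "LMCacheConnectorV1", the model name, the port, and the config path are assumptions chosen for the example, so check them against your installed vLLM and LMCache versions.

```python
import json
import os


def build_decoder_launch(model, port, lmcache_config_path):
    """Return (argv, env) for one decoder instance; nothing is executed here."""
    # kv_role is taken from the description above.
    # The connector name is an assumption, not confirmed by this page.
    kv_transfer_config = {
        "kv_connector": "LMCacheConnectorV1",
        "kv_role": "kv_consumer",
    }
    argv = [
        "vllm", "serve", model,
        "--port", str(port),
        "--kv-transfer-config", json.dumps(kv_transfer_config),
    ]
    # Copy the current environment and point LMCache at the decoder config.
    env = dict(os.environ)
    env["LMCACHE_CONFIG_FILE"] = lmcache_config_path
    return argv, env


if __name__ == "__main__":
    argv, env = build_decoder_launch(
        model="example-org/example-model",           # hypothetical model name
        port=7200,                                   # hypothetical serving port
        lmcache_config_path="configs/decoder.yaml",  # hypothetical path
    )
    print(" ".join(argv))
    print("LMCACHE_CONFIG_FILE=" + env["LMCACHE_CONFIG_FILE"])
```

The returned pair can be passed to `subprocess.Popen(argv, env=env)` once the proxy server is up. Keeping command construction separate from process launch makes the configuration easy to inspect before anything starts.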
Theoretical Basis
The decoder receives the KV cache via a three-step protocol:
- NIXL handshake: The prefiller connects to the decoder's init port and exchanges buffer metadata with it.
- Allocation request: The prefiller sends an AllocRequest via ZMQ to the decoder's alloc port; the decoder allocates buffer space and responds with the remote memory indices.
- NIXL write: The prefiller writes the KV data directly into the decoder's GPU/CPU buffer via RDMA.
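The ordering of the three steps can be sketched with a toy in-memory model. This is not the LMCache implementation and uses neither NIXL nor ZMQ. `ToyDecoder`, `send_kv`, and the slot-based buffer are hypothetical names invented for illustration, and the RDMA write is replaced by a plain list assignment.

```python
class ToyDecoder:
    """In-memory stand-in for a receiver; no real NIXL or ZMQ is involved."""

    def __init__(self, num_slots):
        self.num_slots = num_slots
        self.buffer = [None] * num_slots      # stands in for the GPU/CPU buffer
        self.free_slots = list(range(num_slots))
        self.peers = set()

    def handshake(self, prefiller_id):
        # Step 1: register the peer and hand back buffer metadata.
        self.peers.add(prefiller_id)
        return {"num_slots": self.num_slots}

    def alloc(self, prefiller_id, num_chunks):
        # Step 2: reserve slots and reply with their indices.
        if prefiller_id not in self.peers:
            raise RuntimeError("allocation requested before handshake")
        if num_chunks > len(self.free_slots):
            raise MemoryError("receiver buffer is full")
        indices = self.free_slots[:num_chunks]
        del self.free_slots[:num_chunks]
        return indices


def send_kv(decoder, prefiller_id, chunks):
    """Run the three steps from the prefiller's side."""
    decoder.handshake(prefiller_id)
    indices = decoder.alloc(prefiller_id, len(chunks))
    # Step 3: the real system performs an RDMA write; here it is a list store.
    for index, chunk in zip(indices, chunks):
        decoder.buffer[index] = chunk
    return indices


if __name__ == "__main__":
    decoder = ToyDecoder(num_slots=4)
    print(send_kv(decoder, "prefiller-0", ["kv-chunk-a", "kv-chunk-b"]))
```

The point of the model is that the decoder decides where the data lands (step 2) before the prefiller writes anything (step 3), which is what allows the write itself to proceed without further involvement from the receiver.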