Principle:LMCache LMCache Prefiller Instance Launch
| Knowledge Sources | |
|---|---|
| Domains | Serving, Distributed_Systems |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
A deployment pattern for launching vLLM prefiller instances configured as KV cache producers in a disaggregated prefill-decode architecture.
Description
Prefiller instances run the prefill phase (full attention computation over the prompt) and transfer the resulting KV cache to decoder instances. They are launched with kv_role="kv_producer" and with the LMCache connector in sender mode. Once prefill completes, the PDBackend sends the KV data to the decoder over NIXL and then notifies the proxy over ZMQ that the transfer is done.
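The producer role is passed to vLLM as a JSON string on the command line. The following is a minimal sketch of how such a command line might be assembled, not a definitive launch script. It assumes the connector is registered under the name `LMCacheConnectorV1` and that the role is supplied through `--kv-transfer-config`; the helper name `build_prefiller_argv`, the model name, and the port are hypothetical placeholders.

```python
import json
import shlex


def build_prefiller_argv(model: str, port: int) -> list[str]:
    """Return an argv list for a vLLM prefiller (KV producer) instance."""
    kv_transfer_config = {
        # Assumption: connector name as registered in vLLM.
        "kv_connector": "LMCacheConnectorV1",
        # From the description above: the prefiller is the KV producer.
        "kv_role": "kv_producer",
    }
    return [
        "vllm", "serve", model,
        "--port", str(port),
        "--kv-transfer-config", json.dumps(kv_transfer_config),
    ]


if __name__ == "__main__":
    # Hypothetical model name and port, for illustration only.
    argv = build_prefiller_argv("example-org/example-model", 8100)
    print(shlex.join(argv))
```

Building the argument list in code rather than as a shell string avoids quoting problems with the embedded JSON; the list can be handed directly to `subprocess.Popen`.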
Usage
Launch prefiller instances only after both the proxy server and the decoder instances are running. For each prefiller process, set the LMCACHE_CONFIG_FILE environment variable to the path of a prefiller-specific config file that sets pd_role="sender".
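One way to enforce this ordering in a launch script is to wait until the proxy and decoder accept TCP connections before preparing the prefiller's environment. The sketch below assumes both dependencies expose a TCP port that can be probed (for example their HTTP ports); the function names, hosts, ports, and config path are hypothetical, and only `LMCACHE_CONFIG_FILE` comes from the text above.

```python
import os
import socket
import time


def wait_for_port(host: str, port: int, timeout_s: float = 60.0) -> bool:
    """Poll until a TCP connection to (host, port) succeeds or the timeout elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=1.0):
                return True
        except OSError:
            time.sleep(0.2)  # not up yet; retry shortly
    return False


def prefiller_env(
    config_path: str,
    dependencies: list[tuple[str, int]],
    timeout_s: float = 60.0,
) -> dict[str, str]:
    """Return the prefiller's environment once every dependency is reachable."""
    for host, port in dependencies:
        if not wait_for_port(host, port, timeout_s):
            raise RuntimeError(
                f"{host}:{port} is not reachable; start it before the prefiller"
            )
    env = dict(os.environ)
    # Points LMCache at the prefiller-specific config (pd_role="sender").
    env["LMCACHE_CONFIG_FILE"] = config_path
    return env
```

The returned dictionary can be passed as the `env` argument of `subprocess.Popen` when starting the prefiller, so a missing proxy or decoder fails fast instead of surfacing later as a transfer error.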
Theoretical Basis
For each request, the prefiller performs the following compute-and-transfer cycle, in order:
- Receive a request from the proxy; its disagg_spec contains the decoder endpoint
- Run full attention prefill on the prompt
- Connect to decoder's NIXL endpoint (if not already connected)
- Send AllocRequest to decoder, receive AllocResponse with buffer indices
- Write KV cache chunks to decoder via NIXL batched_write
- Send ProxyNotif via ZMQ to signal transfer completion
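The cycle above can be illustrated with a small in-memory simulation. This is only a sketch of the control flow: NIXL and ZMQ are replaced by plain Python objects, prefill is replaced by splitting the prompt into byte chunks, and every class name, field, and method signature here (including the fields of `AllocRequest`, `AllocResponse`, and `ProxyNotif`) is a hypothetical stand-in rather than LMCache's actual API.

```python
from dataclasses import dataclass, field


@dataclass
class AllocRequest:
    request_id: str
    num_chunks: int


@dataclass
class AllocResponse:
    buffer_indices: list[int]


@dataclass
class ProxyNotif:
    request_id: str


@dataclass
class FakeDecoder:
    """Stands in for the decoder's receive buffer (no real NIXL involved)."""
    buffer: dict[int, bytes] = field(default_factory=dict)
    next_index: int = 0

    def alloc(self, req: AllocRequest) -> AllocResponse:
        # Reserve one buffer slot per incoming chunk.
        start = self.next_index
        self.next_index += req.num_chunks
        return AllocResponse(list(range(start, self.next_index)))

    def batched_write(self, indices: list[int], chunks: list[bytes]) -> None:
        for index, chunk in zip(indices, chunks):
            self.buffer[index] = chunk


class FakePrefiller:
    def __init__(self, decoders: dict[str, FakeDecoder], proxy_inbox: list[ProxyNotif]):
        self.decoders = decoders        # endpoint -> decoder (the simulated network)
        self.connected: set[str] = set()
        self.proxy_inbox = proxy_inbox  # stands in for the ZMQ channel to the proxy

    def prefill(self, prompt: str, chunk_size: int = 4) -> list[bytes]:
        # Placeholder for attention: the "KV cache" is just the chunked prompt bytes.
        data = prompt.encode()
        return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

    def handle(self, request_id: str, prompt: str, decoder_endpoint: str) -> None:
        chunks = self.prefill(prompt)                   # run prefill
        if decoder_endpoint not in self.connected:      # connect once per decoder
            self.connected.add(decoder_endpoint)
        decoder = self.decoders[decoder_endpoint]
        response = decoder.alloc(AllocRequest(request_id, len(chunks)))
        decoder.batched_write(response.buffer_indices, chunks)
        self.proxy_inbox.append(ProxyNotif(request_id))  # signal completion


if __name__ == "__main__":
    decoder, inbox = FakeDecoder(), []
    prefiller = FakePrefiller({"decoder-0": decoder}, inbox)
    prefiller.handle("req-1", "hello world", "decoder-0")
    print(decoder.buffer, inbox)
```

The notification is sent only after the write returns, which mirrors the ordering in the list: the proxy should not route the request to the decoder until the KV chunks are in place.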