Workflow:LMCache LMCache Disaggregated Prefill

From Leeroopedia


Knowledge Sources
Domains: LLM_Serving, Distributed_Inference, KV_Cache
Last Updated: 2026-02-09 00:00 GMT

Overview

End-to-end process for deploying a disaggregated prefill/decode architecture using LMCache with NIXL, separating the compute-intensive prefill phase from the latency-sensitive decode phase across dedicated GPU pools.

Description

This workflow implements a disaggregated prefill architecture in which the prefill (prompt processing) and decode (token generation) phases run on separate vLLM instances connected by high-performance NIXL (NVIDIA Inference Xfer Library) transfers. A proxy server routes incoming requests to prefiller instances, which compute the KV cache for the prompt. Once prefill completes, the KV cache is transferred to a decoder instance via NIXL, and the decoder generates the output tokens. This separation allows prefill and decode resources to scale independently, so GPU utilization can be tuned separately for the compute-bound prefill phase and the memory-bound decode phase. The architecture supports 1-prefiller-1-decoder (1p1d) configurations for basic deployments and scales to xPyD (multiple prefillers and decoders) for production workloads.

Usage

Execute this workflow when you need to serve LLMs at scale with long prompts and want to optimize prefill throughput and decode latency independently. It is best suited to production deployments where prefill is the bottleneck (e.g., long-context RAG, document summarization) and prefill capacity must scale independently of decode capacity. The workflow requires at least 2 GPUs (one for prefill, one for decode) and a working NIXL installation.

Execution Steps

Step 1: Validate Prerequisites

Verify that the deployment environment meets all requirements: sufficient GPU count (minimum 2 for 1p1d, 4+ for xPyD), a Hugging Face token for model access, and the required Python libraries (lmcache, nixl, vllm, pandas, datasets). Check that NIXL is installed with RDMA or TCP transport support for KV cache transfers between instances.

Key considerations:

  • NIXL requires specific network configurations for RDMA transport
  • TCP transport is available as a fallback but with reduced performance
  • Each prefiller and decoder requires its own GPU
  • Tensor parallelism configurations require additional GPUs per instance
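
The checks above can be sketched as a small preflight script. This is a minimal sketch, not part of LMCache or vLLM: the function names are hypothetical, the import names are assumed to match the package names listed in this step, and GPU counting assumes `nvidia-smi` is on the PATH. It does not verify the NIXL transport or the Hugging Face token.

```python
import importlib.util
import shutil
import subprocess

# Assumption: each library's import name matches the package name listed above.
REQUIRED_MODULES = ["lmcache", "nixl", "vllm", "pandas", "datasets"]


def missing_modules(names):
    """Return the module names that cannot be found on the import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def visible_gpu_count():
    """Count GPUs listed by `nvidia-smi -L`; return 0 if the tool is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return 0
    result = subprocess.run(["nvidia-smi", "-L"], capture_output=True, text=True)
    if result.returncode != 0:
        return 0
    return sum(1 for line in result.stdout.splitlines() if line.startswith("GPU "))


def check_prerequisites(num_prefillers=1, num_decoders=1, tensor_parallel_size=1):
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        problems.append("missing Python modules: " + ", ".join(missing))
    # Each instance needs its own GPU, multiplied by the tensor-parallel degree.
    needed = (num_prefillers + num_decoders) * tensor_parallel_size
    found = visible_gpu_count()
    if found < needed:
        problems.append(f"need {needed} GPUs, found {found}")
    return problems


if __name__ == "__main__":
    for problem in check_prerequisites():
        print("PREREQUISITE FAILED:", problem)
```

For a 2-prefiller, 2-decoder deployment with tensor parallelism of 2, `check_prerequisites(2, 2, 2)` would report a problem unless 8 GPUs are visible.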

Step 2: Configure LMCache for Prefill and Decode Roles

Create separate YAML configuration files for the prefiller and decoder instances. The prefiller config specifies the NIXL storage backend with the "kv_producer" role and its NIXL port bindings. The decoder config specifies the "kv_consumer" role with the corresponding NIXL ports. Both configs share the same chunk size and model settings but differ in their cache transfer roles and network endpoints.

Key considerations:

  • Prefillers produce KV caches; decoders consume them
  • NIXL ports must not conflict between instances on the same host
  • ZMQ ports are used for coordination signals between proxy and prefillers
  • Configuration supports multi-host deployments via host/port CSV arguments
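
Port conflicts are easy to introduce when several instances share a host, so it can help to validate the planned layout before writing any YAML. The sketch below is illustrative only: `InstanceSpec`, `parse_csv`, and `find_port_conflicts` are hypothetical helpers, and the port labels are placeholders, not LMCache configuration keys.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class InstanceSpec:
    """Hypothetical description of one prefiller or decoder instance."""
    name: str
    host: str
    role: str  # "kv_producer" for prefillers, "kv_consumer" for decoders
    ports: dict = field(default_factory=dict)  # placeholder label -> port number


def parse_csv(hosts_csv, ports_csv):
    """Pair comma-separated host and port lists, e.g. "a,b" with "7100,7200"."""
    hosts = [h.strip() for h in hosts_csv.split(",") if h.strip()]
    ports = [int(p) for p in ports_csv.split(",") if p.strip()]
    if len(hosts) != len(ports):
        raise ValueError("host and port lists must have the same length")
    return list(zip(hosts, ports))


def find_port_conflicts(instances):
    """Return (host, port, users) for every port claimed more than once on a host."""
    claims = defaultdict(list)
    for inst in instances:
        for label, port in inst.ports.items():
            claims[(inst.host, port)].append(f"{inst.name}.{label}")
    return [
        (host, port, users)
        for (host, port), users in sorted(claims.items())
        if len(users) > 1
    ]
```

Two instances on the same host that both claim port 7300 are reported as one conflict; the same port on different hosts is not.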

Step 3: Launch the Proxy Server

Start the disaggregated prefill proxy server, which acts as the API gateway. The proxy exposes OpenAI-compatible endpoints and coordinates the prefill-decode pipeline: tokenizing incoming prompts, routing to prefillers via round-robin, waiting for KV transfer completion via ZMQ notifications, and forwarding the decode request to a decoder. Configure prefiller and decoder host/port lists, ZMQ coordination ports, and NIXL transfer ports.

Key considerations:

  • The proxy handles the complete request lifecycle across prefill and decode phases
  • Round-robin distribution spreads requests evenly across the prefiller and decoder pools
  • Session-bound routing is available for multi-turn conversations via custom headers
  • The proxy streams the final response back to the client
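
The routing behavior described above, round-robin selection with optional session pinning, can be illustrated with a small class. This is a sketch of the general technique, not the proxy's actual code: `RoundRobinRouter` and its `session_id` argument are hypothetical names, and how the real proxy extracts a session from request headers is not shown.

```python
import itertools
import threading


class RoundRobinRouter:
    """Cycle through a fixed pool of endpoints, optionally pinning sessions."""

    def __init__(self, endpoints):
        endpoints = list(endpoints)
        if not endpoints:
            raise ValueError("endpoint pool must not be empty")
        self._cycle = itertools.cycle(endpoints)
        self._sessions = {}  # session id -> endpoint chosen on first use
        self._lock = threading.Lock()

    def pick(self, session_id=None):
        """Return the next endpoint, or the pinned one for a known session."""
        with self._lock:
            if session_id is not None and session_id in self._sessions:
                return self._sessions[session_id]
            endpoint = next(self._cycle)
            if session_id is not None:
                self._sessions[session_id] = endpoint
            return endpoint
```

A proxy would typically keep one router per pool, for example `RoundRobinRouter([("localhost", 8100), ("localhost", 8101)])` for prefillers and another for decoders. Pinning a session to one endpoint trades some balance for locality across the turns of a conversation.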

Step 4: Launch Decoder Instances

Start the vLLM decoder instances with LMCache configuration pointing to the decoder YAML. Each decoder runs on its own GPU and waits for KV caches to be transferred from prefillers. The decoder handles token generation once it receives the prefilled KV cache via NIXL.

Key considerations:

  • Decoders should be started before prefillers to ensure they are ready to receive KV caches
  • Each decoder registers its NIXL endpoint for KV cache reception
  • Multiple decoders can be launched for horizontal scaling of decode throughput
  • Health checking polls each server until it responds on its designated port
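
Health checking of this kind can be approximated by polling the port until a TCP connection succeeds. The function below is a minimal sketch with a hypothetical name and default timeouts chosen for illustration. An open port only shows that the server process is listening, so an HTTP-level probe would be a stricter readiness check.

```python
import socket
import time


def wait_for_port(host, port, timeout_s=300.0, interval_s=1.0):
    """Poll until a TCP connection to host:port succeeds or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval_s)
```

An orchestration script could call `wait_for_port("localhost", port)` for each decoder before launching the prefillers, and abort the launch if any call returns `False`.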

Step 5: Launch Prefiller Instances

Start the vLLM prefiller instances with LMCache configuration pointing to the prefiller YAML. Each prefiller processes incoming prompts, computes KV caches, and transfers them to the designated decoder via NIXL. After transferring, it sends a ZMQ notification to the proxy signaling KV readiness.

Key considerations:

  • Prefillers operate with max_tokens=1 since they only handle the prefill phase
  • The disagg_spec metadata in each request tells the prefiller which decoder to send KV caches to
  • Multiple prefillers can be launched for horizontal scaling of prefill throughput
  • Logs are captured to separate files for debugging each component
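
To make the first two points concrete, the sketch below derives a prefill-phase request from a client's original request. The field names inside `disagg_spec` (`req_id`, `receiver_host`, `receiver_port`) and its placement at the top level of the request are assumptions made for illustration; consult the proxy source for the actual layout.

```python
import copy


def build_prefill_request(client_request, decoder_host, decoder_port, request_id):
    """Return a copy of the client request adapted for the prefill phase."""
    prefill = copy.deepcopy(client_request)  # leave the caller's request untouched
    prefill["max_tokens"] = 1  # prefill only: the decoder generates the real output
    # Hypothetical layout: the real disagg_spec field names may differ.
    prefill["disagg_spec"] = {
        "req_id": request_id,
        "receiver_host": decoder_host,
        "receiver_port": decoder_port,
    }
    return prefill
```

Copying the request instead of mutating it lets the proxy forward the original, with its full `max_tokens`, to the decoder once the KV transfer is complete.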

Step 6: Send Requests and Verify Operation

Send inference requests to the proxy server's OpenAI-compatible API endpoint. The proxy orchestrates the full pipeline: prefill on a prefiller instance, KV transfer via NIXL, and decode on a decoder instance. Verify operation by checking time-to-first-token (TTFT) metrics and throughput logs. The proxy's StatsCalculator tracks and reports percentile latencies.

Key considerations:

  • Initial requests establish the NIXL connection and may have higher latency
  • Subsequent requests benefit from warmed connections and potential KV cache reuse
  • Benchmark tools can sweep context sizes to measure TTFT scaling
  • Session affinity headers enable multi-turn conversation support
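
When comparing runs, it helps to summarize collected TTFT samples the same way each time. The sketch below uses the nearest-rank percentile definition; it is not the proxy's StatsCalculator, whose exact method is not described here, and the function names are hypothetical.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers (pct in 0..100)."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Nearest-rank: the smallest value with at least pct% of samples at or below it.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize_ttft(ttft_seconds):
    """Return count, mean, and common percentiles for a list of TTFT samples."""
    return {
        "count": len(ttft_seconds),
        "mean": sum(ttft_seconds) / len(ttft_seconds),
        "p50": percentile(ttft_seconds, 50),
        "p90": percentile(ttft_seconds, 90),
        "p99": percentile(ttft_seconds, 99),
    }
```

Because the first requests also pay for NIXL connection setup, it can be useful to summarize warm-up and steady-state samples separately.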

Step 7: Graceful Shutdown

Terminate all processes using the orchestration script's cleanup mechanism, which handles signal propagation (SIGINT, SIGTERM) and kills all tracked PIDs including the proxy, prefillers, and decoders. Verify that all processes have exited and ports are released.

Key considerations:

  • The orchestration script traps signals for graceful shutdown
  • Process group termination ensures no orphaned processes
  • Log files are preserved for post-mortem analysis
  • Port cleanup is essential before restarting the system
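
The cleanup pattern described above can be sketched in Python as follows. This is an illustration of the general approach (track children, trap SIGINT and SIGTERM, terminate, then force-kill stragglers), not the workflow's orchestration script. `TrackedProcesses` is a hypothetical name, and the sketch tracks individual PIDs and does not use OS process groups.

```python
import signal
import subprocess
import sys


class TrackedProcesses:
    """Track child processes and terminate them all on SIGINT or SIGTERM."""

    def __init__(self):
        self.children = []

    def spawn(self, argv, log_path=None):
        """Start a child, appending its output to log_path if one is given."""
        if log_path:
            with open(log_path, "ab") as log:
                proc = subprocess.Popen(argv, stdout=log, stderr=subprocess.STDOUT)
        else:
            proc = subprocess.Popen(
                argv, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
            )
        self.children.append(proc)
        return proc

    def shutdown(self, grace_s=10.0):
        """Ask every live child to exit, then force-kill any that do not."""
        for proc in self.children:
            if proc.poll() is None:
                proc.terminate()  # polite request (SIGTERM on POSIX)
        for proc in self.children:
            try:
                proc.wait(timeout=grace_s)
            except subprocess.TimeoutExpired:
                proc.kill()  # force-kill stragglers
                proc.wait()

    def install_signal_handlers(self):
        """Run shutdown() when this process receives SIGINT or SIGTERM."""
        def handler(signum, frame):
            self.shutdown()
            sys.exit(128 + signum)

        signal.signal(signal.SIGINT, handler)
        signal.signal(signal.SIGTERM, handler)
```

Log files are opened in append mode and never deleted, so they remain available for post-mortem analysis after shutdown.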
