Workflow: KServe LLM Disaggregated Serving
| Knowledge Sources | |
|---|---|
| Domains | LLM_Serving, Kubernetes, GPU_Inference, Distributed_Systems, RDMA |
| Last Updated | 2026-02-13 14:00 GMT |
Overview
End-to-end process for deploying large language models with disaggregated prefill-decode architecture, separating compute-intensive prefill and memory-intensive decode phases across dedicated GPU pools with KV cache transfer.
Description
This workflow covers the advanced deployment pattern for LLM inference where the prefill phase (processing the input prompt to generate KV cache) and the decode phase (generating output tokens sequentially) are separated into dedicated worker pools. Prefill pods handle the compute-intensive prompt processing, then transfer the KV cache to decode pods via RDMA (Remote Direct Memory Access) for efficient token generation. This architecture optimizes resource utilization by allowing each phase to be scaled and tuned independently. It supports both single-node GPU setups and multi-node deployments with data parallelism and expert parallelism for Mixture-of-Experts models like DeepSeek-R1.
Usage
Execute this workflow when deploying LLMs in high-concurrency production environments where the prefill phase is a bottleneck, or when serving large Mixture-of-Experts models that require multi-node data-parallel and expert-parallel execution. This is appropriate for clusters with RDMA-capable networking (RoCE or InfiniBand) and dedicated GPU resources. Use this when you need maximum throughput and minimal latency for generative AI workloads.
Execution Steps
Step 1: Prepare cluster with RDMA networking
Configure the Kubernetes cluster with RDMA-capable network infrastructure. Apply SR-IOV network node policies and network attachment definitions for RoCE (RDMA over Converged Ethernet). Verify that GPU nodes have RDMA device resources (rdma/roce_gdr) available and that the network operator is properly configured.
Key considerations:
- SR-IOV network node policy must match the cluster's physical NIC configuration
- Network attachment definition specifies the RDMA resource name
- Each worker pod will request one rdma/roce_gdr resource
- Verify RDMA device availability on nodes with kubectl describe node
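The per-pod resource request from the bullets above looks like the following worker pod template excerpt. This is a sketch: the network attachment name `roce-gdr` and the GPU count are assumptions for illustration, and must match the NetworkAttachmentDefinition actually applied to the cluster.

```yaml
# Worker pod template excerpt (sketch): the attachment name "roce-gdr"
# and GPU count are illustrative assumptions.
metadata:
  annotations:
    k8s.v1.cni.cncf.io/networks: roce-gdr   # Multus attaches the RoCE NIC
spec:
  containers:
  - name: vllm-worker
    resources:
      limits:
        nvidia.com/gpu: "8"
        rdma/roce_gdr: "1"   # one RDMA device per worker pod
```

Confirm the `rdma/roce_gdr` capacity appears under Allocatable in `kubectl describe node` before scheduling workers.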
Step 2: Prepare model weights on PVC
For large models (e.g., DeepSeek-R1 at 671B parameters), pre-download model weights to a PersistentVolumeClaim rather than downloading at pod startup. Create a PVC with sufficient storage and run a download Job that fetches the model from HuggingFace Hub to the PVC.
Key considerations:
- Large models may require 1-2TB of storage
- Use huggingface-cli with resume-download for reliable large transfers
- The PVC must be ReadWriteMany or ReadOnlyMany for multi-pod access
- Download can be done once and shared across all worker pools
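A minimal sketch of the PVC and download Job, assuming a RWX-capable storage class and a generic Python image (both are assumptions to adapt); `huggingface-cli download` resumes partial transfers on retry:

```yaml
# Sketch: storage class, size, and image are assumptions. Size the PVC
# for your model (DeepSeek-R1 FP8 weights alone are roughly 700 GB).
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-weights
spec:
  accessModes: ["ReadWriteMany"]   # shared across prefill and decode pools
  resources:
    requests:
      storage: 1500Gi
---
apiVersion: batch/v1
kind: Job
metadata:
  name: download-deepseek-r1
spec:
  template:
    spec:
      restartPolicy: OnFailure     # retries resume the partial download
      containers:
      - name: download
        image: python:3.12-slim    # assumption: any image with pip works
        command: ["/bin/sh", "-c"]
        args:
        - |
          pip install -U huggingface_hub &&
          huggingface-cli download deepseek-ai/DeepSeek-R1 \
            --local-dir /mnt/models/DeepSeek-R1
        volumeMounts:
        - name: weights
          mountPath: /mnt/models
      volumes:
      - name: weights
        persistentVolumeClaim:
          claimName: model-weights
```

Run the Job once; every worker pool then mounts the same PVC read-only.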
Step 3: Write the LLMInferenceService with prefill-decode separation
Author the LLMInferenceService YAML manifest with separate workerSpec sections for the main (decode) pool and prefill pool. Configure the scheduler with prefill-decode profile handler, prefix cache scoring, and load-aware routing. Set the data parallelism degree, tensor parallelism, and all-to-all backend for MoE models.
Key components to configure:
- Main workerSpec for decode workers with replica count and GPU allocation
- Prefill workerSpec with separate replica count and potentially different resource limits
- Scheduler configuration with pd-profile-handler, prefix-cache-scorer, and load-aware-scorer
- KV cache transfer settings (NixlConnector) for RDMA-based cache movement
- UCX transport layer configuration for optimal RDMA performance
- GPU memory utilization ratios (typically 0.95-0.99)
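The components above fit together roughly as follows. This is a sketch only: the exact `LLMInferenceService` schema varies by KServe release, and the field names, replica counts, and env values here are illustrative assumptions to be checked against your version's CRD.

```yaml
# Illustrative shape only -- validate field names against your KServe
# release's LLMInferenceService schema before applying.
apiVersion: serving.kserve.io/v1alpha1
kind: LLMInferenceService
metadata:
  name: deepseek-r1-pd
spec:
  model:
    uri: pvc://model-weights/DeepSeek-R1
    name: deepseek-ai/DeepSeek-R1
  router:
    scheduler: {}        # enables the inference scheduler (P/D routing)
  prefill:               # dedicated prefill pool
    replicas: 2
    worker:
      containers:
      - name: main
        resources:
          limits:
            nvidia.com/gpu: "8"
            rdma/roce_gdr: "1"
  worker:                # main (decode) pool
    containers:
    - name: main
      env:
      - name: VLLM_ALL2ALL_BACKEND
        value: deepep_high_throughput   # MoE all-to-all backend
      resources:
        limits:
          nvidia.com/gpu: "8"
          rdma/roce_gdr: "1"
```

The scheduler's scorer plugins (pd-profile-handler, prefix-cache-scorer, load-aware-scorer) and the NixlConnector/UCX transfer settings are configured alongside this spec.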
Step 4: Apply and monitor deployment
Submit the LLMInferenceService manifest. The controller creates separate StatefulSets for prefill and decode worker pools, a scheduler deployment, and routing infrastructure. Monitor pod creation, model loading progress across all pools, and the overall service status.
What happens internally:
- Decode worker pods are created from the main workerSpec template
- Prefill worker pods are created from the prefill workerSpec template
- Scheduler pod is deployed with prefill-decode routing configuration
- Each pool downloads and initializes the model independently
- RDMA connections are established between prefill and decode pods
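Typical commands for this step (resource names and labels are illustrative; these require a live cluster):

```sh
# Submit the manifest.
kubectl apply -f deepseek-r1-pd.yaml

# Watch the prefill pool, decode pool, and scheduler pods come up.
kubectl get pods -w

# Follow model loading progress in one decode worker.
kubectl logs deepseek-r1-pd-0 -c main -f

# Inspect overall service status and conditions.
kubectl get llminferenceservice deepseek-r1-pd -o yaml
```

Model loading for very large models can take many minutes per pool; the service is ready only when every pool reports ready.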
Step 5: Validate prefill-decode routing
Once all pods are ready, send inference requests and verify that the scheduler routes each request to a prefill worker for prompt processing, with token generation continuing on a decode worker after the KV cache transfer. Monitor scheduler logs for routing decisions and KV cache transfer events between pools.
Request routing flow:
- New request arrives at the scheduler
- Scheduler routes to a prefill worker based on load and prefix cache match
- Prefill worker processes the prompt and generates KV cache
- KV cache is transferred to a decode worker via RDMA
- Decode worker generates output tokens using the transferred cache
- Response is streamed back to the client
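The scheduler's choice in the second bullet, picking a prefill worker by combining prefix-cache match with current load, can be sketched as below. This is a hypothetical illustration in the spirit of the prefix-cache-scorer and load-aware-scorer; the weights, field names, and scoring formula are assumptions, not the real plugins' implementation.

```python
# Hypothetical sketch of weighted worker scoring; weights and formula
# are illustrative assumptions, not the actual scheduler plugins.
from dataclasses import dataclass

@dataclass
class Worker:
    name: str
    prefix_cache_hit: float  # fraction of the prompt already cached (0..1)
    queue_depth: int         # outstanding requests on this worker

def score(w: Worker, prefix_weight: float = 2.0, load_weight: float = 1.0,
          max_queue: int = 16) -> float:
    # Higher prefix reuse is better; deeper queues are worse.
    load_score = 1.0 - min(w.queue_depth, max_queue) / max_queue
    return prefix_weight * w.prefix_cache_hit + load_weight * load_score

def pick_prefill_worker(workers: list[Worker]) -> Worker:
    return max(workers, key=score)

workers = [
    Worker("prefill-0", prefix_cache_hit=0.9, queue_depth=12),
    Worker("prefill-1", prefix_cache_hit=0.1, queue_depth=1),
]
best = pick_prefill_worker(workers)  # prefix reuse outweighs the load here
```

With these weights, a strong prefix-cache match wins even against a busier worker; raising `load_weight` shifts the balance toward spreading load.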
Step 6: Tune and scale pools independently
Adjust the replica count and resource allocation for prefill and decode pools based on observed workload patterns. Scale prefill replicas up if prompt processing is the bottleneck, or scale decode replicas if token generation throughput is insufficient. Tune scheduler parameters like prefix cache scoring weight and load-aware scoring weight.
Key tuning parameters:
- Prefill pool replica count for concurrent prompt processing capacity
- Decode pool replica count for concurrent generation throughput
- GPU memory utilization ratio (higher values leave more VRAM for KV cache but increase out-of-memory risk)
- Scheduler threshold for when to use prefill-decode separation vs. direct serving
- All-to-all backend selection (deepep_high_throughput vs. pplx)
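Scorer weights are typically set in the scheduler's profile configuration. A fragment in the shape used by the inference scheduler is sketched below; the weight values are tuning assumptions, not defaults, and the exact config schema should be checked against your scheduler version.

```yaml
# Illustrative scheduler profile fragment: plugin names match the scorers
# discussed above; weight values are tuning assumptions, not defaults.
schedulingProfiles:
- name: prefill
  plugins:
  - pluginRef: prefix-cache-scorer
    weight: 2          # favor workers with cached prompt prefixes
  - pluginRef: load-aware-scorer
    weight: 1          # break ties toward less-loaded workers
```

Re-measure throughput and time-to-first-token after each change, since prefill and decode bottlenecks shift as the pools are rescaled.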