Workflow: KServe LLM Disaggregated Serving
| Knowledge Sources | |
|---|---|
| Domains | LLM_Serving, Kubernetes, GPU_Inference, Distributed_Systems, RDMA |
| Last Updated | 2026-02-13 14:00 GMT |
Overview
End-to-end process for deploying large language models with disaggregated prefill-decode architecture, separating compute-intensive prefill and memory-intensive decode phases across dedicated GPU pools with KV cache transfer.
Description
This workflow covers the advanced deployment pattern for LLM inference where the prefill phase (processing the input prompt to generate KV cache) and the decode phase (generating output tokens sequentially) are separated into dedicated worker pools. Prefill pods handle the compute-intensive prompt processing, then transfer the KV cache to decode pods via RDMA (Remote Direct Memory Access) for efficient token generation. This architecture optimizes resource utilization by allowing each phase to be scaled and tuned independently. It supports both single-node GPU setups and multi-node deployments with data parallelism and expert parallelism for Mixture-of-Experts models like DeepSeek-R1.
Usage
Execute this workflow when deploying LLMs in high-concurrency production environments where the prefill phase is a bottleneck, or when serving large Mixture-of-Experts models that require multi-node data-parallel and expert-parallel execution. This is appropriate for clusters with RDMA-capable networking (RoCE or InfiniBand) and dedicated GPU resources. Use this when you need maximum throughput and minimal latency for generative AI workloads.
Execution Steps
Step 1: Prepare cluster with RDMA networking
Configure the Kubernetes cluster with RDMA-capable network infrastructure. Apply SR-IOV network node policies and network attachment definitions for RoCE (RDMA over Converged Ethernet). Verify that GPU nodes have RDMA device resources (rdma/roce_gdr) available and that the network operator is properly configured.
Key considerations:
- SR-IOV network node policy must match the cluster's physical NIC configuration
- Network attachment definition specifies the RDMA resource name
- Each worker pod will request one rdma/roce_gdr resource
- Verify RDMA device availability on nodes with kubectl describe node
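The per-pod resource request from the bullets above looks like the following worker pod template excerpt. This is a sketch: the network attachment name `roce-gdr` and the GPU count are assumptions for illustration, and must match the NetworkAttachmentDefinition actually applied to the cluster.

```yaml
# Worker pod template excerpt (sketch): the attachment name "roce-gdr"
# and GPU count are illustrative assumptions.
metadata:
  annotations:
    k8s.v1.cni.cncf.io/networks: roce-gdr   # Multus attaches the RoCE NIC
spec:
  containers:
  - name: vllm-worker
    resources:
      limits:
        nvidia.com/gpu: "8"
        rdma/roce_gdr: "1"   # one RDMA device per worker pod
```

Confirm the `rdma/roce_gdr` capacity appears under Allocatable in `kubectl describe node` before scheduling workers.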
Step 2: Prepare model weights on PVC
For large models (e.g., DeepSeek-R1 at 671B parameters), pre-download model weights to a PersistentVolumeClaim rather than downloading at pod startup. Create a PVC with sufficient storage and run a download Job that fetches the model from HuggingFace Hub to the PVC.
Key considerations:
- Large models may require 1-2TB of storage
- Use huggingface-cli with resume-download for reliable large transfers
- The PVC must be ReadWriteMany or ReadOnlyMany for multi-pod access
- Download can be done once and shared across all worker pools
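A minimal sketch of the PVC and download Job, assuming a RWX-capable storage class and a generic Python image (both are assumptions to adapt); `huggingface-cli download` resumes partial transfers on retry:

```yaml
# Sketch: storage class, size, and image are assumptions. Size the PVC
# for your model (DeepSeek-R1 FP8 weights alone are roughly 700 GB).
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-weights
spec:
  accessModes: ["ReadWriteMany"]   # shared across prefill and decode pools
  resources:
    requests:
      storage: 1500Gi
---
apiVersion: batch/v1
kind: Job
metadata:
  name: download-deepseek-r1
spec:
  template:
    spec:
      restartPolicy: OnFailure     # retries resume the partial download
      containers:
      - name: download
        image: python:3.12-slim    # assumption: any image with pip works
        command: ["/bin/sh", "-c"]
        args:
        - |
          pip install -U huggingface_hub &&
          huggingface-cli download deepseek-ai/DeepSeek-R1 \
            --local-dir /mnt/models/DeepSeek-R1
        volumeMounts:
        - name: weights
          mountPath: /mnt/models
      volumes:
      - name: weights
        persistentVolumeClaim:
          claimName: model-weights
```

Run the Job once; every worker pool then mounts the same PVC read-only.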
Step 3: Write the LLMInferenceService with prefill-decode separation
Author the LLMInferenceService YAML manifest with separate workerSpec sections for the main (decode) pool and prefill pool. Configure the scheduler with prefill-decode profile handler, prefix cache scoring, and load-aware routing. Set the data parallelism degree, tensor parallelism, and all-to-all backend for MoE models.
Key components to configure:
- Main workerSpec for decode workers with replica count and GPU allocation
- Prefill workerSpec with separate replica count and potentially different resource limits
- Scheduler configuration with pd-profile-handler, prefix-cache-scorer, and load-aware-scorer
- KV cache transfer settings (NixlConnector) for RDMA-based cache movement
- UCX transport layer configuration for optimal RDMA performance
- GPU memory utilization ratios (typically 0.95-0.99)
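The components above fit together roughly as follows. This is a sketch only: the exact `LLMInferenceService` schema varies by KServe release, and the field names, replica counts, and env values here are illustrative assumptions to be checked against your version's CRD.

```yaml
# Illustrative shape only -- validate field names against your KServe
# release's LLMInferenceService schema before applying.
apiVersion: serving.kserve.io/v1alpha1
kind: LLMInferenceService
metadata:
  name: deepseek-r1-pd
spec:
  model:
    uri: pvc://model-weights/DeepSeek-R1
    name: deepseek-ai/DeepSeek-R1
  router:
    scheduler: {}        # enables the inference scheduler (P/D routing)
  prefill:               # dedicated prefill pool
    replicas: 2
    worker:
      containers:
      - name: main
        resources:
          limits:
            nvidia.com/gpu: "8"
            rdma/roce_gdr: "1"
  worker:                # main (decode) pool
    containers:
    - name: main
      env:
      - name: VLLM_ALL2ALL_BACKEND
        value: deepep_high_throughput   # MoE all-to-all backend
      resources:
        limits:
          nvidia.com/gpu: "8"
          rdma/roce_gdr: "1"
```

The scheduler's scorer plugins (pd-profile-handler, prefix-cache-scorer, load-aware-scorer) and the NixlConnector/UCX transfer settings are configured alongside this spec.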
Step 4: Apply and monitor deployment
Submit the LLMInferenceService manifest. The controller creates separate StatefulSets for prefill and decode worker pools, a scheduler deployment, and routing infrastructure. Monitor pod creation, model loading progress across all pools, and the overall service status.
What happens internally:
- Decode worker pods are created from the main workerSpec template
- Prefill worker pods are created from the prefill workerSpec template
- Scheduler pod is deployed with prefill-decode routing configuration
- Each pool downloads and initializes the model independently
- RDMA connections are established between prefill and decode pods
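Typical commands for this step (resource names and labels are illustrative; these require a live cluster):

```sh
# Submit the manifest.
kubectl apply -f deepseek-r1-pd.yaml

# Watch the prefill pool, decode pool, and scheduler pods come up.
kubectl get pods -w

# Follow model loading progress in one decode worker.
kubectl logs deepseek-r1-pd-0 -c main -f

# Inspect overall service status and conditions.
kubectl get llminferenceservice deepseek-r1-pd -o yaml
```

Model loading for very large models can take many minutes per pool; the service is ready only when every pool reports ready.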
Step 5: Validate prefill-decode routing
Once all pods are ready, send inference requests and verify that the scheduler routes each request to a prefill worker for prompt processing, with token generation continuing on a decode worker after the KV cache transfer. Monitor scheduler logs for routing decisions and KV cache transfer events between pools.
Request routing flow:
- New request arrives at the scheduler
- Scheduler routes to a prefill worker based on load and prefix cache match
- Prefill worker processes the prompt and generates KV cache
- KV cache is transferred to a decode worker via RDMA
- Decode worker generates output tokens using the transferred cache
- Response is streamed back to the client
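The scheduler's choice in the second bullet, picking a prefill worker by combining prefix-cache match with current load, can be sketched as below. This is a hypothetical illustration in the spirit of the prefix-cache-scorer and load-aware-scorer; the weights, field names, and scoring formula are assumptions, not the real plugins' implementation.

```python
# Hypothetical sketch of weighted worker scoring; weights and formula
# are illustrative assumptions, not the actual scheduler plugins.
from dataclasses import dataclass

@dataclass
class Worker:
    name: str
    prefix_cache_hit: float  # fraction of the prompt already cached (0..1)
    queue_depth: int         # outstanding requests on this worker

def score(w: Worker, prefix_weight: float = 2.0, load_weight: float = 1.0,
          max_queue: int = 16) -> float:
    # Higher prefix reuse is better; deeper queues are worse.
    load_score = 1.0 - min(w.queue_depth, max_queue) / max_queue
    return prefix_weight * w.prefix_cache_hit + load_weight * load_score

def pick_prefill_worker(workers: list[Worker]) -> Worker:
    return max(workers, key=score)

workers = [
    Worker("prefill-0", prefix_cache_hit=0.9, queue_depth=12),
    Worker("prefill-1", prefix_cache_hit=0.1, queue_depth=1),
]
best = pick_prefill_worker(workers)  # prefix reuse outweighs the load here
```

With these weights, a strong prefix-cache match wins even against a busier worker; raising `load_weight` shifts the balance toward spreading load.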
Step 6: Tune and scale pools independently
Adjust the replica count and resource allocation for prefill and decode pools based on observed workload patterns. Scale prefill replicas up if prompt processing is the bottleneck, or scale decode replicas if token generation throughput is insufficient. Tune scheduler parameters like prefix cache scoring weight and load-aware scoring weight.
Key tuning parameters:
- Prefill pool replica count for concurrent prompt processing capacity
- Decode pool replica count for concurrent generation throughput
- GPU memory utilization ratio (higher values leave more VRAM for KV cache but increase out-of-memory risk)
- Scheduler threshold for when to use prefill-decode separation vs. direct serving
- All-to-all backend selection (deepep_high_throughput vs. pplx)
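Scorer weights are typically set in the scheduler's profile configuration. A fragment in the shape used by the inference scheduler is sketched below; the weight values are tuning assumptions, not defaults, and the exact config schema should be checked against your scheduler version.

```yaml
# Illustrative scheduler profile fragment: plugin names match the scorers
# discussed above; weight values are tuning assumptions, not defaults.
schedulingProfiles:
- name: prefill
  plugins:
  - pluginRef: prefix-cache-scorer
    weight: 2          # favor workers with cached prompt prefixes
  - pluginRef: load-aware-scorer
    weight: 1          # break ties toward less-loaded workers
```

Re-measure throughput and time-to-first-token after each change, since prefill and decode bottlenecks shift as the pools are rescaled.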