Principle: KServe Disaggregated Deployment
| Knowledge Sources | |
|---|---|
| Domains | Distributed_Systems, LLM_Serving, GPU_Computing |
| Last Updated | 2026-02-13 00:00 GMT |
Overview
A multi-node deployment pattern that applies data parallelism, expert parallelism, and tensor parallelism to serve large Mixture-of-Experts models across GPU clusters.
Description
Disaggregated Deployment handles the operational complexity of deploying models like DeepSeek-R1 (600B+ parameters, MoE architecture) across multiple GPU nodes:
- Data Parallelism (DP): Replicates the model across node groups for throughput scaling.
- Expert Parallelism (EP): Distributes MoE experts across GPUs within a node.
- Tensor Parallelism (TP): Shards individual layers across GPUs.
The parallelism spec in the YAML defines data: 32 (total DP ranks), dataLocal: 8 (DP ranks per node), expert: true, and tensor: 1. The node count is data / dataLocal. A LeaderWorkerSet (LWS) manages multi-pod coordination.
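The spec above can be sketched as the following YAML fragment. This is illustrative only: the field names follow this document's description, but the exact CRD path and schema may differ by KServe version.

```yaml
# Sketch of the parallelism spec described above (not an authoritative schema)
parallelism:
  data: 32       # total data-parallel ranks
  dataLocal: 8   # DP ranks per node -> nodes = 32 / 8 = 4
  expert: true   # distribute MoE experts across GPUs within a node
  tensor: 1      # no tensor sharding within a DP rank
```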
Usage
Use for MoE models whose weights exceed single-node GPU memory. Requires RDMA networking (InfiniBand or RoCE) for inter-node communication via NCCL/NVSHMEM.
Theoretical Basis
# Parallelism model (NOT implementation code)
Given: data=32, dataLocal=8, expert=true, tensor=1
  Nodes      = data / dataLocal = 32 / 8 = 4
  Per node   = dataLocal = 8 GPUs, with expert parallelism across them
  Total GPUs = data * tensor = 32

Communication:
  NCCL over RDMA (InfiniBand/RoCE) for inter-node collectives
  NVSHMEM for GPU-to-GPU transfers within a node
  DeepEP all-to-all backend for MoE dispatch/combine:
    VLLM_ALL2ALL_BACKEND: "deepep_high_throughput"

Expert routing:
  Experts are distributed across GPUs; each request is routed only to
  the experts its router selects.
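The topology arithmetic above can be checked with a small helper. This is a hypothetical sketch, not KServe code; the field names mirror the parallelism spec described in this document.

```python
from dataclasses import dataclass

@dataclass
class ParallelismSpec:
    data: int        # total data-parallel ranks
    data_local: int  # DP ranks placed on each node (dataLocal in the YAML)
    expert: bool     # enable expert parallelism for MoE layers
    tensor: int      # tensor-parallel degree within each DP rank

    @property
    def nodes(self) -> int:
        # Nodes = data / dataLocal; must divide evenly for LWS scheduling
        assert self.data % self.data_local == 0, "data must divide evenly across nodes"
        return self.data // self.data_local

    @property
    def total_gpus(self) -> int:
        # Each DP rank holds `tensor` GPUs
        return self.data * self.tensor

spec = ParallelismSpec(data=32, data_local=8, expert=True, tensor=1)
print(spec.nodes)       # 4 nodes
print(spec.total_gpus)  # 32 GPUs
```

With tensor > 1, total GPU count scales multiplicatively, which is why TP is kept at 1 here and expert parallelism carries the intra-node sharding.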