Principle: KServe Disaggregated Deployment
| Knowledge Sources | |
|---|---|
| Domains | Distributed_Systems, LLM_Serving, GPU_Computing |
| Last Updated | 2026-02-13 00:00 GMT |
Overview
A multi-node deployment pattern that applies data parallelism, expert parallelism, and tensor parallelism to serve large Mixture-of-Experts models across GPU clusters.
Description
Disaggregated Deployment handles the operational complexity of deploying models like DeepSeek-R1 (600B+ parameters, MoE architecture) across multiple GPU nodes:
- Data Parallelism (DP): Replicates the model across node groups for throughput scaling.
- Expert Parallelism (EP): Distributes MoE experts across GPUs within a node.
- Tensor Parallelism (TP): Shards individual layers across GPUs.
The parallelism spec in the YAML defines data: 32 (total DP ranks), dataLocal: 8 (DP ranks per node), expert: true, and tensor: 1. The node count is data / dataLocal. A LeaderWorkerSet (LWS) manages multi-pod coordination.
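The spec above can be sketched as the following YAML fragment. This is illustrative only: the field names follow this document's description, but the exact CRD path and schema may differ by KServe version.

```yaml
# Sketch of the parallelism spec described above (not an authoritative schema)
parallelism:
  data: 32       # total data-parallel ranks
  dataLocal: 8   # DP ranks per node -> nodes = 32 / 8 = 4
  expert: true   # distribute MoE experts across GPUs within a node
  tensor: 1      # no tensor sharding within a DP rank
```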
Usage
Use for MoE models whose weights exceed single-node GPU memory. Requires RDMA networking (InfiniBand or RoCE) for inter-node communication via NCCL/NVSHMEM.
Theoretical Basis
# Parallelism model (NOT implementation code)
Given: data=32, dataLocal=8, expert=true, tensor=1
  Nodes      = data / dataLocal = 32 / 8 = 4
  Per node   = dataLocal = 8 GPUs, with expert parallelism across them
  Total GPUs = data * tensor = 32

Communication:
  NCCL over RDMA (InfiniBand/RoCE) for inter-node collectives
  NVSHMEM for GPU-to-GPU transfers within a node
  DeepEP all-to-all backend for MoE dispatch/combine:
    VLLM_ALL2ALL_BACKEND: "deepep_high_throughput"

Expert routing:
  Experts are distributed across GPUs; each request is routed only to
  the experts its router selects.
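The topology arithmetic above can be checked with a small helper. This is a hypothetical sketch, not KServe code; the field names mirror the parallelism spec described in this document.

```python
from dataclasses import dataclass

@dataclass
class ParallelismSpec:
    data: int        # total data-parallel ranks
    data_local: int  # DP ranks placed on each node (dataLocal in the YAML)
    expert: bool     # enable expert parallelism for MoE layers
    tensor: int      # tensor-parallel degree within each DP rank

    @property
    def nodes(self) -> int:
        # Nodes = data / dataLocal; must divide evenly for LWS scheduling
        assert self.data % self.data_local == 0, "data must divide evenly across nodes"
        return self.data // self.data_local

    @property
    def total_gpus(self) -> int:
        # Each DP rank holds `tensor` GPUs
        return self.data * self.tensor

spec = ParallelismSpec(data=32, data_local=8, expert=True, tensor=1)
print(spec.nodes)       # 4 nodes
print(spec.total_gpus)  # 32 GPUs
```

With tensor > 1, total GPU count scales multiplicatively, which is why TP is kept at 1 here and expert parallelism carries the intra-node sharding.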