Implementation: KServe DP+EP Deployment Pattern
| Knowledge Sources | |
|---|---|
| Domains | Distributed_Systems, LLM_Serving, GPU_Computing |
| Last Updated | 2026-02-13 00:00 GMT |
Overview
Concrete YAML pattern for deploying large MoE LLMs with data parallelism and expert parallelism across multi-node GPU clusters.
Description
The DP+EP deployment pattern uses the LLMInferenceService parallelism spec and a LeaderWorkerSet (LWS) to coordinate multi-node inference. Key environment variables configure NCCL for RDMA inter-node communication and NVSHMEM for GPU-direct remote memory access.
Usage
Use for models like DeepSeek-R1 that require multiple GPU nodes. Requires RDMA networking with SR-IOV configured on the cluster.
Code Reference
Source Location
- Repository: kserve
- File: docs/samples/llmisvc/dp-ep/deepseek-r1-gpu-rdma-roce/llm-inference-service-dp-ep-deepseek-r1-gpu-deepep-ht.yaml, Lines 1-182
Signature
```yaml
apiVersion: serving.kserve.io/v1alpha1
kind: LLMInferenceService
metadata:
  name: deepseek-r1-dp-ep
spec:
  model:
    uri: "pvc://llm-test-pvc-deepseek"
    name: "deepseek-ai/DeepSeek-R1-0528"
  parallelism:
    data: 32
    dataLocal: 8
    expert: true
    tensor: 1
  template:
    spec:
      containers:
        - name: main
          env:
            - name: VLLM_ALL2ALL_BACKEND
              value: "deepep_high_throughput"
            - name: NCCL_IB_GID_INDEX
              value: "3"
            - name: NCCL_SOCKET_IFNAME
              value: "net1"
            - name: NVSHMEM_REMOTE_TRANSPORT
              value: "ibgda"
            - name: NVIDIA_GDRCOPY
              value: "enabled"
          resources:
            limits:
              nvidia.com/gpu: "8"
              rdma/roce_gdr: 1
              memory: 512Gi
```
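The parallelism fields above imply the cluster topology: the total data-parallel rank count divided by the per-node rank count gives the number of nodes (the LWS group size). A minimal arithmetic sketch using the values from this manifest; the variable names are illustrative, not API fields:

```shell
# Topology implied by the parallelism spec above (values copied from the YAML).
data=32        # parallelism.data: total data-parallel ranks
dataLocal=8    # parallelism.dataLocal: data-parallel ranks per node
tensor=1       # parallelism.tensor: tensor-parallel degree
nodes=$((data / dataLocal))              # LWS group size (leader + workers)
gpus_per_node=$((dataLocal * tensor))    # matches nvidia.com/gpu: "8"
total_gpus=$((nodes * gpus_per_node))
echo "nodes=${nodes} gpus_per_node=${gpus_per_node} total_gpus=${total_gpus}"
# → nodes=4 gpus_per_node=8 total_gpus=32
```

Note that `data` must be evenly divisible by `dataLocal`, and `dataLocal * tensor` must not exceed the GPUs available per node.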
Import
```shell
kubectl apply -f llm-inference-service-dp-ep-deepseek-r1-gpu-deepep-ht.yaml
```
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| parallelism.data | int | Yes | Total data parallel ranks |
| parallelism.dataLocal | int | Yes | Data parallel ranks per node |
| parallelism.expert | bool | No | Enable expert parallelism for MoE |
| parallelism.tensor | int | No | Tensor parallel degree |
| RDMA network | network | Yes | SR-IOV RDMA for inter-node NCCL |
Outputs
| Name | Type | Description |
|---|---|---|
| Worker pods | LeaderWorkerSet | Multi-node vLLM worker pods coordinated by LWS |
| InferencePool | CRD | Endpoint pool for scheduler |
| Distributed model | vLLM | Model sharded across all GPUs with DP+EP |
Usage Examples
Deploy DeepSeek-R1
```shell
# 1. Ensure RDMA and PVC are ready
kubectl get sriovnetworknodepolicy
kubectl get pvc llm-test-pvc-deepseek

# 2. Deploy
kubectl apply -f llm-inference-service-dp-ep-deepseek-r1-gpu-deepep-ht.yaml

# 3. Monitor (model loading takes ~80 min for a ~600B-parameter model)
kubectl get llminferenceservice -owide
kubectl get pods -l app.kubernetes.io/component=llminferenceservice-workload

# 4. Watch logs for NCCL initialization
kubectl logs <leader-pod> -c main | grep "NCCL"
```
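Once the pods are Ready, a quick smoke test can confirm the endpoint is serving. This is a sketch under assumptions: the Service name is inferred from `metadata.name` in the manifest and the port from vLLM's default, neither is taken from the sample itself, so adjust both to your cluster:

```shell
# Hypothetical smoke test; Service name and port are assumptions, not from the sample.
kubectl port-forward svc/deepseek-r1-dp-ep 8000:8000 &
sleep 2

# vLLM serves an OpenAI-compatible API; listing models verifies the stack end to end.
curl -s http://localhost:8000/v1/models
```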
Related Pages
Requires Environment
- Environment:Kserve_Kserve_GPU_Accelerator
- Environment:Kserve_Kserve_SRIOV_RDMA_Network
- Environment:Kserve_Kserve_Leader_Worker_Set