
Implementation:Kserve DP EP Deployment Pattern

From Leeroopedia
Knowledge Sources
Domains Distributed_Systems, LLM_Serving, GPU_Computing
Last Updated 2026-02-13 00:00 GMT

Overview

A concrete YAML pattern for deploying large Mixture-of-Experts (MoE) LLMs with data parallelism (DP) and expert parallelism (EP) across multi-node GPU clusters.

Description

The DP+EP deployment pattern uses the LLMInferenceService parallelism spec together with a LeaderWorkerSet (LWS) to coordinate multi-node inference. Key environment variables configure NCCL for RDMA-based inter-node communication and NVSHMEM for GPU-initiated remote memory access.
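As a quick reference, the intent of each variable used in the Signature below, summarized from common vLLM, NCCL, and NVSHMEM documentation (exact semantics depend on library and driver versions):

```yaml
# VLLM_ALL2ALL_BACKEND=deepep_high_throughput  # DeepEP high-throughput all-to-all kernels for MoE token dispatch/combine
# NCCL_IB_GID_INDEX=3                          # GID table index for the IB transport; index 3 commonly selects RoCE v2
# NCCL_SOCKET_IFNAME=net1                      # bootstrap NCCL over the secondary SR-IOV interface, not the pod network
# NVSHMEM_REMOTE_TRANSPORT=ibgda               # GPU-initiated RDMA (InfiniBand GPUDirect Async)
# NVIDIA_GDRCOPY=enabled                       # GDRCopy for low-latency host-to-GPU control-path copies
```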

Usage

Use for models like DeepSeek-R1 that span multiple GPU nodes. Requires RDMA networking with SR-IOV configured on the cluster.
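A minimal sketch of the prerequisite network object, assuming the SR-IOV Network Operator is installed. The names (`net1`, `roce_gdr`), namespace, resource prefix, and IPAM range here are illustrative assumptions and must match your cluster's device-plugin configuration:

```yaml
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetwork
metadata:
  name: net1                                  # becomes the secondary interface referenced by NCCL_SOCKET_IFNAME
  namespace: openshift-sriov-network-operator
spec:
  resourceName: roce_gdr                      # surfaced as the RDMA device resource the pods request
  networkNamespace: default                   # namespace where the LLMInferenceService runs
  ipam: |
    {"type": "whereabouts", "range": "192.168.1.0/24"}
```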

Code Reference

Source Location

  • Repository: kserve
  • File: docs/samples/llmisvc/dp-ep/deepseek-r1-gpu-rdma-roce/llm-inference-service-dp-ep-deepseek-r1-gpu-deepep-ht.yaml, Lines 1-182

Signature

apiVersion: serving.kserve.io/v1alpha1
kind: LLMInferenceService
metadata:
  name: deepseek-r1-dp-ep
spec:
  model:
    uri: "pvc://llm-test-pvc-deepseek"
    name: "deepseek-ai/DeepSeek-R1-0528"
  parallelism:
    data: 32
    dataLocal: 8
    expert: true
    tensor: 1
  template:
    spec:
      containers:
        - name: main
          env:
            - name: VLLM_ALL2ALL_BACKEND
              value: "deepep_high_throughput"
            - name: NCCL_IB_GID_INDEX
              value: "3"
            - name: NCCL_SOCKET_IFNAME
              value: "net1"
            - name: NVSHMEM_REMOTE_TRANSPORT
              value: "ibgda"
            - name: NVIDIA_GDRCOPY
              value: "enabled"
          resources:
            limits:
              nvidia.com/gpu: "8"
              rdma/roce_gdr: 1
              memory: 512Gi

Import

kubectl apply -f llm-inference-service-dp-ep-deepseek-r1-gpu-deepep-ht.yaml

I/O Contract

Inputs

Name Type Required Description
parallelism.data int Yes Total data parallel ranks
parallelism.dataLocal int Yes Data parallel ranks per node
parallelism.expert bool No Enable expert parallelism for MoE
parallelism.tensor int No Tensor parallel degree
RDMA network network Yes SR-IOV RDMA for inter-node NCCL
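With the values from the Signature above, the group topology works out as follows (a sketch; the controller derives the actual node count):

```yaml
parallelism:
  data: 32        # total DP ranks across the cluster
  dataLocal: 8    # DP ranks (one per GPU) on each node
  expert: true    # MoE experts are sharded across the 32 DP ranks
  tensor: 1       # no tensor parallelism within a rank
  # nodes per group = data / dataLocal = 32 / 8 = 4
  # total GPUs     = data * tensor    = 32 * 1 = 32
```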

Outputs

Name Type Description
Worker pods LeaderWorkerSet Multi-node vLLM worker pods coordinated by LWS
InferencePool CRD Endpoint pool for scheduler
Distributed model vLLM Model sharded across all GPUs with DP+EP
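For orientation, the generated LeaderWorkerSet has roughly this shape (a hypothetical sketch; the controller owns the actual object, and the generated name and fields may differ by KServe version):

```yaml
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: deepseek-r1-dp-ep-workload  # hypothetical generated name
spec:
  replicas: 1                       # one leader/worker group
  leaderWorkerTemplate:
    size: 4                         # pods per group = data / dataLocal = 32 / 8
```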

Usage Examples

Deploy DeepSeek-R1

# 1. Ensure RDMA and PVC are ready
kubectl get sriovnetworknodepolicy
kubectl get pvc llm-test-pvc-deepseek

# 2. Deploy
kubectl apply -f llm-inference-service-dp-ep-deepseek-r1-gpu-deepep-ht.yaml

# 3. Monitor (model loading takes ~80 minutes for a ~600B-parameter model)
kubectl get llminferenceservice -owide
kubectl get pods -l app.kubernetes.io/component=llminferenceservice-workload

# 4. Watch logs for NCCL initialization
kubectl logs <leader-pod> -c main | grep "NCCL"

Related Pages

Implements Principle

Requires Environment

Uses Heuristic
