
Implementation:Kserve DP EP Deployment Pattern

From Leeroopedia
Knowledge Sources
Domains Distributed_Systems, LLM_Serving, GPU_Computing
Last Updated 2026-02-13 00:00 GMT

Overview

A concrete YAML pattern for deploying large Mixture-of-Experts (MoE) LLMs with data parallelism (DP) and expert parallelism (EP) across multi-node GPU clusters.

Description

The DP+EP deployment pattern uses the LLMInferenceService parallelism spec together with a LeaderWorkerSet (LWS) to coordinate multi-node inference. Key environment variables configure NCCL for RDMA-based inter-node communication and NVSHMEM for GPU-initiated remote memory access.
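As a quick reference, the intent of each variable used in the Signature below, summarized from common vLLM, NCCL, and NVSHMEM documentation (exact semantics depend on library and driver versions):

```yaml
# VLLM_ALL2ALL_BACKEND=deepep_high_throughput  # DeepEP high-throughput all-to-all kernels for MoE token dispatch/combine
# NCCL_IB_GID_INDEX=3                          # GID table index for the IB transport; index 3 commonly selects RoCE v2
# NCCL_SOCKET_IFNAME=net1                      # bootstrap NCCL over the secondary SR-IOV interface, not the pod network
# NVSHMEM_REMOTE_TRANSPORT=ibgda               # GPU-initiated RDMA (InfiniBand GPUDirect Async)
# NVIDIA_GDRCOPY=enabled                       # GDRCopy for low-latency host-to-GPU control-path copies
```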

Usage

Use for models like DeepSeek-R1 that span multiple GPU nodes. Requires RDMA networking with SR-IOV configured on the cluster.
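A minimal sketch of the prerequisite network object, assuming the SR-IOV Network Operator is installed. The names (`net1`, `roce_gdr`), namespace, resource prefix, and IPAM range here are illustrative assumptions and must match your cluster's device-plugin configuration:

```yaml
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetwork
metadata:
  name: net1                                  # becomes the secondary interface referenced by NCCL_SOCKET_IFNAME
  namespace: openshift-sriov-network-operator
spec:
  resourceName: roce_gdr                      # surfaced as the RDMA device resource the pods request
  networkNamespace: default                   # namespace where the LLMInferenceService runs
  ipam: |
    {"type": "whereabouts", "range": "192.168.1.0/24"}
```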

Code Reference

Source Location

  • Repository: kserve
  • File: docs/samples/llmisvc/dp-ep/deepseek-r1-gpu-rdma-roce/llm-inference-service-dp-ep-deepseek-r1-gpu-deepep-ht.yaml, Lines 1-182

Signature

apiVersion: serving.kserve.io/v1alpha1
kind: LLMInferenceService
metadata:
  name: deepseek-r1-dp-ep
spec:
  model:
    uri: "pvc://llm-test-pvc-deepseek"
    name: "deepseek-ai/DeepSeek-R1-0528"
  parallelism:
    data: 32
    dataLocal: 8
    expert: true
    tensor: 1
  template:
    spec:
      containers:
        - name: main
          env:
            - name: VLLM_ALL2ALL_BACKEND
              value: "deepep_high_throughput"
            - name: NCCL_IB_GID_INDEX
              value: "3"
            - name: NCCL_SOCKET_IFNAME
              value: "net1"
            - name: NVSHMEM_REMOTE_TRANSPORT
              value: "ibgda"
            - name: NVIDIA_GDRCOPY
              value: "enabled"
          resources:
            limits:
              nvidia.com/gpu: "8"
              rdma/roce_gdr: 1
              memory: 512Gi

Import

kubectl apply -f llm-inference-service-dp-ep-deepseek-r1-gpu-deepep-ht.yaml

I/O Contract

Inputs

Name Type Required Description
parallelism.data int Yes Total data parallel ranks
parallelism.dataLocal int Yes Data parallel ranks per node
parallelism.expert bool No Enable expert parallelism for MoE
parallelism.tensor int No Tensor parallel degree
RDMA network network Yes SR-IOV RDMA for inter-node NCCL
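With the values from the Signature above, the group topology works out as follows (a sketch; the controller derives the actual node count):

```yaml
parallelism:
  data: 32        # total DP ranks across the cluster
  dataLocal: 8    # DP ranks (one per GPU) on each node
  expert: true    # MoE experts are sharded across the 32 DP ranks
  tensor: 1       # no tensor parallelism within a rank
  # nodes per group = data / dataLocal = 32 / 8 = 4
  # total GPUs     = data * tensor    = 32 * 1 = 32
```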

Outputs

Name Type Description
Worker pods LeaderWorkerSet Multi-node vLLM worker pods coordinated by LWS
InferencePool CRD Endpoint pool for scheduler
Distributed model vLLM Model sharded across all GPUs with DP+EP
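For orientation, the generated LeaderWorkerSet has roughly this shape (a hypothetical sketch; the controller owns the actual object, and the generated name and fields may differ by KServe version):

```yaml
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: deepseek-r1-dp-ep-workload  # hypothetical generated name
spec:
  replicas: 1                       # one leader/worker group
  leaderWorkerTemplate:
    size: 4                         # pods per group = data / dataLocal = 32 / 8
```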

Usage Examples

Deploy DeepSeek-R1

# 1. Ensure RDMA and PVC are ready
kubectl get sriovnetworknodepolicy
kubectl get pvc llm-test-pvc-deepseek

# 2. Deploy
kubectl apply -f llm-inference-service-dp-ep-deepseek-r1-gpu-deepep-ht.yaml

# 3. Monitor (model loading takes ~80 minutes for a ~600B-parameter model)
kubectl get llminferenceservice -owide
kubectl get pods -l app.kubernetes.io/component=llminferenceservice-workload

# 4. Watch logs for NCCL initialization
kubectl logs <leader-pod> -c main | grep "NCCL"

Related Pages

Implements Principle

Requires Environment

Uses Heuristic
