Implementation:Kserve Kserve DeepSeek R1 PD DeepEP HT Sample

Knowledge Sources	Kserve_Kserve
Domains	Kubernetes, LLM Inference, Expert Parallelism, RDMA
Last Updated	2026-02-13 00:00 GMT

Overview

This file defines an LLMInferenceService for DeepSeek-R1-0528 with prefill-decode (PD) separation where both prefill and decode pools use the DeepEP high-throughput all-to-all backend with RDMA networking.

Description

This sample YAML deploys a v1alpha1 LLMInferenceService with separate decode and prefill template sections. Each section has its own container spec, GPU resources, RDMA configuration, and KV cache transfer settings via the NixlConnector. Both pools use 16-way data parallelism with 8-way local data parallelism, expert parallelism enabled, and a tensor parallelism of 1. The main template (decode) and the prefill section both set deepep_high_throughput as the all-to-all backend and carry extensive NCCL, NVSHMEM, and UCX configuration for RDMA over RoCE networking.
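The parallelism figures imply a pool topology that is worth spelling out. The sketch below assumes that each pod hosts `dataLocal` data-parallel ranks and that each rank occupies `tensor` GPUs, so a pool spans `data / dataLocal` pods; the variable names are illustrative and do not appear in the sample.

```shell
# Values copied from spec.parallelism (identical for the decode and prefill pools).
DATA=16        # total data-parallel ranks per pool
DATA_LOCAL=8   # data-parallel ranks hosted by one pod
TENSOR=1       # tensor-parallel degree

# Assumption: one rank uses TENSOR GPUs, so a pod needs DATA_LOCAL * TENSOR GPUs.
GPUS_PER_POD=$(( DATA_LOCAL * TENSOR ))
PODS_PER_POOL=$(( DATA / DATA_LOCAL ))
GPUS_PER_POOL=$(( DATA * TENSOR ))
# One decode pool plus one prefill pool, each with replicas: 1.
TOTAL_GPUS=$(( 2 * GPUS_PER_POOL ))

echo "pods per pool: ${PODS_PER_POOL}"
echo "GPUs per pod:  ${GPUS_PER_POD}"
echo "GPUs per pool: ${GPUS_PER_POOL}"
echo "GPUs in total: ${TOTAL_GPUS}"
```

Under these assumptions the per-pod figure matches the 8 GPUs requested per container, and the whole service needs 32 GPUs across four 8-GPU pods.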

Usage

Use this sample as a reference for deployments that combine expert parallelism, prefill-decode separation, and RDMA-accelerated KV cache transfer on GPU clusters with RoCE networking. It requires nodes with 8 NVIDIA GPUs each, RDMA/RoCE network interfaces, and a PVC pre-populated with the DeepSeek-R1-0528 model weights.
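A quick pre-flight check can confirm that the cluster-side requirements exist before applying the manifest. The helper below is a hypothetical sketch, not part of the sample: the function name is invented, the CRD name is inferred from the manifest's kind and API group, and the namespace must be supplied because the sample does not set one.

```shell
# Hypothetical helper; defines a function only and runs nothing by itself.
check_prereqs() {
  ns="${1:?usage: check_prereqs <namespace>}"

  # CRD name assumed from kind LLMInferenceService in group serving.kserve.io.
  kubectl get crd llminferenceservices.serving.kserve.io || return 1

  # PVC referenced by spec.model.uri (pvc://llm-test-pvc-deepseek).
  kubectl -n "$ns" get pvc llm-test-pvc-deepseek || return 1

  # Multus NetworkAttachmentDefinition named in the k8s.v1.cni.cncf.io/networks annotation.
  kubectl -n "$ns" get network-attachment-definitions.k8s.cni.cncf.io roce-p2 || return 1
}
```

Calling `check_prereqs my-namespace` returns non-zero at the first missing object; `kubectl` prints the reason.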

Code Reference

Source Location

Repository: Kserve_Kserve
File: docs/samples/llmisvc/dp-ep/deepseek-r1-gpu-rdma-roce/llm-inference-service-dp-ep-deepseek-r1-pd-gpu-p-deepep-ht-d-deepep-ht.yaml

Signature

apiVersion: serving.kserve.io/v1alpha1
kind: LLMInferenceService
metadata:
  name: deepseek-r1-0528-pd
  annotations:
    k8s.v1.cni.cncf.io/networks: roce-p2
spec:
  model:
    uri: pvc://llm-test-pvc-deepseek
    name: deepseek-ai/DeepSeek-R1-0528
  replicas: 1
  parallelism:
    data: 16
    dataLocal: 8
    expert: true
    tensor: 1
  router:
    scheduler: {}
    route: {}
    gateway: {}
  template:
    serviceAccountName: hfsa
    containers:
      - name: main
        env:
          - name: VLLM_ALL2ALL_BACKEND
            value: deepep_high_throughput
          - name: VLLM_ADDITIONAL_ARGS
            value: "--gpu-memory-utilization 0.99 --max-model-len 4096 ..."
  worker:
    # ... worker template (same backend)
  prefill:
    replicas: 1
    parallelism:
      data: 16
      dataLocal: 8
      expert: true
      tensor: 1
    template:
      # ... prefill template (deepep_high_throughput)

Import

kubectl apply -f docs/samples/llmisvc/dp-ep/deepseek-r1-gpu-rdma-roce/llm-inference-service-dp-ep-deepseek-r1-pd-gpu-p-deepep-ht-d-deepep-ht.yaml

I/O Contract

Model Configuration

Field	Value	Description
`spec.model.uri`	`pvc://llm-test-pvc-deepseek`	PVC containing model weights
`spec.model.name`	`deepseek-ai/DeepSeek-R1-0528`	HuggingFace model identifier

Parallelism Settings

Pool	Data	DataLocal	Expert	Tensor
Decode (main)	16	8	true	1
Prefill	16	8	true	1

GPU Resource Requirements (per container)

Resource	Requests	Limits
CPU	64	128
Memory	256Gi	512Gi
Ephemeral Storage	800Gi	800Gi
NVIDIA GPUs	8	8
RDMA/RoCE GDR	1	1
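To see which nodes can satisfy the GPU request, the allocatable GPU count can be listed per node. This sketch assumes the GPUs are advertised under the `nvidia.com/gpu` resource name used by the NVIDIA device plugin; the function name is illustrative, and the RDMA resource is left out because its resource name is not given here.

```shell
# Hypothetical helper: print the names of nodes that advertise at least 8 GPUs.
nodes_with_8_gpus() {
  kubectl get nodes --no-headers \
    -o 'custom-columns=NAME:.metadata.name,GPUS:.status.allocatable.nvidia\.com/gpu' \
    | awk '$2 + 0 >= 8 { print $1 }'   # "<none>" evaluates to 0 and is skipped
}
```

If fewer nodes are printed than the pools need, the pods are likely to stay unschedulable.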

Key Environment Variables

Variable	Value	Description
`VLLM_ALL2ALL_BACKEND`	`deepep_high_throughput`	DeepEP high-throughput all-to-all backend for MoE dispatch
`VLLM_ADDITIONAL_ARGS`	`--gpu-memory-utilization 0.99 ...`	vLLM arguments including NixlConnector KV cache transfer
`NCCL_IB_GID_INDEX`	`3`	InfiniBand GID index for RoCE v2
`NVSHMEM_REMOTE_TRANSPORT`	`ibgda`	InfiniBand GPUDirect Async (IBGDA) remote transport for NVSHMEM
`UCX_TLS`	`rc,sm,self,cuda_copy,cuda_ipc`	UCX transports enabled: RC verbs, shared memory, loopback, CUDA copy, CUDA IPC
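After the resource is applied, the environment variables can be read back from the stored spec to confirm which backend each pool uses. The function below is a hypothetical sketch: the decode path follows the manifest excerpt in the Signature section, while the prefill path assumes the elided prefill template has the same container layout.

```shell
# Hypothetical helper: $1 = JSONPath of a container, $2 = environment variable name.
get_env_value() {
  kubectl get llmisvc deepseek-r1-0528-pd \
    -o "jsonpath={$1.env[?(@.name==\"$2\")].value}"
}

# Usage (quote the path so the shell does not expand the brackets):
#   get_env_value '.spec.template.containers[0]' VLLM_ALL2ALL_BACKEND
#   get_env_value '.spec.prefill.template.containers[0]' VLLM_ALL2ALL_BACKEND   # assumed layout
```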

Network Configuration

Component	Configuration
CNI Network	`roce-p2` (via Multus annotation)
NCCL Socket Interface	`net1`
NVSHMEM Bootstrap	Two-stage with 300s timeout on `net1`
GPU Direct	NVIDIA GDRCOPY enabled
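Because NCCL and NVSHMEM are both pointed at `net1`, it can help to confirm that a pod actually received that secondary interface. The sketch below assumes Multus records attachments in the `k8s.v1.cni.cncf.io/network-status` pod annotation; the function name is invented and the pod name is a caller-supplied placeholder.

```shell
# Hypothetical helper: succeed if the pod's Multus status lists an interface named net1.
pod_has_net1() {
  pod="${1:?usage: pod_has_net1 <pod-name>}"
  kubectl get pod "$pod" \
    -o 'jsonpath={.metadata.annotations.k8s\.v1\.cni\.cncf\.io/network-status}' \
    | grep -q '"interface": *"net1"'
}
```

A non-zero exit status suggests the `roce-p2` attachment was not applied to that pod.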

Usage Examples

# Prerequisites:
# 1. PVC with DeepSeek-R1-0528 model weights
# 2. RoCE networking configured (Multus + SR-IOV)
# 3. KServe LLMInferenceService CRDs installed

# Deploy the service
kubectl apply -f docs/samples/llmisvc/dp-ep/deepseek-r1-gpu-rdma-roce/llm-inference-service-dp-ep-deepseek-r1-pd-gpu-p-deepep-ht-d-deepep-ht.yaml

# Check status
kubectl get llmisvc deepseek-r1-0528-pd
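Loading a model of this size can take a while, so blocking until the service reports readiness may be more convenient than polling by hand. This sketch assumes the LLMInferenceService status exposes a `Ready` condition; the function name and the 30-minute default timeout are illustrative choices.

```shell
# Hypothetical helper: wait for readiness, then dump details if the wait fails.
wait_ready() {
  name="${1:-deepseek-r1-0528-pd}"
  timeout="${2:-30m}"
  if ! kubectl wait --for=condition=Ready "llmisvc/${name}" --timeout="${timeout}"; then
    kubectl describe llmisvc "${name}"
    return 1
  fi
}
```

Running `wait_ready` with no arguments targets the name used in the sample manifest.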

Related Pages

Kserve_Kserve_DeepSeek_R1_PD_DeepEP_Pplx_Sample - Hybrid variant using Perplexity backend for decode
Kserve_Kserve_LLMInferenceService_Minimal_CRD - CRD definition for the LLMInferenceService resource
Kserve_Kserve_LLM_Decode_Worker_DP_Config - Base decode worker configuration template
Kserve_Kserve_LLM_Prefill_Worker_DP_Config - Base prefill worker configuration template
Kserve_Kserve_Gateway_Inference_Extension_CRDs - Gateway routing CRDs used by the router section
