Overview
This file defines an LLMInferenceService for DeepSeek-R1-0528 with prefill-decode (PD) separation, in which both the prefill and decode pools use the DeepEP high-throughput all-to-all backend over RDMA networking.
Description
This sample YAML deploys a v1alpha1 LLMInferenceService with separate prefill and decode template sections. Each section has its own container specs, GPU resources, RDMA configuration, and KV cache transfer settings via the NixlConnector. Each pool uses 16-way data parallelism with 8-way local data parallelism, expert parallelism enabled, and a tensor-parallel size of 1. Both the main template (decode) and the prefill section set deepep_high_throughput as the all-to-all backend and carry extensive NCCL, NVSHMEM, and UCX configuration for RDMA over RoCE.
Usage
Use this sample as a reference for a deployment that combines expert parallelism, prefill-decode separation, and RDMA-accelerated KV cache transfer on GPU clusters with RoCE networking. It requires nodes with 8 NVIDIA GPUs each, RDMA/RoCE network interfaces, and a PVC pre-populated with the DeepSeek-R1-0528 model weights.
Code Reference
Source Location
Signature
apiVersion: serving.kserve.io/v1alpha1
kind: LLMInferenceService
metadata:
  name: deepseek-r1-0528-pd
  annotations:
    k8s.v1.cni.cncf.io/networks: roce-p2
spec:
  model:
    uri: pvc://llm-test-pvc-deepseek
    name: deepseek-ai/DeepSeek-R1-0528
  replicas: 1
  parallelism:
    data: 16
    dataLocal: 8
    expert: true
    tensor: 1
  router:
    scheduler: {}
    route: {}
    gateway: {}
  template:
    serviceAccountName: hfsa
    containers:
      - name: main
        env:
          - name: VLLM_ALL2ALL_BACKEND
            value: deepep_high_throughput
          - name: VLLM_ADDITIONAL_ARGS
            value: "--gpu-memory-utilization 0.99 --max-model-len 4096 ..."
  worker:
    # ... worker template (same backend)
  prefill:
    replicas: 1
    parallelism:
      data: 16
      dataLocal: 8
      expert: true
      tensor: 1
    template:
      # ... prefill template (deepep_high_throughput)
Import
kubectl apply -f docs/samples/llmisvc/dp-ep/deepseek-r1-gpu-rdma-roce/llm-inference-service-dp-ep-deepseek-r1-pd-gpu-p-deepep-ht-d-deepep-ht.yaml
I/O Contract
Model Configuration
| Field | Value | Description |
|---|---|---|
| spec.model.uri | pvc://llm-test-pvc-deepseek | PVC containing model weights |
| spec.model.name | deepseek-ai/DeepSeek-R1-0528 | HuggingFace model identifier |
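The `pvc://` URI above refers to a PersistentVolumeClaim by name, so a claim called `llm-test-pvc-deepseek` must exist before the service is applied. The manifest below is only a sketch of what such a claim might look like. The access mode, storage size, and storage class name are assumptions for illustration and do not come from the sample.

```yaml
# Hypothetical PVC backing spec.model.uri (pvc://llm-test-pvc-deepseek).
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: llm-test-pvc-deepseek   # must match the name after pvc://
spec:
  accessModes:
    - ReadWriteMany             # assumption: pods on several nodes mount the same weights
  resources:
    requests:
      storage: 1Ti              # assumption: sized to hold the checkpoint with headroom
  storageClassName: shared-fs   # hypothetical storage class name
```

The claim would still need to be populated with the model weights separately; this sketch only reserves the volume.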
Parallelism Settings
| Pool | Data | DataLocal | Expert | Tensor |
|---|---|---|---|---|
| Decode (main) | 16 | 8 | true | 1 |
| Prefill | 16 | 8 | true | 1 |
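As a worked check of the table above, the pod and GPU counts can be derived from the parallelism fields. This assumes that `dataLocal` is the number of data-parallel ranks hosted by one pod and that each rank occupies `tensor` GPUs, which is the usual reading of vLLM's data-parallel size and local size; treat it as an interpretation rather than a statement of the controller's exact behaviour.

```latex
\begin{aligned}
\text{pods per pool} &= \frac{\texttt{data}}{\texttt{dataLocal}} = \frac{16}{8} = 2 \\
\text{GPUs per pod}  &= \texttt{dataLocal} \times \texttt{tensor} = 8 \times 1 = 8 \\
\text{GPUs per pool} &= \texttt{data} \times \texttt{tensor} = 16 \times 1 = 16 \\
\text{GPUs in total} &= 2~\text{pools} \times 16 = 32 \quad (\texttt{replicas: 1} \text{ for each pool})
\end{aligned}
```

Under these assumptions the 8 GPUs per pod agree with the 8 NVIDIA GPUs requested per container.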
GPU Resource Requirements (per container)
| Resource | Requests | Limits |
|---|---|---|
| CPU | 64 | 128 |
| Memory | 256Gi | 512Gi |
| Ephemeral Storage | 800Gi | 800Gi |
| NVIDIA GPUs | 8 | 8 |
| RDMA/RoCE GDR | 1 | 1 |
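Expressed as a container `resources` stanza, the table above would look roughly as follows. The quantities are taken from the table. The extended resource name `rdma/roce_gdr` is an assumption about how the "RDMA/RoCE GDR" row is advertised by the cluster's device plugin, so check the sample file and your nodes for the exact name.

```yaml
# Sketch of a per-container resources stanza matching the table above.
resources:
  requests:
    cpu: "64"
    memory: 256Gi
    ephemeral-storage: 800Gi
    nvidia.com/gpu: "8"
    rdma/roce_gdr: "1"        # assumed resource name for the RDMA/RoCE GDR device
  limits:
    cpu: "128"
    memory: 512Gi
    ephemeral-storage: 800Gi
    nvidia.com/gpu: "8"       # extended resources need requests equal to limits
    rdma/roce_gdr: "1"
```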
Key Environment Variables
| Variable | Value | Description |
|---|---|---|
| VLLM_ALL2ALL_BACKEND | deepep_high_throughput | DeepEP high-throughput all-to-all backend for MoE dispatch |
| VLLM_ADDITIONAL_ARGS | --gpu-memory-utilization 0.99 ... | vLLM arguments including NixlConnector KV cache transfer |
| NCCL_IB_GID_INDEX | 3 | InfiniBand GID index for RoCE v2 |
| NVSHMEM_REMOTE_TRANSPORT | ibgda | GPU-direct async RDMA transport |
| UCX_TLS | rc,sm,self,cuda_copy,cuda_ipc | UCX transport layers |
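In the container spec, the variables above appear as entries in an `env` list. The fragment below is a minimal sketch that uses only the names and values from the table. `VLLM_ADDITIONAL_ARGS` is left out because its full value is elided in this page.

```yaml
# Sketch of the env entries for the key variables listed above.
env:
  - name: VLLM_ALL2ALL_BACKEND
    value: deepep_high_throughput
  - name: NCCL_IB_GID_INDEX
    value: "3"                # env values are strings, so the number is quoted
  - name: NVSHMEM_REMOTE_TRANSPORT
    value: ibgda
  - name: UCX_TLS
    value: rc,sm,self,cuda_copy,cuda_ipc
```

Because the prefill and decode sections each carry their own container spec, the same entries would have to be present in both.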
Network Configuration
| Component | Configuration |
|---|---|
| CNI Network | roce-p2 (via Multus annotation) |
| NCCL Socket Interface | net1 |
| NVSHMEM Bootstrap | Two-stage with 300s timeout on net1 |
| GPU Direct | NVIDIA GDRCOPY enabled |
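The interface name `net1` follows from the Multus annotation: by default Multus names the first additional network it attaches to a pod `net1`. The fragment below sketches how the annotation and the interface settings might fit together. `NCCL_SOCKET_IFNAME` and `NVSHMEM_BOOTSTRAP_UID_SOCK_IFNAME` are real NCCL and NVSHMEM variables, but mapping them onto the table rows is an assumption. The variables for the two-stage bootstrap, the 300s timeout, and GDRCOPY are omitted because their exact names are not shown on this page.

```yaml
# Sketch only: two separate fragments, not a complete manifest.
metadata:
  annotations:
    k8s.v1.cni.cncf.io/networks: roce-p2   # Multus attaches roce-p2 as a secondary interface
---
env:
  - name: NCCL_SOCKET_IFNAME
    value: net1                            # first Multus-attached interface
  - name: NVSHMEM_BOOTSTRAP_UID_SOCK_IFNAME
    value: net1                            # assumed mapping for "NVSHMEM Bootstrap ... on net1"
```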
Usage Examples
# Prerequisites:
# 1. PVC with DeepSeek-R1-0528 model weights
# 2. RoCE networking configured (Multus + SR-IOV)
# 3. KServe LLMInferenceService CRDs installed
# Deploy the service
kubectl apply -f docs/samples/llmisvc/dp-ep/deepseek-r1-gpu-rdma-roce/llm-inference-service-dp-ep-deepseek-r1-pd-gpu-p-deepep-ht-d-deepep-ht.yaml
# Check status
kubectl get llmisvc deepseek-r1-0528-pd