Overview
This file defines an LLMInferenceService for DeepSeek-R1-0528 with prefill-decode separation using the DeepEP high-throughput backend for prefill and the Perplexity (pplx) backend for decode, with RDMA networking.
Description
This sample YAML deploys a v1alpha1 LLMInferenceService with a hybrid prefill-decode architecture that uses a different all-to-all backend for each phase. The main template (decode) sets VLLM_ALL2ALL_BACKEND=pplx (Perplexity backend), while the prefill section sets VLLM_ALL2ALL_BACKEND=deepep_high_throughput (DeepEP high-throughput backend), so each phase can be tuned with the expert parallelism implementation that suits it best. Both pools share the same parallelism configuration (data: 16, dataLocal: 8, expert: true, tensor: 1) and the same RDMA/RoCE networking.
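The parallelism values above imply a pool size that is easy to work out by hand. The sketch below does the arithmetic under two assumptions that the sample itself does not state: that each pod hosts `dataLocal` data-parallel ranks, and that each rank uses `tensor` GPUs. The variable names are illustrative only.

```shell
# Values copied from the parallelism blocks of the sample (same for both pools).
data=16        # spec.parallelism.data
data_local=8   # spec.parallelism.dataLocal
tensor=1       # spec.parallelism.tensor

# Assumption: pods per pool = data / dataLocal, GPUs per rank = tensor.
pods_per_pool=$(( data / data_local ))
gpus_per_pod=$(( data_local * tensor ))
gpus_per_pool=$(( data * tensor ))

echo "pods_per_pool=$pods_per_pool gpus_per_pod=$gpus_per_pod gpus_per_pool=$gpus_per_pool"
```

Under these assumptions each pod needs 8 GPUs, which matches the 8 GPUs requested per container in the resource table.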
Usage
Use this sample as a reference for deploying a hybrid PD architecture that optimizes the prefill and decode phases independently with different expert parallelism backends. This pattern is useful when the Perplexity backend offers better decode throughput characteristics while DeepEP is preferred for prefill workloads. The sample requires GPU nodes with RoCE networking and pre-populated model weights.
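Before applying the sample, it can help to confirm that the objects it refers to exist in the target namespace. The following is a minimal sketch, not part of the sample: `check_prereqs` is a hypothetical helper, the PVC and network names are taken from the YAML, and the CRD name `llminferenceservices.serving.kserve.io` and the `network-attachment-definitions` resource (from Multus) are assumptions about the cluster setup.

```shell
# Report any prerequisite that kubectl cannot find; return non-zero if one is missing.
check_prereqs() {
  missing=0
  # PVC referenced by spec.model.uri (pvc://llm-test-pvc-deepseek)
  kubectl get pvc llm-test-pvc-deepseek >/dev/null 2>&1 \
    || { echo "missing: PVC llm-test-pvc-deepseek"; missing=1; }
  # Secondary network named in the k8s.v1.cni.cncf.io/networks annotation
  kubectl get network-attachment-definitions roce-p2 >/dev/null 2>&1 \
    || { echo "missing: network attachment roce-p2"; missing=1; }
  # CRD for the LLMInferenceService kind (name assumed)
  kubectl get crd llminferenceservices.serving.kserve.io >/dev/null 2>&1 \
    || { echo "missing: LLMInferenceService CRD"; missing=1; }
  return "$missing"
}
```

Run `check_prereqs` in the namespace where the service will be deployed; it prints nothing when every lookup succeeds.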
Code Reference
Signature
apiVersion: serving.kserve.io/v1alpha1
kind: LLMInferenceService
metadata:
  name: deepseek-r1-0528-pd
  annotations:
    k8s.v1.cni.cncf.io/networks: roce-p2
spec:
  model:
    uri: pvc://llm-test-pvc-deepseek
    name: deepseek-ai/DeepSeek-R1-0528
  replicas: 1
  parallelism:
    data: 16
    dataLocal: 8
    expert: true
    tensor: 1
  router:
    scheduler: {}
    route: {}
    gateway: {}
  template:
    serviceAccountName: hfsa
    containers:
      - name: main
        env:
          - name: VLLM_ALL2ALL_BACKEND
            value: pplx # Perplexity backend for decode
          - name: VLLM_ADDITIONAL_ARGS
            value: "--gpu-memory-utilization 0.99 --max-model-len 4096 ..."
  worker:
    # ... worker template (pplx backend)
  prefill:
    replicas: 1
    parallelism:
      data: 16
      dataLocal: 8
      expert: true
      tensor: 1
    template:
      containers:
        - name: main
          env:
            - name: VLLM_ALL2ALL_BACKEND
              value: deepep_high_throughput # DeepEP backend for prefill
Import
kubectl apply -f docs/samples/llmisvc/dp-ep/deepseek-r1-gpu-rdma-roce/llm-inference-service-dp-ep-deepseek-r1-pd-gpu-p-deepep-ht-d-pplx.yaml
I/O Contract
Model Configuration
| Field | Value | Description |
|---|---|---|
| spec.model.uri | pvc://llm-test-pvc-deepseek | PVC containing model weights |
| spec.model.name | deepseek-ai/DeepSeek-R1-0528 | HuggingFace model identifier |
Hybrid Backend Configuration
| Pool | All-to-All Backend | Rationale |
|---|---|---|
| Decode (main + worker) | pplx (Perplexity) | Optimized for token generation throughput |
| Prefill | deepep_high_throughput (DeepEP) | Optimized for batch prompt processing |
Parallelism Settings
| Pool | Data | DataLocal | Expert | Tensor |
|---|---|---|---|---|
| Decode (main) | 16 | 8 | true | 1 |
| Prefill | 16 | 8 | true | 1 |
GPU Resource Requirements (per container)
| Resource | Requests | Limits |
|---|---|---|
| CPU | 64 | 128 |
| Memory | 256Gi | 512Gi |
| Ephemeral Storage | 800Gi | 800Gi |
| NVIDIA GPUs | 8 | 8 |
| RDMA/RoCE GDR | 1 | 1 |
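Because each container requests 8 GPUs, only nodes that expose at least 8 allocatable GPUs can host a pod from either pool. The sketch below counts such nodes. `count_nodes_with_gpus` is a hypothetical helper, and it assumes the GPUs are advertised under the extended resource name `nvidia.com/gpu`, which this page does not spell out.

```shell
# Count nodes whose allocatable GPU count is at least the given threshold.
count_nodes_with_gpus() {
  want="$1"
  kubectl get nodes --no-headers \
    -o custom-columns='NODE:.metadata.name,GPUS:.status.allocatable.nvidia\.com/gpu' \
    | awk -v want="$want" '$2 ~ /^[0-9]+$/ && $2 >= want { n++ } END { print n + 0 }'
}
```

Calling `count_nodes_with_gpus 8` prints the number of candidate nodes. Nodes without GPUs show `<none>` in the second column and are skipped by the numeric check.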
Key Differences from DeepEP-HT Only Sample
| Aspect | This Sample (Hybrid) | DeepEP-HT Only Sample |
|---|---|---|
| Decode Backend | pplx | deepep_high_throughput |
| Prefill Backend | deepep_high_throughput | deepep_high_throughput |
| Use Case | Workloads where decode benefits from Perplexity optimizations | Uniform backend for both phases |
Usage Examples
# Prerequisites:
# 1. PVC with DeepSeek-R1-0528 model weights
# 2. RoCE networking configured (Multus + SR-IOV)
# 3. KServe LLMInferenceService CRDs installed
# Deploy the hybrid PD service
kubectl apply -f docs/samples/llmisvc/dp-ep/deepseek-r1-gpu-rdma-roce/llm-inference-service-dp-ep-deepseek-r1-pd-gpu-p-deepep-ht-d-pplx.yaml
# Check status
kubectl get llmisvc deepseek-r1-0528-pd
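Once the resource is applied, one way to confirm that each pool carries the intended backend is to read the environment variable back from the stored spec. This is a minimal sketch: `show_backends` is a hypothetical helper, and the JSONPath expressions assume that `main` is the first entry in each `containers` list, as in the signature shown earlier on this page.

```shell
# Print the all-to-all backend configured for the decode and prefill pools.
show_backends() {
  name="$1"
  decode=$(kubectl get llmisvc "$name" \
    -o jsonpath='{.spec.template.containers[0].env[?(@.name=="VLLM_ALL2ALL_BACKEND")].value}')
  prefill=$(kubectl get llmisvc "$name" \
    -o jsonpath='{.spec.prefill.template.containers[0].env[?(@.name=="VLLM_ALL2ALL_BACKEND")].value}')
  echo "decode=$decode prefill=$prefill"
}
```

For this sample, `show_backends deepseek-r1-0528-pd` should report `pplx` for decode and `deepep_high_throughput` for prefill if the spec was stored as written.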