Implementation:Kserve Kserve DeepSeek R1 PD DeepEP Pplx Sample

Knowledge Sources	Kserve_Kserve
Domains	Kubernetes, LLM Inference, Expert Parallelism, RDMA
Last Updated	2026-02-13 00:00 GMT

Overview

This file defines an LLMInferenceService for DeepSeek-R1-0528 with prefill-decode separation using the DeepEP high-throughput backend for prefill and the Perplexity (pplx) backend for decode, with RDMA networking.

Description

This sample YAML deploys a v1alpha1 LLMInferenceService with a hybrid prefill-decode architecture that uses a different all-to-all backend for each phase. The main template (decode) sets VLLM_ALL2ALL_BACKEND=pplx (Perplexity backend), while the prefill section sets VLLM_ALL2ALL_BACKEND=deepep_high_throughput (DeepEP high-throughput backend), so each phase runs on the expert-parallelism implementation chosen for it. Both pools use the same parallelism configuration (data: 16, dataLocal: 8, expert: true, tensor: 1) and the same RDMA/RoCE networking.
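The parallelism values translate into pod and GPU counts with simple arithmetic. The sketch below assumes that each pod hosts `dataLocal` data-parallel ranks, that each rank uses `tensor` GPUs, and that a pool therefore spans `data / dataLocal` pods. These assumptions are consistent with the 8 GPUs requested per container, but the sample itself does not state them.

```shell
#!/usr/bin/env bash
# Values copied from spec.parallelism (identical for the decode and prefill pools).
data=16        # total data-parallel ranks per pool
data_local=8   # data-parallel ranks hosted by one pod
tensor=1       # tensor-parallel degree per rank

# Assumption: one pod per group of data_local ranks, tensor GPUs per rank.
pods_per_pool=$(( data / data_local ))
gpus_per_pod=$(( data_local * tensor ))
gpus_per_pool=$(( pods_per_pool * gpus_per_pod ))
total_gpus=$(( 2 * gpus_per_pool ))   # decode pool + prefill pool, replicas: 1 each

echo "pods per pool: ${pods_per_pool}"
echo "GPUs per pod:  ${gpus_per_pod}"
echo "GPUs per pool: ${gpus_per_pool}"
echo "GPUs in total: ${total_gpus}"
```

Under these assumptions the sample needs 2 pods and 16 GPUs per pool, or 32 GPUs across both pools.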

Usage

Use this sample as a reference for deploying a hybrid prefill-decode (PD) architecture that tunes the prefill and decode phases independently by assigning each a different expert-parallelism backend. The pattern is useful when the Perplexity backend gives better decode throughput while DeepEP is preferred for prefill workloads. The deployment requires GPU nodes with RoCE networking and pre-populated model weights.

Code Reference

Source Location

Repository: Kserve_Kserve
File: docs/samples/llmisvc/dp-ep/deepseek-r1-gpu-rdma-roce/llm-inference-service-dp-ep-deepseek-r1-pd-gpu-p-deepep-ht-d-pplx.yaml

Signature

apiVersion: serving.kserve.io/v1alpha1
kind: LLMInferenceService
metadata:
  name: deepseek-r1-0528-pd
  annotations:
    k8s.v1.cni.cncf.io/networks: roce-p2
spec:
  model:
    uri: pvc://llm-test-pvc-deepseek
    name: deepseek-ai/DeepSeek-R1-0528
  replicas: 1
  parallelism:
    data: 16
    dataLocal: 8
    expert: true
    tensor: 1
  router:
    scheduler: {}
    route: {}
    gateway: {}
  template:
    serviceAccountName: hfsa
    containers:
      - name: main
        env:
          - name: VLLM_ALL2ALL_BACKEND
            value: pplx                     # Perplexity backend for decode
          - name: VLLM_ADDITIONAL_ARGS
            value: "--gpu-memory-utilization 0.99 --max-model-len 4096 ..."
  worker:
    # ... worker template (pplx backend)
  prefill:
    replicas: 1
    parallelism:
      data: 16
      dataLocal: 8
      expert: true
      tensor: 1
    template:
      containers:
        - name: main
          env:
            - name: VLLM_ALL2ALL_BACKEND
              value: deepep_high_throughput  # DeepEP backend for prefill

Import

kubectl apply -f docs/samples/llmisvc/dp-ep/deepseek-r1-gpu-rdma-roce/llm-inference-service-dp-ep-deepseek-r1-pd-gpu-p-deepep-ht-d-pplx.yaml

I/O Contract

Model Configuration

Field	Value	Description
`spec.model.uri`	`pvc://llm-test-pvc-deepseek`	PVC containing model weights
`spec.model.name`	`deepseek-ai/DeepSeek-R1-0528`	HuggingFace model identifier

Hybrid Backend Configuration

Pool	All-to-All Backend	Rationale
Decode (main + worker)	`pplx` (Perplexity)	Optimized for token generation throughput
Prefill	`deepep_high_throughput` (DeepEP)	Optimized for batch prompt processing
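
One way to confirm which backend each pool was given is to read the environment variable back from the applied resource. The helper below is a hypothetical sketch, not part of the sample: the function name `all2all_backend` is invented here, and it assumes that the `main` container is the first entry in each `containers` list and that the resource lives in the current namespace.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the VLLM_ALL2ALL_BACKEND value stored under a
# given pod-template path of the applied LLMInferenceService.
# Assumption: the "main" container is the first entry in the containers list.
all2all_backend() {
  local template_path="$1"   # e.g. .spec.template or .spec.prefill.template
  kubectl get llmisvc deepseek-r1-0528-pd \
    -o jsonpath="{${template_path}.containers[0].env[?(@.name==\"VLLM_ALL2ALL_BACKEND\")].value}"
}
```

If the sample was applied unmodified, `all2all_backend .spec.template` should print `pplx` and `all2all_backend .spec.prefill.template` should print `deepep_high_throughput`.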

Parallelism Settings

Pool	Data	DataLocal	Expert	Tensor
Decode (main)	16	8	true	1
Prefill	16	8	true	1

Resource Requirements (per container)

Resource	Requests	Limits
CPU	64	128
Memory	256Gi	512Gi
Ephemeral Storage	800Gi	800Gi
NVIDIA GPUs	8	8
RDMA/RoCE GDR	1	1

Key Differences from DeepEP-HT Only Sample

Aspect	This Sample (Hybrid)	DeepEP-HT Only Sample
Decode Backend	`pplx`	`deepep_high_throughput`
Prefill Backend	`deepep_high_throughput`	`deepep_high_throughput`
Use Case	Workloads where decode benefits from Perplexity optimizations	Uniform backend for both phases

Usage Examples

# Prerequisites:
# 1. PVC with DeepSeek-R1-0528 model weights
# 2. RoCE networking configured (Multus + SR-IOV)
# 3. KServe LLMInferenceService CRDs installed

# Deploy the hybrid PD service
kubectl apply -f docs/samples/llmisvc/dp-ep/deepseek-r1-gpu-rdma-roce/llm-inference-service-dp-ep-deepseek-r1-pd-gpu-p-deepep-ht-d-pplx.yaml

# Check status
kubectl get llmisvc deepseek-r1-0528-pd
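
The status check above can be extended into a blocking wait. This is a minimal sketch under stated assumptions: the function name `wait_for_llmisvc` and the 30-minute default timeout are illustrative, and the sketch assumes that the LLMInferenceService reports a status condition of type `Ready`, which the sample does not confirm.

```shell
#!/usr/bin/env bash
# Hypothetical helper: block until the service reports readiness, then print
# the full resource (including status) for inspection.
# Assumption: the resource exposes a status condition of type "Ready".
wait_for_llmisvc() {
  local name="$1"
  local timeout="${2:-30m}"   # illustrative default, not taken from the sample
  kubectl wait --for=condition=Ready "llmisvc/${name}" --timeout="${timeout}" \
    && kubectl get llmisvc "${name}" -o yaml
}
```

For this sample the call would be `wait_for_llmisvc deepseek-r1-0528-pd`. If the wait times out, `kubectl describe llmisvc deepseek-r1-0528-pd` is a reasonable next step for reading conditions and events.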

Related Pages

Kserve_Kserve_DeepSeek_R1_PD_DeepEP_HT_Sample - Variant using DeepEP high-throughput for both prefill and decode
Kserve_Kserve_LLMInferenceService_Minimal_CRD - CRD definition for the LLMInferenceService resource
Kserve_Kserve_LLM_Decode_Worker_DP_Config - Base decode worker configuration template
Kserve_Kserve_LLM_Prefill_Worker_DP_Config - Base prefill worker configuration template
Kserve_Kserve_Gateway_Inference_Extension_CRDs - Gateway routing CRDs used by the router section
