Implementation: KServe PD LLMInferenceService Spec
| Knowledge Sources | |
|---|---|
| Domains | LLM_Serving, Distributed_Systems, GPU_Computing |
| Last Updated | 2026-02-13 00:00 GMT |
Overview
Concrete YAML specification for deploying LLMInferenceService with separate prefill and decode pools using NixlConnector for KV cache transfer.
Description
The PD specification extends the standard LLMInferenceService with a spec.prefill section that defines the prefill pool; the top-level spec fields continue to define the decode pool. Both pools run vLLM, with KV-transfer configuration passed through the VLLM_ADDITIONAL_ARGS environment variable, which selects the NixlConnector. RDMA networking is enabled through Multus annotations and rdma/roce_gdr resource requests.
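Note the quoting involved: in the manifest, VLLM_ADDITIONAL_ARGS is a single-quoted YAML scalar in which a doubled single quote (`''`) escapes a literal one. A stdlib-only sketch (not vLLM's or any YAML parser's actual code) of how the resolved value splits into a vLLM flag plus its JSON payload:

```python
import json
import shlex

# Scalar body as it appears between the outer single quotes in the
# manifest; in YAML single-quoted scalars, '' means a literal '.
raw = """--kv_transfer_config ''{"kv_connector":"NixlConnector","kv_role":"kv_both"}''"""
resolved = raw.replace("''", "'")  # YAML single-quote unescaping

# vLLM receives the resolved string as extra CLI arguments:
# shell-style splitting yields the flag name and its JSON argument.
flag, payload = shlex.split(resolved)
config = json.loads(payload)
print(flag, config["kv_connector"], config["kv_role"])
# --kv_transfer_config NixlConnector kv_both
```

Getting this triple layer of quoting wrong (YAML, shell splitting, JSON) is a common source of startup failures, so it is worth checking the resolved string before deploying.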
Usage
Use this when deploying disaggregated prefill-decode serving. Requires RDMA networking (see RDMA_Network_Configuration) and model weights accessible to both pools.
Code Reference
Source Location
- Repository: kserve
- File: docs/samples/llmisvc/single-node-gpu/llm-inference-service-pd-qwen2-7b-gpu.yaml, Lines 1-125
- File: pkg/apis/serving/v1alpha1/llm_inference_service_types.go, Lines 77-103 (Prefill field)
Signature
```yaml
apiVersion: serving.kserve.io/v1alpha1
kind: LLMInferenceService
metadata:
  name: qwen2-7b-pd
spec:
  model:
    uri: "hf://Qwen/Qwen2.5-7B-Instruct"
  replicas: 1  # Decode pool
  prefill:
    replicas: 2  # Prefill pool
    template:
      spec:
        containers:
          - name: main
            env:
              - name: VLLM_ADDITIONAL_ARGS
                value: '--kv_transfer_config ''{"kv_connector":"NixlConnector","kv_role":"kv_both"}'''
              - name: UCX_TLS
                value: "rc,sm,self,cuda_copy,cuda_ipc"
              - name: KSERVE_INFER_ROCE
                value: "true"
            resources:
              limits:
                nvidia.com/gpu: "1"
                rdma/roce_gdr: 1
                memory: 32Gi
```
Import
```shell
kubectl apply -f llm-inference-service-pd-qwen2-7b-gpu.yaml
```
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| spec.prefill.replicas | *int32 | Yes | Number of prefill pool replicas |
| spec.replicas | *int32 | Yes | Number of decode pool replicas |
| VLLM_ADDITIONAL_ARGS | env | Yes | NixlConnector KV transfer config |
| UCX_TLS | env | Yes | UCX transport layer selection |
| rdma/roce_gdr | resource | Yes (RDMA) | RDMA network interface |
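A hypothetical pre-flight helper (not part of KServe; the function and its error messages are illustrative, only the field names come from the spec above) that checks the required inputs against a manifest already loaded as a dict:

```python
# Required env vars from the Inputs table above.
REQUIRED_ENV = {"VLLM_ADDITIONAL_ARGS", "UCX_TLS"}

def check_pd_spec(manifest):
    """Return a list of problems with the required PD fields (empty = OK)."""
    problems = []
    spec = manifest.get("spec", {})
    if spec.get("replicas") is None:
        problems.append("spec.replicas (decode pool) is missing")
    if spec.get("prefill", {}).get("replicas") is None:
        problems.append("spec.prefill.replicas is missing")
    containers = (spec.get("prefill", {}).get("template", {})
                  .get("spec", {}).get("containers", []))
    env_names = {e["name"] for c in containers for e in c.get("env", [])}
    for name in sorted(REQUIRED_ENV - env_names):
        problems.append(f"env var {name} is missing")
    return problems

# Minimal manifest mirroring the Signature above (values abbreviated).
minimal = {
    "spec": {
        "replicas": 1,
        "prefill": {
            "replicas": 2,
            "template": {"spec": {"containers": [{
                "name": "main",
                "env": [{"name": "VLLM_ADDITIONAL_ARGS", "value": "..."},
                        {"name": "UCX_TLS", "value": "rc,sm,self"}],
            }]}},
        },
    }
}
print(check_pd_spec(minimal))  # [] -> no problems
```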
Outputs
| Name | Type | Description |
|---|---|---|
| Prefill pods | Pods | vLLM instances for prompt processing |
| Decode pods | Pods | vLLM instances for token generation |
| KV transfer | NixlConnector | KV cache transferred via RDMA between pools |
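As a conceptual illustration of the outputs above (toy Python only, no relation to vLLM's internals): the prefill pods process the full prompt and build the KV cache, the transfer step stands in for the NixlConnector RDMA hop, and the decode pods generate tokens against the received cache.

```python
# Toy model of disaggregated prefill/decode. Real KV caches are GPU
# tensors moved over RDMA by NixlConnector; here they are plain lists.

def prefill(prompt_tokens):
    """Process the whole prompt once, producing a per-token KV cache."""
    return [f"kv({t})" for t in prompt_tokens]

def transfer(kv_cache):
    """Stand-in for the NixlConnector transfer between pools."""
    return list(kv_cache)  # decode gets its own copy of the cache

def decode(kv_cache, max_new_tokens):
    """Generate tokens one at a time, extending the received cache."""
    out = []
    for i in range(max_new_tokens):
        tok = f"tok{i}"          # placeholder for real sampling
        kv_cache.append(f"kv({tok})")
        out.append(tok)
    return out

cache = transfer(prefill(["Hello", "world"]))
generated = decode(cache, max_new_tokens=3)
print(generated)  # ['tok0', 'tok1', 'tok2']
```

The point of the split is that prefill is compute-bound over the whole prompt while decode is memory-bound per token, so the two pools can be sized independently (here 2 prefill replicas against 1 decode replica).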
Usage Examples
Deploy PD Service
```shell
# 1. Ensure RDMA networking is configured
kubectl get sriovnetworknodepolicy
# 2. Deploy PD LLMInferenceService
kubectl apply -f llm-inference-service-pd-qwen2-7b-gpu.yaml
# 3. Monitor both pools
kubectl get pods -l app.kubernetes.io/component=llminferenceservice-workload
kubectl get pods -l app.kubernetes.io/component=llminferenceservice-prefill
# 4. Check KV transfer in logs
kubectl logs <prefill-pod> -c main | grep "NixlConnector"
kubectl logs <decode-pod> -c main | grep "NixlConnector"
```