
Implementation:Kserve PD LLMInferenceService Spec

From Leeroopedia
Domains: LLM_Serving, Distributed_Systems, GPU_Computing
Last Updated: 2026-02-13 00:00 GMT

Overview

Concrete YAML specification for deploying LLMInferenceService with separate prefill and decode pools using NixlConnector for KV cache transfer.

Description

The PD specification extends the standard LLMInferenceService with a spec.prefill section that defines the prefill pool; the top-level spec defines the decode pool. Both pools run vLLM, with KV transfer configured through the VLLM_ADDITIONAL_ARGS environment variable, which selects the NixlConnector. RDMA networking is enabled through Multus annotations and rdma/roce_gdr resource requests.
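The VLLM_ADDITIONAL_ARGS value embeds a JSON document inside a shell-style single-quoted argument, which is why the YAML in the Signature below doubles the inner quotes. A minimal Python sketch (not part of KServe; the field names are taken from the sample manifest on this page) shows how that value is assembled and why it round-trips as valid JSON:

```python
import json

# KV transfer settings that vLLM receives as a JSON string
# (field names copied from this page's sample manifest).
kv_transfer_config = {
    "kv_connector": "NixlConnector",  # NIXL-based KV cache transfer
    "kv_role": "kv_both",             # pod may both send and receive KV blocks
}

# The value passed via the VLLM_ADDITIONAL_ARGS environment variable:
# the JSON payload is wrapped in single quotes for the shell.
vllm_additional_args = (
    "--kv_transfer_config "
    f"'{json.dumps(kv_transfer_config, separators=(',', ':'))}'"
)
print(vllm_additional_args)
```

Note that in the YAML manifest the whole value is itself a single-quoted scalar, so each inner `'` is escaped as `''`.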

Usage

Use this when deploying disaggregated prefill-decode serving. Requires RDMA networking (see RDMA_Network_Configuration) and model weights accessible to both pools.

Code Reference

Source Location

  • Repository: kserve
  • File: docs/samples/llmisvc/single-node-gpu/llm-inference-service-pd-qwen2-7b-gpu.yaml, Lines 1-125
  • File: pkg/apis/serving/v1alpha1/llm_inference_service_types.go, Lines 77-103 (Prefill field)

Signature

apiVersion: serving.kserve.io/v1alpha1
kind: LLMInferenceService
metadata:
  name: qwen2-7b-pd
spec:
  model:
    uri: "hf://Qwen/Qwen2.5-7B-Instruct"
  replicas: 1          # Decode pool
  prefill:
    replicas: 2        # Prefill pool
  template:
    spec:
      containers:
        - name: main
          env:
            - name: VLLM_ADDITIONAL_ARGS
              value: '--kv_transfer_config ''{"kv_connector":"NixlConnector","kv_role":"kv_both"}'''
            - name: UCX_TLS
              value: "rc,sm,self,cuda_copy,cuda_ipc"
            - name: KSERVE_INFER_ROCE
              value: "true"
          resources:
            limits:
              nvidia.com/gpu: "1"
              rdma/roce_gdr: 1
              memory: 32Gi
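The manifest above can be sanity-checked programmatically. The sketch below (a plain-dict illustration, not an official KServe client) mirrors the spec structure so that pool sizing and per-replica device counts can be verified before applying:

```python
# Assemble the PD manifest shown above as a plain dict
# (illustrative only; mirrors the sample YAML on this page).
manifest = {
    "apiVersion": "serving.kserve.io/v1alpha1",
    "kind": "LLMInferenceService",
    "metadata": {"name": "qwen2-7b-pd"},
    "spec": {
        "model": {"uri": "hf://Qwen/Qwen2.5-7B-Instruct"},
        "replicas": 1,               # decode pool size
        "prefill": {"replicas": 2},  # prefill pool size
    },
}

decode = manifest["spec"]["replicas"]
prefill = manifest["spec"]["prefill"]["replicas"]
# Each replica requests 1 GPU and 1 rdma/roce_gdr device, so the
# cluster must provide decode + prefill of each.
total_gpus = decode + prefill
print(f"decode={decode} prefill={prefill} total_gpus={total_gpus}")
```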

Import

kubectl apply -f llm-inference-service-pd-qwen2-7b-gpu.yaml

I/O Contract

Inputs

  • spec.prefill.replicas (*int32, required): number of prefill pool replicas
  • spec.replicas (*int32, required): number of decode pool replicas
  • VLLM_ADDITIONAL_ARGS (env, required): NixlConnector KV transfer config
  • UCX_TLS (env, required): UCX transport layer selection
  • rdma/roce_gdr (resource, required for RDMA): RDMA network interface

Outputs

  • Prefill pods (Pods): vLLM instances for prompt processing
  • Decode pods (Pods): vLLM instances for token generation
  • KV transfer (NixlConnector): KV cache transferred via RDMA between pools

Usage Examples

Deploy PD Service

# 1. Ensure RDMA networking is configured
kubectl get sriovnetworknodepolicy

# 2. Deploy PD LLMInferenceService
kubectl apply -f llm-inference-service-pd-qwen2-7b-gpu.yaml

# 3. Monitor both pools
kubectl get pods -l app.kubernetes.io/component=llminferenceservice-workload
kubectl get pods -l app.kubernetes.io/component=llminferenceservice-prefill

# 4. Check KV transfer in logs
kubectl logs <prefill-pod> -c main | grep "NixlConnector"
kubectl logs <decode-pod> -c main | grep "NixlConnector"
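Step 4 above amounts to scanning each pool's logs for the configured connector name. A small Python sketch of that check, run against a hypothetical log excerpt (real vLLM startup lines will differ in format; only the presence of "NixlConnector" is assumed):

```python
# Hypothetical vLLM startup log excerpt; real line formats will differ.
sample_logs = """\
INFO  Initializing KV transfer: connector=NixlConnector role=kv_both
INFO  NixlConnector registered RDMA memory regions
INFO  Server started on port 8000
"""

# Equivalent of `kubectl logs <pod> -c main | grep "NixlConnector"`.
matches = [line for line in sample_logs.splitlines()
           if "NixlConnector" in line]
print(f"{len(matches)} NixlConnector line(s) found")
```

If no such lines appear in either pool, the kv_transfer_config likely did not reach vLLM's command line.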

Related Pages

  • Implements Principle
  • Requires Environment
  • Uses Heuristic
