
Implementation:Kserve PD LLMInferenceService Spec

From Leeroopedia
Domains: LLM_Serving, Distributed_Systems, GPU_Computing
Last Updated: 2026-02-13 00:00 GMT

Overview

Concrete YAML specification for deploying LLMInferenceService with separate prefill and decode pools using NixlConnector for KV cache transfer.

Description

The PD specification extends the standard LLMInferenceService with a spec.prefill section that defines the prefill pool; the top-level spec defines the decode pool. Both pools run vLLM, with KV transfer configured through the VLLM_ADDITIONAL_ARGS environment variable, which selects the NixlConnector. RDMA networking is enabled through Multus annotations and rdma/roce_gdr resource requests.
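The VLLM_ADDITIONAL_ARGS value embeds a JSON document inside a shell-style single-quoted argument, which is why the YAML in the Signature below doubles the inner quotes. A minimal Python sketch (not part of KServe; the field names are taken from the sample manifest on this page) shows how that value is assembled and why it round-trips as valid JSON:

```python
import json

# KV transfer settings that vLLM receives as a JSON string
# (field names copied from this page's sample manifest).
kv_transfer_config = {
    "kv_connector": "NixlConnector",  # NIXL-based KV cache transfer
    "kv_role": "kv_both",             # pod may both send and receive KV blocks
}

# The value passed via the VLLM_ADDITIONAL_ARGS environment variable:
# the JSON payload is wrapped in single quotes for the shell.
vllm_additional_args = (
    "--kv_transfer_config "
    f"'{json.dumps(kv_transfer_config, separators=(',', ':'))}'"
)
print(vllm_additional_args)
```

Note that in the YAML manifest the whole value is itself a single-quoted scalar, so each inner `'` is escaped as `''`.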

Usage

Use this when deploying disaggregated prefill-decode serving. Requires RDMA networking (see RDMA_Network_Configuration) and model weights accessible to both pools.

Code Reference

Source Location

  • Repository: kserve
  • File: docs/samples/llmisvc/single-node-gpu/llm-inference-service-pd-qwen2-7b-gpu.yaml, Lines 1-125
  • File: pkg/apis/serving/v1alpha1/llm_inference_service_types.go, Lines 77-103 (Prefill field)

Signature

apiVersion: serving.kserve.io/v1alpha1
kind: LLMInferenceService
metadata:
  name: qwen2-7b-pd
spec:
  model:
    uri: "hf://Qwen/Qwen2.5-7B-Instruct"
  replicas: 1          # Decode pool
  prefill:
    replicas: 2        # Prefill pool
  template:
    spec:
      containers:
        - name: main
          env:
            - name: VLLM_ADDITIONAL_ARGS
              value: '--kv_transfer_config ''{"kv_connector":"NixlConnector","kv_role":"kv_both"}'''
            - name: UCX_TLS
              value: "rc,sm,self,cuda_copy,cuda_ipc"
            - name: KSERVE_INFER_ROCE
              value: "true"
          resources:
            limits:
              nvidia.com/gpu: "1"
              rdma/roce_gdr: 1
              memory: 32Gi
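The manifest above can be sanity-checked programmatically. The sketch below (a plain-dict illustration, not an official KServe client) mirrors the spec structure so that pool sizing and per-replica device counts can be verified before applying:

```python
# Assemble the PD manifest shown above as a plain dict
# (illustrative only; mirrors the sample YAML on this page).
manifest = {
    "apiVersion": "serving.kserve.io/v1alpha1",
    "kind": "LLMInferenceService",
    "metadata": {"name": "qwen2-7b-pd"},
    "spec": {
        "model": {"uri": "hf://Qwen/Qwen2.5-7B-Instruct"},
        "replicas": 1,               # decode pool size
        "prefill": {"replicas": 2},  # prefill pool size
    },
}

decode = manifest["spec"]["replicas"]
prefill = manifest["spec"]["prefill"]["replicas"]
# Each replica requests 1 GPU and 1 rdma/roce_gdr device, so the
# cluster must provide decode + prefill of each.
total_gpus = decode + prefill
print(f"decode={decode} prefill={prefill} total_gpus={total_gpus}")
```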

Import

kubectl apply -f llm-inference-service-pd-qwen2-7b-gpu.yaml

I/O Contract

Inputs

  • spec.prefill.replicas (*int32, required): number of prefill pool replicas
  • spec.replicas (*int32, required): number of decode pool replicas
  • VLLM_ADDITIONAL_ARGS (env, required): NixlConnector KV transfer config
  • UCX_TLS (env, required): UCX transport layer selection
  • rdma/roce_gdr (resource, required for RDMA): RDMA network interface

Outputs

  • Prefill pods (Pods): vLLM instances for prompt processing
  • Decode pods (Pods): vLLM instances for token generation
  • KV transfer (NixlConnector): KV cache transferred via RDMA between pools

Usage Examples

Deploy PD Service

# 1. Ensure RDMA networking is configured
kubectl get sriovnetworknodepolicy

# 2. Deploy PD LLMInferenceService
kubectl apply -f llm-inference-service-pd-qwen2-7b-gpu.yaml

# 3. Monitor both pools
kubectl get pods -l app.kubernetes.io/component=llminferenceservice-workload
kubectl get pods -l app.kubernetes.io/component=llminferenceservice-prefill

# 4. Check KV transfer in logs
kubectl logs <prefill-pod> -c main | grep "NixlConnector"
kubectl logs <decode-pod> -c main | grep "NixlConnector"
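Step 4 above amounts to scanning each pool's logs for the configured connector name. A small Python sketch of that check, run against a hypothetical log excerpt (real vLLM startup lines will differ in format; only the presence of "NixlConnector" is assumed):

```python
# Hypothetical vLLM startup log excerpt; real line formats will differ.
sample_logs = """\
INFO  Initializing KV transfer: connector=NixlConnector role=kv_both
INFO  NixlConnector registered RDMA memory regions
INFO  Server started on port 8000
"""

# Equivalent of `kubectl logs <pod> -c main | grep "NixlConnector"`.
matches = [line for line in sample_logs.splitlines()
           if "NixlConnector" in line]
print(f"{len(matches)} NixlConnector line(s) found")
```

If no such lines appear in either pool, the kv_transfer_config likely did not reach vLLM's command line.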

Related Pages

  • Implements Principle
  • Requires Environment
  • Uses Heuristic
