Implementation:KServe PD Pool Templates

From Leeroopedia
Knowledge Sources
Domains: Performance_Tuning, LLM_Serving, GPU_Computing
Last Updated: 2026-02-13 00:00 GMT

Overview

Concrete ConfigMap-based pod templates for configuring prefill and decode pool containers, probes, volumes, and sidecars.

Description

The LLMInferenceServiceConfig templates define the pod specs for prefill and decode pools:

  • Prefill template (config-llm-prefill-template.yaml): Single vLLM container on port 8000, liveness probe with 120s initial delay, /dev/shm emptyDir for CUDA IPC.
  • Decode template (config-llm-decode-template.yaml): vLLM on port 8001 (internal) plus routing sidecar on port 8000. The sidecar (llm-d-routing-sidecar:v0.4.0) uses native sidecar pattern (restartPolicy: Always) and NixlConnector v2.

Usage

Modify these templates in the LLMInferenceServiceConfig ConfigMaps to customize pod specifications. Changes apply to all new pods in the respective pools.
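For example, here is a hedged sketch of raising the /dev/shm size for larger models. The ConfigMap name, namespace, and data key are assumptions for illustration; verify them against your installation before applying.

```yaml
# Sketch only: override the prefill template's shared-memory volume.
# ConfigMap name, namespace, and data key are assumed, not confirmed here.
apiVersion: v1
kind: ConfigMap
metadata:
  name: config-llm-prefill-template   # assumed name, mirrors the file name
  namespace: kserve                   # assumed namespace
data:
  template: |                         # assumed data key
    containers:
      - name: main
        command: ["vllm", "serve", "/mnt/models"]
    volumes:
      - name: dshm
        emptyDir:
          medium: Memory
          sizeLimit: 4Gi              # raised from the 1Gi default
```

Because the templates only feed newly created pods, existing pods keep the old spec until they are recreated.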

Code Reference

Source Location

  • Repository: kserve
  • File: config/llmisvcconfig/config-llm-prefill-template.yaml, Lines 1-89
  • File: config/llmisvcconfig/config-llm-decode-template.yaml, Lines 1-146

Signature

Prefill Template

# config-llm-prefill-template.yaml (key fields)
containers:
  - name: main
    command: [vllm, serve, /mnt/models]
    ports:
      - containerPort: 8000
    livenessProbe:
      httpGet: {path: /health, port: 8000}
      initialDelaySeconds: 120
    readinessProbe:
      httpGet: {path: /health, port: 8000}
      initialDelaySeconds: 10
      failureThreshold: 60
volumes:
  - name: dshm
    emptyDir:
      medium: Memory
      sizeLimit: 1Gi

Decode Template

# config-llm-decode-template.yaml (key fields)
containers:
  - name: main
    command: [vllm, serve, /mnt/models]
    ports:
      - containerPort: 8001   # Internal vLLM port
initContainers:
  - name: routing-sidecar
    image: ghcr.io/llm-d/llm-d-routing-sidecar:v0.4.0
    restartPolicy: Always     # Native sidecar pattern
    args: ["--port=8000", "--vllm-port=8001", "--connector=nixlv2"]
    ports:
      - containerPort: 8000   # Public-facing port

Import

# Templates are deployed via kustomize
kubectl apply -k config/llmisvcconfig/

I/O Contract

Inputs

Name                   Type       Required  Description
Prefill template       ConfigMap  Yes       Pod spec for prefill pool containers
Decode template        ConfigMap  Yes       Pod spec for decode pool with routing sidecar
spec.replicas          int        Yes       Decode pool replica count
spec.prefill.replicas  int        Yes       Prefill pool replica count
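The two replica fields live on the LLMInferenceService spec itself. A hedged sketch, assuming the serving.kserve.io/v1alpha1 API version (an assumption, check your CRD) and an illustrative resource name:

```yaml
# Illustrative only: setting both pool sizes on one resource.
apiVersion: serving.kserve.io/v1alpha1   # assumed API version
kind: LLMInferenceService
metadata:
  name: qwen2-7b-pd
spec:
  replicas: 4          # decode pool replica count
  prefill:
    replicas: 2        # prefill pool replica count
```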

Outputs

Name          Type   Description
Prefill pods  Pods   vLLM direct serving on port 8000
Decode pods   Pods   vLLM on 8001 + routing sidecar on 8000
/dev/shm      tmpfs  Shared memory for CUDA IPC (1Gi)

Usage Examples

Scale Pools Independently

# Scale decode pool to 4 replicas
kubectl patch llminferenceservice qwen2-7b-pd \
  --type merge -p '{"spec":{"replicas":4}}'

# Scale prefill pool to 2 replicas
kubectl patch llminferenceservice qwen2-7b-pd \
  --type merge -p '{"spec":{"prefill":{"replicas":2}}}'

# Verify pool sizes
kubectl get pods -l app.kubernetes.io/component=llminferenceservice-workload
kubectl get pods -l app.kubernetes.io/component=llminferenceservice-prefill
