Implementation: KServe PD Pool Templates
| Knowledge Sources | |
|---|---|
| Domains | Performance_Tuning, LLM_Serving, GPU_Computing |
| Last Updated | 2026-02-13 00:00 GMT |
Overview
Concrete ConfigMap-based pod templates for configuring prefill and decode pool containers, probes, volumes, and sidecars.
Description
The LLMInferenceServiceConfig templates define the pod specs for prefill and decode pools:
- Prefill template (`config-llm-prefill-template.yaml`): a single vLLM container on port 8000, a liveness probe with a 120 s initial delay, and a `/dev/shm` emptyDir for CUDA IPC.
- Decode template (`config-llm-decode-template.yaml`): vLLM on port 8001 (internal) plus a routing sidecar on port 8000. The sidecar (`llm-d-routing-sidecar:v0.4.0`) uses the native sidecar pattern (`restartPolicy: Always`) and NixlConnector v2.
Usage
Modify these templates in the LLMInferenceServiceConfig ConfigMaps to customize pod specifications. Changes apply only to pods created in the respective pools after the edit; running pods are not updated in place.
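As a concrete sketch, a customized pod-spec fragment inside the prefill template might look like the following. The GPU resource limit is illustrative, not a shipped default; the container name, command, and `dshm` volume mount come from the template described above.

```yaml
# Hypothetical customization of the prefill template's pod spec.
# The nvidia.com/gpu value is an example, not a documented default.
containers:
- name: main
  command: [vllm, serve, /mnt/models]
  resources:
    limits:
      nvidia.com/gpu: "1"   # one GPU per prefill replica (example value)
  volumeMounts:
  - name: dshm
    mountPath: /dev/shm     # tmpfs backing vLLM's CUDA IPC buffers
```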
Code Reference
Source Location
- Repository: kserve
- File: config/llmisvcconfig/config-llm-prefill-template.yaml, Lines 1-89
- File: config/llmisvcconfig/config-llm-decode-template.yaml, Lines 1-146
Signature
Prefill Template
```yaml
# config-llm-prefill-template.yaml (key fields)
containers:
- name: main
  command: [vllm, serve, /mnt/models]
  ports:
  - containerPort: 8000
  livenessProbe:
    httpGet: {path: /health, port: 8000}
    initialDelaySeconds: 120
  readinessProbe:
    httpGet: {path: /health, port: 8000}
    initialDelaySeconds: 10
    failureThreshold: 60
volumes:
- name: dshm
  emptyDir:
    medium: Memory
    sizeLimit: 1Gi
```
Decode Template
```yaml
# config-llm-decode-template.yaml (key fields)
containers:
- name: main
  command: [vllm, serve, /mnt/models]
  ports:
  - containerPort: 8001   # Internal vLLM port
initContainers:
- name: routing-sidecar
  image: ghcr.io/llm-d/llm-d-routing-sidecar:v0.4.0
  restartPolicy: Always   # Native sidecar pattern
  args: ["--port=8000", "--vllm-port=8001", "--connector=nixlv2"]
  ports:
  - containerPort: 8000   # Public-facing port
```
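The native sidecar pattern used above is a Kubernetes 1.28+ feature: an init container with `restartPolicy: Always` starts before the main container, keeps running alongside it for the pod's lifetime, and is terminated after the main container exits. A minimal standalone sketch, with pod and main-container image names chosen for illustration:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: sidecar-demo                 # illustrative name
spec:
  initContainers:
  - name: proxy
    image: ghcr.io/llm-d/llm-d-routing-sidecar:v0.4.0
    restartPolicy: Always            # promotes this init container to a native sidecar
  containers:
  - name: main
    image: vllm/vllm-openai:latest   # illustrative image
```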
Import
```shell
# Templates are deployed via kustomize
kubectl apply -k config/llmisvcconfig/
```
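For orientation, a `kustomization.yaml` in that directory would reference the two template files by name. This is a sketch inferred from the file paths listed above, not the repository's actual kustomization contents:

```yaml
# Hypothetical config/llmisvcconfig/kustomization.yaml (illustrative)
resources:
- config-llm-prefill-template.yaml
- config-llm-decode-template.yaml
```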
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| Prefill template | ConfigMap | Yes | Pod spec for prefill pool containers |
| Decode template | ConfigMap | Yes | Pod spec for decode pool with routing sidecar |
| spec.replicas | int | Yes | Decode pool replica count |
| spec.prefill.replicas | int | Yes | Prefill pool replica count |
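The two replica fields sit directly on the LLMInferenceService resource. A minimal sketch, assuming the `serving.kserve.io/v1alpha1` API version (the resource name is taken from the scaling example below; verify the API version against your install):

```yaml
apiVersion: serving.kserve.io/v1alpha1   # assumed API version
kind: LLMInferenceService
metadata:
  name: qwen2-7b-pd
spec:
  replicas: 2        # decode pool size
  prefill:
    replicas: 1      # prefill pool size
```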
Outputs
| Name | Type | Description |
|---|---|---|
| Prefill pods | Pods | vLLM direct serving on port 8000 |
| Decode pods | Pods | vLLM on 8001 + routing sidecar on 8000 |
| /dev/shm | tmpfs | Shared memory for CUDA IPC (1Gi) |
Usage Examples
Scale Pools Independently
```shell
# Scale decode pool to 4 replicas
kubectl patch llminferenceservice qwen2-7b-pd \
  --type merge -p '{"spec":{"replicas":4}}'

# Scale prefill pool to 2 replicas
kubectl patch llminferenceservice qwen2-7b-pd \
  --type merge -p '{"spec":{"prefill":{"replicas":2}}}'

# Verify pool sizes
kubectl get pods -l app.kubernetes.io/component=llminferenceservice-workload
kubectl get pods -l app.kubernetes.io/component=llminferenceservice-prefill
```