Implementation:KServe PD Pool Templates

From Leeroopedia
Knowledge Sources
Domains: Performance_Tuning, LLM_Serving, GPU_Computing
Last Updated: 2026-02-13 00:00 GMT

Overview

Concrete ConfigMap-based pod templates for configuring prefill and decode pool containers, probes, volumes, and sidecars.

Description

The LLMInferenceServiceConfig templates define the pod specs for prefill and decode pools:

  • Prefill template (config-llm-prefill-template.yaml): Single vLLM container on port 8000, liveness probe with 120s initial delay, /dev/shm emptyDir for CUDA IPC.
  • Decode template (config-llm-decode-template.yaml): vLLM on port 8001 (internal) plus routing sidecar on port 8000. The sidecar (llm-d-routing-sidecar:v0.4.0) uses native sidecar pattern (restartPolicy: Always) and NixlConnector v2.

Usage

Modify these templates in the LLMInferenceServiceConfig ConfigMaps to customize pod specifications. Changes apply to all new pods in the respective pools.
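For example, here is a hedged sketch of raising the /dev/shm size for larger models. The ConfigMap name, namespace, and data key are assumptions for illustration; verify them against your installation before applying.

```yaml
# Sketch only: override the prefill template's shared-memory volume.
# ConfigMap name, namespace, and data key are assumed, not confirmed here.
apiVersion: v1
kind: ConfigMap
metadata:
  name: config-llm-prefill-template   # assumed name, mirrors the file name
  namespace: kserve                   # assumed namespace
data:
  template: |                         # assumed data key
    containers:
      - name: main
        command: ["vllm", "serve", "/mnt/models"]
    volumes:
      - name: dshm
        emptyDir:
          medium: Memory
          sizeLimit: 4Gi              # raised from the 1Gi default
```

Because the templates only feed newly created pods, existing pods keep the old spec until they are recreated.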

Code Reference

Source Location

  • Repository: kserve
  • File: config/llmisvcconfig/config-llm-prefill-template.yaml, Lines 1-89
  • File: config/llmisvcconfig/config-llm-decode-template.yaml, Lines 1-146

Signature

Prefill Template

# config-llm-prefill-template.yaml (key fields)
containers:
  - name: main
    command: [vllm, serve, /mnt/models]
    ports:
      - containerPort: 8000
    livenessProbe:
      httpGet: {path: /health, port: 8000}
      initialDelaySeconds: 120
    readinessProbe:
      httpGet: {path: /health, port: 8000}
      initialDelaySeconds: 10
      failureThreshold: 60
volumes:
  - name: dshm
    emptyDir:
      medium: Memory
      sizeLimit: 1Gi

Decode Template

# config-llm-decode-template.yaml (key fields)
containers:
  - name: main
    command: [vllm, serve, /mnt/models]
    ports:
      - containerPort: 8001   # Internal vLLM port
initContainers:
  - name: routing-sidecar
    image: ghcr.io/llm-d/llm-d-routing-sidecar:v0.4.0
    restartPolicy: Always     # Native sidecar pattern
    args: ["--port=8000", "--vllm-port=8001", "--connector=nixlv2"]
    ports:
      - containerPort: 8000   # Public-facing port

Import

# Templates are deployed via kustomize
kubectl apply -k config/llmisvcconfig/

I/O Contract

Inputs

Name                   Type       Required  Description
Prefill template       ConfigMap  Yes       Pod spec for prefill pool containers
Decode template        ConfigMap  Yes       Pod spec for decode pool with routing sidecar
spec.replicas          int        Yes       Decode pool replica count
spec.prefill.replicas  int        Yes       Prefill pool replica count
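The two replica fields live on the LLMInferenceService spec itself. A hedged sketch, assuming the serving.kserve.io/v1alpha1 API version (an assumption, check your CRD) and an illustrative resource name:

```yaml
# Illustrative only: setting both pool sizes on one resource.
apiVersion: serving.kserve.io/v1alpha1   # assumed API version
kind: LLMInferenceService
metadata:
  name: qwen2-7b-pd
spec:
  replicas: 4          # decode pool replica count
  prefill:
    replicas: 2        # prefill pool replica count
```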

Outputs

Name          Type   Description
Prefill pods  Pods   vLLM direct serving on port 8000
Decode pods   Pods   vLLM on 8001 + routing sidecar on 8000
/dev/shm      tmpfs  Shared memory for CUDA IPC (1Gi)

Usage Examples

Scale Pools Independently

# Scale decode pool to 4 replicas
kubectl patch llminferenceservice qwen2-7b-pd \
  --type merge -p '{"spec":{"replicas":4}}'

# Scale prefill pool to 2 replicas
kubectl patch llminferenceservice qwen2-7b-pd \
  --type merge -p '{"spec":{"prefill":{"replicas":2}}}'

# Verify pool sizes
kubectl get pods -l app.kubernetes.io/component=llminferenceservice-workload
kubectl get pods -l app.kubernetes.io/component=llminferenceservice-prefill
