Principle: KServe PD Scheduler Routing
| Knowledge Sources | |
|---|---|
| Domains | Scheduling, LLM_Serving, Traffic_Management |
| Last Updated | 2026-02-13 00:00 GMT |
Overview
An intelligent request scheduling pattern that routes inference requests to optimal GPU endpoints based on KV cache utilization, prefix cache hits, and queue depth.
Description
The PD Scheduler is an endpoint picker that sits between the Envoy Gateway and the model serving pods. It uses a plugin-based scoring system to select the best endpoint for each request:
- queue-scorer: Penalizes endpoints with long request queues.
- kv-cache-utilization-scorer: Penalizes endpoints with high KV cache memory usage.
- prefix-cache-scorer: Rewards endpoints that already have the request's prefix in cache.
- max-score-picker: Selects the endpoint with the highest total score.
For disaggregated prefill-decode serving, a PD profile handler routes new requests to prefill pods and continuations (requests that already have KV state) to decode pods.
Usage
The scheduler is deployed automatically by the LLMIsvc controller when spec.router.scheduler is configured. Tune scorer weights to optimize for latency (increase the prefix-cache weight) or for throughput (increase the queue-balancing weight).
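As a sketch of weight tuning, a configuration fragment might look like the following. The spec.router.scheduler path comes from the text above, but the field names under it (scorers, weight, picker) and the apiVersion are assumptions for illustration, not the CRD's actual schema:

```yaml
# Hypothetical fragment -- fields under "scheduler" are illustrative
# assumptions, not the real LLMIsvc schema.
apiVersion: serving.kserve.io/v1alpha1
kind: LLMInferenceService
metadata:
  name: llama-pd
spec:
  router:
    scheduler:
      scorers:
        - name: queue-scorer                  # penalize deep queues
          weight: 2
        - name: kv-cache-utilization-scorer   # penalize full KV caches
          weight: 2
        - name: prefix-cache-scorer           # favor prefix hits (latency)
          weight: 3
      picker: max-score-picker
```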
Theoretical Basis
# Scheduler scoring model (NOT implementation code)
For each request:
    For each healthy endpoint:
        score = 0
        score += queue_scorer.Score(endpoint) * weight_queue            # default weight: 2
        score += kv_cache_scorer.Score(endpoint) * weight_kv_cache      # default weight: 2
        score += prefix_scorer.Score(endpoint, request) * weight_prefix # default weight: 3
    selected = max_score_picker.Pick(scores)
    route request → selected endpoint
PD profile handler:
    New request (no KV state) → route to prefill pool
    Continuation (has KV state) → route to decode pool
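The PD profile handler above reduces to a single branch on whether the request already carries KV state. A minimal sketch, assuming a hypothetical `Request` type and pool names; the real handler's types and signatures are not shown in the source:

```go
package main

import "fmt"

// Request carries the state the PD profile handler inspects
// (illustrative type, not KServe's actual request struct).
type Request struct {
	ID    string
	HasKV bool // continuations already hold KV-cache state
}

// routePool mirrors the PD profile handler's decision: fresh prompts go
// to the prefill pool, continuations with KV state to the decode pool.
func routePool(r Request) string {
	if r.HasKV {
		return "decode"
	}
	return "prefill"
}

func main() {
	fmt.Println(routePool(Request{ID: "r1", HasKV: false})) // prefill
	fmt.Println(routePool(Request{ID: "r2", HasKV: true}))  // decode
}
```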