Principle: KServe PD Scheduler Routing
| Knowledge Sources | |
|---|---|
| Domains | Scheduling, LLM_Serving, Traffic_Management |
| Last Updated | 2026-02-13 00:00 GMT |
Overview
An intelligent request scheduling pattern that routes inference requests to optimal GPU endpoints based on KV cache utilization, prefix cache hits, and queue depth.
Description
The PD Scheduler is an endpoint picker that sits between the Envoy Gateway and the model serving pods. It uses a plugin-based scoring system to select the best endpoint for each request:
- queue-scorer: Penalizes endpoints with long request queues.
- kv-cache-utilization-scorer: Penalizes endpoints with high KV cache memory usage.
- prefix-cache-scorer: Rewards endpoints that already have the request's prefix in cache.
- max-score-picker: Selects the endpoint with the highest total score.
For disaggregated prefill-decode serving, a PD profile handler routes new requests to prefill pods and continuations (requests that already have KV state) to decode pods.
Usage
The scheduler is deployed automatically by the LLMIsvc controller when spec.router.scheduler is configured. Tune scorer weights to optimize for latency (increase the prefix-cache weight) or for throughput (increase the queue-balancing weight).
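As a sketch of weight tuning, a configuration fragment might look like the following. The spec.router.scheduler path comes from the text above, but the field names under it (scorers, weight, picker) and the apiVersion are assumptions for illustration, not the CRD's actual schema:

```yaml
# Hypothetical fragment -- fields under "scheduler" are illustrative
# assumptions, not the real LLMIsvc schema.
apiVersion: serving.kserve.io/v1alpha1
kind: LLMInferenceService
metadata:
  name: llama-pd
spec:
  router:
    scheduler:
      scorers:
        - name: queue-scorer                  # penalize deep queues
          weight: 2
        - name: kv-cache-utilization-scorer   # penalize full KV caches
          weight: 2
        - name: prefix-cache-scorer           # favor prefix hits (latency)
          weight: 3
      picker: max-score-picker
```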
Theoretical Basis
# Scheduler scoring model (NOT implementation code)
For each request:
    For each healthy endpoint:
        score = 0
        score += queue_scorer.Score(endpoint) * weight_queue            # default weight: 2
        score += kv_cache_scorer.Score(endpoint) * weight_kv_cache      # default weight: 2
        score += prefix_scorer.Score(endpoint, request) * weight_prefix # default weight: 3
    selected = max_score_picker.Pick(scores)
    route request → selected endpoint
PD profile handler:
    New request (no KV state) → route to prefill pool
    Continuation (has KV state) → route to decode pool
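The PD profile handler above reduces to a single branch on whether the request already carries KV state. A minimal sketch, assuming a hypothetical `Request` type and pool names; the real handler's types and signatures are not shown in the source:

```go
package main

import "fmt"

// Request carries the state the PD profile handler inspects
// (illustrative type, not KServe's actual request struct).
type Request struct {
	ID    string
	HasKV bool // continuations already hold KV-cache state
}

// routePool mirrors the PD profile handler's decision: fresh prompts go
// to the prefill pool, continuations with KV state to the decode pool.
func routePool(r Request) string {
	if r.HasKV {
		return "decode"
	}
	return "prefill"
}

func main() {
	fmt.Println(routePool(Request{ID: "r1", HasKV: false})) // prefill
	fmt.Println(routePool(Request{ID: "r2", HasKV: true}))  // decode
}
```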