
Implementation:KServe LLM Inference Scheduler

From Leeroopedia
Domains Scheduling, LLM_Serving, Traffic_Management
Last Updated 2026-02-13 00:00 GMT

Overview

Concrete scheduler deployment configuration using the llm-d-inference-scheduler with InferencePool endpoint picker and scorer plugin chain.

Description

The scheduler runs the ghcr.io/llm-d/llm-d-inference-scheduler image and is exposed through a Kubernetes Service, which the Gateway API Inference Extension's InferencePool references as its endpoint picker. The configuration defines the scorer plugin chain with per-plugin weights, health probes, and the connection to vLLM metrics.

Usage

The LLMIsvc controller deploys the scheduler automatically from the config-llm-scheduler.yaml template; to customize it, modify the corresponding LLMInferenceServiceConfig.

Code Reference

Source Location

  • Repository: kserve
  • File: config/llmisvcconfig/config-llm-scheduler.yaml, Lines 1-107
  • File: docs/samples/llmisvc/precise-prefix-kv-cache-routing/llm-inference-service-qwen2-7b-gpu-kv-cache-routing.yaml, Lines 1-82

Signature

# InferencePool endpoint picker configuration
apiVersion: inference.networking.x-k8s.io/v1alpha1
kind: InferencePool
spec:
  endpointPickerRef:
    kind: Service
    port: 9002
    failureMode: FailOpen

# Scheduler pod
image: ghcr.io/llm-d/llm-d-inference-scheduler:v0.4.0
ports:
  - containerPort: 9002   # gRPC endpoint picker
  - containerPort: 9003   # Health check
  - containerPort: 9090   # Prometheus metrics
  - containerPort: 5557   # ZMQ KV cache events
args:
  - --kv-cache-usage-percentage-metric=vllm:kv_cache_usage_perc

Import

# External dependency — deployed automatically by LLMIsvc controller
# Image: ghcr.io/llm-d/llm-d-inference-scheduler:v0.4.0

I/O Contract

Inputs

  • gRPC request (ExtProc, required): Envoy external-processor request carrying inference metadata
  • vLLM metrics (Prometheus, required): KV cache usage and queue depth per endpoint
  • ZMQ events (ZMQ, optional): KV cache event stream from vLLM pods

Outputs

  • Selected endpoint (gRPC response): endpoint address for request routing
  • Metrics (Prometheus): scheduler decision metrics on port 9090
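The scorer/picker contract above can be modeled as a weighted sum of per-endpoint scores followed by an argmax. A minimal illustrative sketch; the plugin names mirror the configuration (queue-scorer, kv-cache-utilization-scorer, prefix-cache-scorer), but the scoring functions and weights here are assumptions, not the scheduler's real logic:

```python
# Illustrative model of the endpoint-picker decision: each scorer returns a
# score in [0, 1] per endpoint, scores are combined by configured weights,
# and max-score-picker selects the highest-scoring endpoint.

def pick_endpoint(endpoints, scorers):
    """scorers: list of (weight, fn) where fn(endpoint) -> score in [0, 1]."""
    def total(ep):
        return sum(w * fn(ep) for w, fn in scorers)
    return max(endpoints, key=total)

endpoints = [
    {"addr": "10.0.0.1:8000", "queue": 4, "kv_usage": 0.9, "prefix_hit": 0.0},
    {"addr": "10.0.0.2:8000", "queue": 1, "kv_usage": 0.3, "prefix_hit": 0.8},
]

scorers = [
    (1.0, lambda ep: 1.0 / (1 + ep["queue"])),  # queue-scorer (assumed form)
    (1.0, lambda ep: 1.0 - ep["kv_usage"]),     # kv-cache-utilization-scorer
    (2.0, lambda ep: ep["prefix_hit"]),         # prefix-cache-scorer
]

print(pick_endpoint(endpoints, scorers)["addr"])  # → 10.0.0.2:8000
```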

Usage Examples

Custom Scorer Weights

# In LLMInferenceService with custom routing
spec:
  router:
    scheduler:
      scorerPlugins:
        - name: queue-scorer
          weight: 2
        - name: kv-cache-utilization-scorer
          weight: 2
        - name: prefix-cache-scorer
          weight: 3
      pickerPlugin:
        name: max-score-picker

Inspecting the Scheduler

# Check scheduler logs
kubectl logs -l app.kubernetes.io/component=llminferenceservice-scheduler

# Check scheduler metrics
kubectl port-forward svc/scheduler 9090:9090
curl http://localhost:9090/metrics | grep "endpoint_picker"
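The prefix-cache-scorer favors endpoints whose KV cache (tracked via the ZMQ event stream) already holds the longest run of leading prompt blocks. A toy sketch of that matching idea; the block size and cache representation are assumptions for illustration:

```python
# Illustrative sketch of prefix-aware scoring: the prompt is split into
# fixed-size token blocks, and an endpoint scores higher the longer the
# contiguous run of leading blocks already present in its cache.

BLOCK = 16  # tokens per KV-cache block (assumed)

def blocks(tokens):
    return [tuple(tokens[i:i + BLOCK]) for i in range(0, len(tokens), BLOCK)]

def prefix_score(prompt_tokens, cached_blocks):
    """Fraction of leading prompt blocks already cached on the endpoint."""
    bs = blocks(prompt_tokens)
    hit = 0
    for b in bs:
        if b in cached_blocks:
            hit += 1
        else:
            break  # only a contiguous leading run counts
    return hit / len(bs) if bs else 0.0

prompt = list(range(48))                # 3 blocks of 16 tokens
cache = set(blocks(list(range(32))))    # endpoint has the first 2 blocks
print(prefix_score(prompt, cache))      # ~0.667: 2 of 3 leading blocks cached
```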

Related Pages

Implements Principle

Requires Environment

Uses Heuristic
