Implementation: KServe LLM Inference Scheduler
| Knowledge Sources | |
|---|---|
| Domains | Scheduling, LLM_Serving, Traffic_Management |
| Last Updated | 2026-02-13 00:00 GMT |
Overview
Concrete deployment configuration for the llm-d-inference-scheduler, wired in as the InferencePool endpoint picker with a weighted scorer plugin chain.
Description
The scheduler is deployed as a Kubernetes Service with the ghcr.io/llm-d/llm-d-inference-scheduler image. It serves as an endpoint picker for the Gateway Inference Extension's InferencePool. The configuration defines scorer plugins with weights, health probes, and the connection to vLLM metrics.
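The scorer-plugin chain can be illustrated with a small sketch: each scorer assigns every candidate endpoint a score in [0, 1], the scores are combined as a weighted sum, and a max-score picker selects the winner. The endpoint metrics and scorer functions below are invented for illustration; only the weighted-sum plus argmax structure mirrors the scheduler's plugin chain.

```python
# Hypothetical sketch of a weighted scorer chain with a max-score picker.
# Endpoint metrics and scorer functions are invented for illustration.

endpoints = {
    "pod-a": {"queue_depth": 4, "kv_cache_usage": 0.90},
    "pod-b": {"queue_depth": 1, "kv_cache_usage": 0.40},
}

def queue_scorer(ep):
    # Fewer queued requests -> higher score (normalized against a cap of 10).
    return 1.0 - min(ep["queue_depth"], 10) / 10

def kv_cache_scorer(ep):
    # Lower KV cache utilization -> higher score.
    return 1.0 - ep["kv_cache_usage"]

scorers = [(queue_scorer, 2), (kv_cache_scorer, 2)]  # (plugin, weight)

def pick(endpoints):
    totals = {
        name: sum(w * fn(ep) for fn, w in scorers)
        for name, ep in endpoints.items()
    }
    # max-score picker: choose the endpoint with the highest combined score
    return max(totals, key=totals.get)

print(pick(endpoints))  # pod-b: shorter queue and more free KV cache
```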
Usage
The scheduler is automatically deployed by the LLMIsvc controller based on the config-llm-scheduler.yaml template. Customization is done by modifying the LLMInferenceServiceConfig.
Code Reference
Source Location
- Repository: kserve
- File: config/llmisvcconfig/config-llm-scheduler.yaml, Lines 1-107
- File: docs/samples/llmisvc/precise-prefix-kv-cache-routing/llm-inference-service-qwen2-7b-gpu-kv-cache-routing.yaml, Lines 1-82
Signature
```yaml
# InferencePool endpoint picker configuration
apiVersion: inference.networking.x-k8s.io/v1alpha1
kind: InferencePool
spec:
  endpointPickerRef:
    kind: Service
    port: 9002
    failureMode: FailOpen

# Scheduler pod (container spec excerpt)
image: ghcr.io/llm-d/llm-d-inference-scheduler:v0.4.0
ports:
  - containerPort: 9002  # gRPC endpoint picker
  - containerPort: 9003  # Health check
  - containerPort: 9090  # Prometheus metrics
  - containerPort: 5557  # ZMQ KV cache events
args:
  - --kv-cache-usage-percentage-metric=vllm:kv_cache_usage_perc
```
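`failureMode: FailOpen` means that when the endpoint picker is unavailable, the gateway still forwards the request using its default load balancing instead of failing it. A rough sketch of the fail-open versus fail-closed decision (the picker stub and fallback name below are hypothetical):

```python
# Hypothetical sketch of failureMode semantics: with FailOpen the gateway
# falls back to default load balancing when the endpoint picker errors out;
# with FailClose it surfaces the failure instead.

def route(pick_endpoint, failure_mode="FailOpen", default="default-backend"):
    try:
        return pick_endpoint()
    except Exception:
        if failure_mode == "FailOpen":
            return default  # keep serving traffic without the picker
        raise               # FailClose: reject the request

def broken_picker():
    raise RuntimeError("endpoint picker unreachable")

print(route(broken_picker))  # default-backend
```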
Import
```yaml
# External dependency — deployed automatically by LLMIsvc controller
# Image: ghcr.io/llm-d/llm-d-inference-scheduler:v0.4.0
```
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| gRPC request | ExtProc | Yes | Envoy external processor request with inference metadata |
| vLLM metrics | Prometheus | Yes | KV cache usage, queue depth per endpoint |
| ZMQ events | ZMQ | No | KV cache event stream from vLLM pods |
Outputs
| Name | Type | Description |
|---|---|---|
| Selected endpoint | gRPC response | Endpoint address for request routing |
| Metrics | Prometheus | Scheduler decision metrics on port 9090 |
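The scheduler scrapes each vLLM endpoint's Prometheus endpoint for signals such as KV cache usage; the metric name is configured through the `--kv-cache-usage-percentage-metric` flag shown above. A minimal sketch of pulling that gauge out of the Prometheus text exposition format (the sample payload and label are invented):

```python
# Sketch: extract a gauge from Prometheus text exposition format.
# The sample payload is invented; the metric name matches the
# --kv-cache-usage-percentage-metric flag.

SAMPLE = """\
# HELP vllm:kv_cache_usage_perc KV cache usage
# TYPE vllm:kv_cache_usage_perc gauge
vllm:kv_cache_usage_perc{model_name="qwen2-7b"} 0.42
"""

def read_gauge(text, name):
    for line in text.splitlines():
        # Skip HELP/TYPE comment lines; match the sample line by metric name.
        if not line.startswith("#") and line.startswith(name):
            return float(line.rsplit(" ", 1)[1])
    return None

print(read_gauge(SAMPLE, "vllm:kv_cache_usage_perc"))  # 0.42
```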
Usage Examples
Custom Scorer Weights
```yaml
# In LLMInferenceService with custom routing
spec:
  router:
    scheduler:
      scorerPlugins:
        - name: queue-scorer
          weight: 2
        - name: kv-cache-utilization-scorer
          weight: 2
        - name: prefix-cache-scorer
          weight: 3
      pickerPlugin:
        name: max-score-picker
```
```shell
# Check scheduler logs
kubectl logs -l app.kubernetes.io/component=llminferenceservice-scheduler

# Check scheduler metrics
kubectl port-forward svc/scheduler 9090:9090
curl http://localhost:9090/metrics | grep "endpoint_picker"
```
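The prefix-cache-scorer, given the highest weight (3) in the example above, favors endpoints that already hold the longest matching prefix of the incoming prompt in KV cache, so repeated prompts land on warm pods. A toy illustration of that idea (not the actual llm-d implementation, which learns cache state from vLLM's KV cache event stream):

```python
# Toy prefix-cache scoring: score each endpoint by the fraction of the
# prompt it already has cached. The cached-prefix bookkeeping here is
# invented; the real scheduler tracks cached blocks via ZMQ KV cache events.

def common_prefix_len(a, b):
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def prefix_cache_score(prompt_tokens, cached_tokens):
    if not prompt_tokens:
        return 0.0
    return common_prefix_len(prompt_tokens, cached_tokens) / len(prompt_tokens)

prompt = ["You", "are", "a", "helpful", "assistant", ".", "Hello"]
caches = {
    "pod-a": ["You", "are", "a", "helpful", "assistant", "."],  # warm prefix
    "pod-b": ["Summarize", "this"],                             # cold
}
scores = {name: prefix_cache_score(prompt, c) for name, c in caches.items()}
print(max(scores, key=scores.get))  # pod-a
```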