Implementation:Kserve Kserve LLM Scheduler Config
| Knowledge Sources | |
|---|---|
| Domains | Kubernetes, LLM Serving |
| Last Updated | 2026-02-13 00:00 GMT |
Overview
Concrete LLMInferenceServiceConfig for the LLM inference scheduler and InferencePool endpoint picker provided by the KServe project.
Description
This file defines the configuration template for the LLM inference scheduler (the endpoint picker proxy, EPP) and the InferencePool that routes requests to model-serving pods based on KV-cache utilization metrics. It specifies an LLMInferenceServiceConfig whose router.scheduler section has two parts: an InferencePool spec with the endpoint picker configuration (gRPC on port 9002, failureMode FailOpen), and a scheduler deployment template running llm-d-inference-scheduler:v0.4.0. The scheduler reads the vllm:kv_cache_usage_perc metric so that it can balance load across model-serving replicas according to their KV-cache usage. This implements the routing principle described in Kserve_Kserve_PD_Scheduler_Routing.
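The routing idea can be illustrated with a minimal Python sketch. It is not the llm-d-inference-scheduler's actual algorithm: the function names `least_kv_cache_usage` and `route`, the endpoint addresses, and the usage values are all hypothetical. It assumes that each endpoint reports a KV-cache usage fraction, that the picker prefers the lowest value, and that FailOpen means a request is still forwarded to some pool member when the picker cannot answer.

```python
from typing import Callable, Dict, List


def least_kv_cache_usage(usage: Dict[str, float]) -> str:
    """Return the endpoint with the lowest reported KV-cache usage (hypothetical helper)."""
    if not usage:
        raise RuntimeError("no KV-cache metrics available")
    return min(usage, key=lambda endpoint: usage[endpoint])


def route(endpoints: List[str],
          picker: Callable[[], str],
          failure_mode: str = "FailOpen") -> str:
    """Ask the picker for an endpoint; on picker failure, apply the failure mode.

    Assumption: FailOpen falls back to any pool member, while any other
    mode (for example FailClose) propagates the error so the request fails.
    """
    try:
        return picker()
    except Exception:
        if failure_mode == "FailOpen" and endpoints:
            return endpoints[0]  # fallback: any member of the pool
        raise


# Illustrative usage values (fraction of KV cache in use per endpoint).
usage = {"10.0.0.1:8000": 0.82, "10.0.0.2:8000": 0.31, "10.0.0.3:8000": 0.57}
chosen = route(list(usage), lambda: least_kv_cache_usage(usage))
print(chosen)
```

With FailOpen, a picker outage degrades routing quality but does not block traffic; a stricter mode would trade availability for guaranteed scheduler-driven placement.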
Usage
Apply this configuration as part of the LLM serving setup. The LLMInferenceService controller uses this template to create the inference scheduler and InferencePool resources that handle intelligent request routing. The scheduler integrates with the Gateway API InferencePool mechanism for KV-cache-aware request distribution.
Code Reference
Source Location
- Repository: Kserve_Kserve
- File: config/llmisvcconfig/config-llm-scheduler.yaml
- Lines: 1-106
Signature
```yaml
apiVersion: serving.kserve.io/v1alpha2
kind: LLMInferenceServiceConfig
metadata:
  name: kserve-config-llm-scheduler
spec:
  router:
    scheduler:
      pool:
        spec:
          endpointPickerRef:
            failureMode: FailOpen
            kind: Service
            name: |-
              {{ ChildName .ObjectMeta.Name `-epp-service` }}
            port:
              number: 9002
          selector:
            matchLabels:
              app.kubernetes.io/name: |-
                {{ .ObjectMeta.Name }}
              app.kubernetes.io/part-of: llminferenceservice
              kserve.io/component: workload
          targetPorts:
            - number: 8000
      template:
        containers:
          - name: main
            image: ghcr.io/llm-d/llm-d-inference-scheduler:v0.4.0
            ports:
              - containerPort: 9002
                name: grpc
              - containerPort: 9003
                name: grpc-health
              - containerPort: 9090
                name: metrics
              - containerPort: 5557
                name: zmq
            args:
              - --pool-name
              - "{{ ChildName .ObjectMeta.Name `-inference-pool` }}"
              - --pool-namespace
              - "{{ .ObjectMeta.Namespace }}"
              - --kv-cache-usage-percentage-metric
              - "vllm:kv_cache_usage_perc"
```
Import
kubectl apply -f config/llmisvcconfig/config-llm-scheduler.yaml
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| .ObjectMeta.Name | Go template variable | Yes | Used to derive InferencePool and EPP service names |
| .ObjectMeta.Namespace | Go template variable | Yes | Namespace for the inference pool |
| vllm:kv_cache_usage_perc | Prometheus metric | Yes | KV-cache utilization metric from vLLM pods used for load balancing |
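To make the metric input concrete, the following Python sketch extracts `vllm:kv_cache_usage_perc` from Prometheus text exposition output. It is only an illustration: the helper name `kv_cache_usage`, the sample payload, and the `model_name` label value are assumptions, and the simple regular expression does not handle label values that contain a closing brace.

```python
import re
from typing import Optional

METRIC = "vllm:kv_cache_usage_perc"

# Matches "<name>{optional labels} <value>" sample lines in the text format.
_SAMPLE = re.compile(
    r"^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{[^}]*\})?\s+(?P<value>\S+)"
)


def kv_cache_usage(exposition: str, metric: str = METRIC) -> Optional[float]:
    """Return the first sample value for `metric`, or None if it is absent."""
    for line in exposition.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = _SAMPLE.match(line)
        if match and match.group("name") == metric:
            return float(match.group("value"))
    return None


# Illustrative payload; real pods expose many more metrics.
sample = """# TYPE vllm:kv_cache_usage_perc gauge
vllm:kv_cache_usage_perc{model_name="example-model"} 0.42
vllm:num_requests_running{model_name="example-model"} 3.0
"""

usage = kv_cache_usage(sample)
print(usage)
```

If the metric is missing from a pod's output, the helper returns None, which a caller could treat as "no KV-cache signal for this endpoint".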
Outputs
| Name | Type | Description |
|---|---|---|
| LLMInferenceServiceConfig | Custom Resource | Scheduler and InferencePool template consumed by the LLMInferenceService controller |
| InferencePool | Gateway API resource | Pool of model-serving endpoints with label-based selector |
| Scheduler Deployment | Deployment | Runs the llm-d-inference-scheduler for KV-cache-aware request routing |
| gRPC endpoint | TCP port 9002 | Endpoint picker service for the Gateway API |
| Metrics endpoint | TCP port 9090 | Prometheus metrics from the scheduler |
Usage Examples
Apply the scheduler config
kubectl apply -f config/llmisvcconfig/config-llm-scheduler.yaml
Verify the config is present
kubectl get llminferenceserviceconfig kserve-config-llm-scheduler