Implementation:Kserve Kserve LLM Scheduler Config

Knowledge Sources	Kserve_Kserve KServe Docs
Domains	Kubernetes, LLM Serving
Last Updated	2026-02-13 00:00 GMT

Overview

A concrete LLMInferenceServiceConfig, provided by the KServe project, that configures the LLM inference scheduler and the InferencePool endpoint picker.

Description

This file defines the configuration template for the LLM inference scheduler (the endpoint picker proxy) and the InferencePool, which together route requests to model-serving pods based on KV-cache utilization metrics. It specifies an LLMInferenceServiceConfig whose router.scheduler section has two parts: an InferencePool spec with the endpoint picker configuration (a Service reference on gRPC port 9002 with failureMode FailOpen), and a scheduler deployment template running llm-d-inference-scheduler:v0.4.0. The scheduler is configured with vllm:kv_cache_usage_perc as its KV-cache usage metric and uses it to balance load across model-serving replicas. This implements the routing principle described in Kserve_Kserve_PD_Scheduler_Routing.
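
The idea of KV-cache-aware balancing can be illustrated with a minimal sketch. This is not the llm-d-inference-scheduler's actual algorithm, which this page does not describe; it only shows the simplest policy consistent with the description above: prefer the replica reporting the lowest vllm:kv_cache_usage_perc. The endpoint type, the pickLeastLoaded function, the pod addresses, and the assumption that the metric is a fraction between 0 and 1 are all hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// endpoint pairs a pod address with its most recently scraped value of
// vllm:kv_cache_usage_perc (assumed here to be a fraction in [0, 1]).
type endpoint struct {
	Addr         string
	KVCacheUsage float64
}

// pickLeastLoaded returns the endpoint with the lowest KV-cache usage.
// Ties go to the earliest entry so the result is deterministic.
func pickLeastLoaded(eps []endpoint) (endpoint, error) {
	if len(eps) == 0 {
		return endpoint{}, errors.New("no endpoints available")
	}
	best := eps[0]
	for _, ep := range eps[1:] {
		if ep.KVCacheUsage < best.KVCacheUsage {
			best = ep
		}
	}
	return best, nil
}

func main() {
	// Hypothetical replicas listening on the pool's target port 8000.
	eps := []endpoint{
		{Addr: "10.0.0.1:8000", KVCacheUsage: 0.82},
		{Addr: "10.0.0.2:8000", KVCacheUsage: 0.31},
		{Addr: "10.0.0.3:8000", KVCacheUsage: 0.57},
	}
	best, err := pickLeastLoaded(eps)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(best.Addr)
}
```

A production scheduler would typically combine several signals rather than a single metric; the sketch isolates the KV-cache term only.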

Usage

Apply this configuration as part of the LLM serving setup. The LLMInferenceService controller uses this template to create the inference scheduler and InferencePool resources that handle request routing. The scheduler integrates with the InferencePool mechanism of the Gateway API Inference Extension to distribute requests according to KV-cache usage.

Code Reference

Source Location

Repository: Kserve_Kserve
File: config/llmisvcconfig/config-llm-scheduler.yaml
Lines: 1-106

Signature

apiVersion: serving.kserve.io/v1alpha2
kind: LLMInferenceServiceConfig
metadata:
  name: kserve-config-llm-scheduler
spec:
  router:
    scheduler:
      pool:
        spec:
          endpointPickerRef:
            failureMode: FailOpen
            kind: Service
            name: |-
              {{ ChildName .ObjectMeta.Name `-epp-service` }}
            port:
              number: 9002
          selector:
            matchLabels:
              app.kubernetes.io/name: |-
                {{ .ObjectMeta.Name }}
              app.kubernetes.io/part-of: llminferenceservice
              kserve.io/component: workload
          targetPorts:
            - number: 8000
      template:
        containers:
          - name: main
            image: ghcr.io/llm-d/llm-d-inference-scheduler:v0.4.0
            ports:
              - containerPort: 9002
                name: grpc
              - containerPort: 9003
                name: grpc-health
              - containerPort: 9090
                name: metrics
              - containerPort: 5557
                name: zmq
            args:
              - --pool-name
              - "{{ ChildName .ObjectMeta.Name `-inference-pool` }}"
              - --pool-namespace
              - "{{ .ObjectMeta.Namespace }}"
              - --kv-cache-usage-percentage-metric
              - "vllm:kv_cache_usage_perc"
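
The template expressions in the manifest above can be tried out with Go's text/template package. The sketch below is only an illustration of how the placeholders might expand for a service named llama in namespace demo (both hypothetical). The childName function is a simplified stand-in for the controller's ChildName helper: it just concatenates its arguments, whereas the real helper may also truncate or hash long names.

```go
package main

import (
	"fmt"
	"os"
	"strings"
	"text/template"
)

// objectMeta and service mirror only the two fields the template reads.
type objectMeta struct {
	Name      string
	Namespace string
}

type service struct {
	ObjectMeta objectMeta
}

// childName is a hypothetical stand-in for the controller's ChildName helper.
// It only concatenates parent and suffix.
func childName(parent, suffix string) string {
	return parent + suffix
}

// render executes one template string against the given service metadata.
func render(text string, svc service) (string, error) {
	tmpl, err := template.New("cfg").
		Funcs(template.FuncMap{"ChildName": childName}).
		Parse(text)
	if err != nil {
		return "", err
	}
	var out strings.Builder
	if err := tmpl.Execute(&out, svc); err != nil {
		return "", err
	}
	return out.String(), nil
}

func main() {
	svc := service{ObjectMeta: objectMeta{Name: "llama", Namespace: "demo"}}
	for _, expr := range []string{
		"{{ ChildName .ObjectMeta.Name `-epp-service` }}",
		"{{ ChildName .ObjectMeta.Name `-inference-pool` }}",
		"{{ .ObjectMeta.Namespace }}",
	} {
		got, err := render(expr, svc)
		if err != nil {
			fmt.Fprintln(os.Stderr, err)
			os.Exit(1)
		}
		fmt.Println(got)
	}
}
```

Under these assumptions the three expressions expand to llama-epp-service, llama-inference-pool, and demo.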

Import

kubectl apply -f config/llmisvcconfig/config-llm-scheduler.yaml

I/O Contract

Inputs

Name	Type	Required	Description
.ObjectMeta.Name	Go template variable	Yes	Used to derive InferencePool and EPP service names
.ObjectMeta.Namespace	Go template variable	Yes	Namespace for the inference pool
vllm:kv_cache_usage_perc	Prometheus metric	Yes	KV-cache utilization metric from vLLM pods used for load balancing

Outputs

Name	Type	Description
LLMInferenceServiceConfig	Custom Resource	Scheduler and InferencePool template consumed by the LLMIsvc controller
InferencePool	Gateway API Inference Extension resource	Pool of model-serving endpoints with label-based selector
Scheduler Deployment	Deployment	Runs the llm-d-inference-scheduler for KV-cache-aware request routing
gRPC endpoint	TCP port 9002	Endpoint picker service for the Gateway API
Metrics endpoint	TCP port 9090	Prometheus metrics from the scheduler
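
The InferencePool's label-based selector follows the usual Kubernetes matchLabels rule: a pod belongs to the pool only if it carries every listed key/value pair, and any extra labels are ignored. The sketch below illustrates that rule with the three labels from the template. The matchesAll function, the service name llama, the extra label, and the scheduler value of kserve.io/component are hypothetical.

```go
package main

import "fmt"

// matchesAll reports whether labels contains every key/value pair in selector.
// Extra labels on the pod do not affect the result.
func matchesAll(selector, labels map[string]string) bool {
	for key, want := range selector {
		if got, ok := labels[key]; !ok || got != want {
			return false
		}
	}
	return true
}

func main() {
	selector := map[string]string{
		"app.kubernetes.io/name":    "llama", // rendered from .ObjectMeta.Name
		"app.kubernetes.io/part-of": "llminferenceservice",
		"kserve.io/component":       "workload",
	}
	// A model-serving pod: carries all three labels plus one extra.
	workloadPod := map[string]string{
		"app.kubernetes.io/name":    "llama",
		"app.kubernetes.io/part-of": "llminferenceservice",
		"kserve.io/component":       "workload",
		"pod-template-hash":         "abc123",
	}
	// A pod with a different component label (hypothetical value).
	otherPod := map[string]string{
		"app.kubernetes.io/name":    "llama",
		"app.kubernetes.io/part-of": "llminferenceservice",
		"kserve.io/component":       "scheduler",
	}
	fmt.Println(matchesAll(selector, workloadPod)) // true
	fmt.Println(matchesAll(selector, otherPod))    // false
}
```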

Usage Examples

Apply the scheduler config

kubectl apply -f config/llmisvcconfig/config-llm-scheduler.yaml

Verify the config is present

kubectl get llminferenceserviceconfig kserve-config-llm-scheduler
