
Implementation:KServe LLM Inference Scheduler

From Leeroopedia
Domains Scheduling, LLM_Serving, Traffic_Management
Last Updated 2026-02-13 00:00 GMT

Overview

Concrete scheduler deployment configuration using the llm-d-inference-scheduler with InferencePool endpoint picker and scorer plugin chain.

Description

The scheduler runs the ghcr.io/llm-d/llm-d-inference-scheduler image and is exposed through a Kubernetes Service, which the Gateway API Inference Extension's InferencePool references as its endpoint picker. The configuration defines the scorer plugin chain with per-plugin weights, health probes, and the connection to vLLM metrics.

Usage

The LLMIsvc controller deploys the scheduler automatically from the config-llm-scheduler.yaml template; to customize it, modify the corresponding LLMInferenceServiceConfig.

Code Reference

Source Location

  • Repository: kserve
  • File: config/llmisvcconfig/config-llm-scheduler.yaml, Lines 1-107
  • File: docs/samples/llmisvc/precise-prefix-kv-cache-routing/llm-inference-service-qwen2-7b-gpu-kv-cache-routing.yaml, Lines 1-82

Signature

# InferencePool endpoint picker configuration
apiVersion: inference.networking.x-k8s.io/v1alpha1
kind: InferencePool
spec:
  endpointPickerRef:
    kind: Service
    port: 9002
    failureMode: FailOpen

# Scheduler pod
image: ghcr.io/llm-d/llm-d-inference-scheduler:v0.4.0
ports:
  - containerPort: 9002   # gRPC endpoint picker
  - containerPort: 9003   # Health check
  - containerPort: 9090   # Prometheus metrics
  - containerPort: 5557   # ZMQ KV cache events
args:
  - --kv-cache-usage-percentage-metric=vllm:kv_cache_usage_perc

Import

# External dependency — deployed automatically by LLMIsvc controller
# Image: ghcr.io/llm-d/llm-d-inference-scheduler:v0.4.0

I/O Contract

Inputs

  • gRPC request (ExtProc, required): Envoy external-processor request carrying inference metadata
  • vLLM metrics (Prometheus, required): KV cache usage and queue depth per endpoint
  • ZMQ events (ZMQ, optional): KV cache event stream from vLLM pods

Outputs

  • Selected endpoint (gRPC response): endpoint address for request routing
  • Metrics (Prometheus): scheduler decision metrics on port 9090
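The scorer/picker contract above can be modeled as a weighted sum of per-endpoint scores followed by an argmax. A minimal illustrative sketch; the plugin names mirror the configuration (queue-scorer, kv-cache-utilization-scorer, prefix-cache-scorer), but the scoring functions and weights here are assumptions, not the scheduler's real logic:

```python
# Illustrative model of the endpoint-picker decision: each scorer returns a
# score in [0, 1] per endpoint, scores are combined by configured weights,
# and max-score-picker selects the highest-scoring endpoint.

def pick_endpoint(endpoints, scorers):
    """scorers: list of (weight, fn) where fn(endpoint) -> score in [0, 1]."""
    def total(ep):
        return sum(w * fn(ep) for w, fn in scorers)
    return max(endpoints, key=total)

endpoints = [
    {"addr": "10.0.0.1:8000", "queue": 4, "kv_usage": 0.9, "prefix_hit": 0.0},
    {"addr": "10.0.0.2:8000", "queue": 1, "kv_usage": 0.3, "prefix_hit": 0.8},
]

scorers = [
    (1.0, lambda ep: 1.0 / (1 + ep["queue"])),  # queue-scorer (assumed form)
    (1.0, lambda ep: 1.0 - ep["kv_usage"]),     # kv-cache-utilization-scorer
    (2.0, lambda ep: ep["prefix_hit"]),         # prefix-cache-scorer
]

print(pick_endpoint(endpoints, scorers)["addr"])  # → 10.0.0.2:8000
```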

Usage Examples

Custom Scorer Weights

# In LLMInferenceService with custom routing
spec:
  router:
    scheduler:
      scorerPlugins:
        - name: queue-scorer
          weight: 2
        - name: kv-cache-utilization-scorer
          weight: 2
        - name: prefix-cache-scorer
          weight: 3
      pickerPlugin:
        name: max-score-picker

Inspecting the Scheduler

# Check scheduler logs
kubectl logs -l app.kubernetes.io/component=llminferenceservice-scheduler

# Check scheduler metrics
kubectl port-forward svc/scheduler 9090:9090
curl http://localhost:9090/metrics | grep "endpoint_picker"
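The prefix-cache-scorer favors endpoints whose KV cache (tracked via the ZMQ event stream) already holds the longest run of leading prompt blocks. A toy sketch of that matching idea; the block size and cache representation are assumptions for illustration:

```python
# Illustrative sketch of prefix-aware scoring: the prompt is split into
# fixed-size token blocks, and an endpoint scores higher the longer the
# contiguous run of leading blocks already present in its cache.

BLOCK = 16  # tokens per KV-cache block (assumed)

def blocks(tokens):
    return [tuple(tokens[i:i + BLOCK]) for i in range(0, len(tokens), BLOCK)]

def prefix_score(prompt_tokens, cached_blocks):
    """Fraction of leading prompt blocks already cached on the endpoint."""
    bs = blocks(prompt_tokens)
    hit = 0
    for b in bs:
        if b in cached_blocks:
            hit += 1
        else:
            break  # only a contiguous leading run counts
    return hit / len(bs) if bs else 0.0

prompt = list(range(48))                # 3 blocks of 16 tokens
cache = set(blocks(list(range(32))))    # endpoint has the first 2 blocks
print(prefix_score(prompt, cache))      # ~0.667: 2 of 3 leading blocks cached
```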

Related Pages

Implements Principle

Requires Environment

Uses Heuristic
