Principle:Kserve Kserve Serving Runtime Configuration

Knowledge Sources	Kserve_Kserve KServe Docs KServe Serving Runtimes
Domains	MLOps, Model_Serving, Kubernetes
Last Updated	2026-02-13 00:00 GMT

Overview

Serving runtime configuration is a pluggable model-server abstraction that decouples model format support from InferenceService definitions through cluster-scoped (ClusterServingRuntime) or namespace-scoped (ServingRuntime) runtime templates.

Description

Serving Runtime Configuration uses two custom resources, the cluster-scoped ClusterServingRuntime and the namespace-scoped ServingRuntime, to define how a particular model format is served. Each runtime declares which model formats it supports (e.g., sklearn, tensorflow, onnx), the container image to use, the default resource requests and limits, the supported protocol versions (e.g., v1, v2, grpc-v2), and whether it supports multi-model serving.

When a user creates an InferenceService that specifies a model format but does not name a runtime, the KServe controller automatically selects a matching runtime, considering ServingRuntimes in the InferenceService's namespace before ClusterServingRuntimes, and merges that runtime's container template with the InferenceService spec. This separation of concerns allows platform teams to manage runtime images and versions centrally while data scientists focus on model deployment.
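The merge step can be pictured as a layered dictionary update. The following is a minimal Python sketch, not KServe's controller code (which is written in Go). The helper name `merge_container`, the `LOG_LEVEL` variable, and the exact merge rules (user values win, `resources` merged per section, `env` merged by variable name) are assumptions made for illustration.

```python
from copy import deepcopy


def merge_container(runtime_container: dict, overrides: dict) -> dict:
    """Overlay InferenceService overrides on a runtime's container template.

    Assumed rules for this sketch: scalar fields from `overrides` replace the
    template's values, `resources` is merged per section, and `env` entries
    are merged by variable name.
    """
    merged = deepcopy(runtime_container)
    for key, value in overrides.items():
        if key == "env":
            # Index by name so an override replaces the template's entry.
            env = {e["name"]: e for e in merged.get("env", [])}
            env.update({e["name"]: e for e in value})
            merged["env"] = list(env.values())
        elif key == "resources":
            resources = merged.setdefault("resources", {})
            for section, quantities in value.items():  # "requests" / "limits"
                resources.setdefault(section, {}).update(quantities)
        else:
            merged[key] = deepcopy(value)
    return merged


# Container template as a runtime might declare it (placeholder values).
runtime_container = {
    "name": "kserve-container",
    "image": "triton-image:latest",
    "resources": {"requests": {"cpu": "1", "memory": "2Gi"}},
    "env": [{"name": "LOG_LEVEL", "value": "info"}],  # hypothetical variable
}
# Fields a user might set on the InferenceService.
user_overrides = {
    "resources": {"limits": {"memory": "4Gi"}},
    "env": [{"name": "LOG_LEVEL", "value": "debug"}],
}
merged = merge_container(runtime_container, user_overrides)
```

In this sketch the image still comes from the runtime, while the memory limit and the overridden variable come from the user, which mirrors the division of responsibility described above.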

Usage

Use this principle when:

Adding support for a new model server to a KServe cluster (e.g., Triton, MLServer, OpenVINO, TorchServe)
Standardizing container images and resource defaults across teams
Enabling multi-node inference with specialized runtimes (e.g., HuggingFace multi-node)
Overriding default runtimes at the namespace level
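For the namespace-level override case, an InferenceService can also name the runtime it wants instead of relying on automatic selection. The sketch below builds such a manifest as a plain Python dictionary and prints it as JSON, which `kubectl apply -f -` accepts. The helper `inference_service` and the values `sklearn-iris`, `my-team`, `my-team-sklearn-runtime`, and the storage URI are hypothetical placeholders.

```python
import json


def inference_service(name, namespace, model_format, storage_uri, runtime=None):
    """Build an InferenceService manifest; `runtime` pins a specific runtime."""
    model = {"modelFormat": {"name": model_format}, "storageUri": storage_uri}
    if runtime is not None:
        # Naming a runtime bypasses automatic runtime selection.
        model["runtime"] = runtime
    return {
        "apiVersion": "serving.kserve.io/v1beta1",
        "kind": "InferenceService",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {"predictor": {"model": model}},
    }


# Pinned to a (hypothetical) namespace-scoped ServingRuntime.
pinned = inference_service(
    "sklearn-iris", "my-team", "sklearn",
    "gs://example-bucket/models/iris",  # placeholder URI
    runtime="my-team-sklearn-runtime",
)
# Same service, leaving runtime selection to the controller.
auto = inference_service(
    "sklearn-iris", "my-team", "sklearn", "gs://example-bucket/models/iris"
)

print(json.dumps(pinned, indent=2))
```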

Theoretical Basis

# Serving runtime selection flow (NOT implementation code)
apiVersion: serving.kserve.io/v1alpha1
kind: ClusterServingRuntime
metadata:
  name: kserve-tritonserver
spec:
  supportedModelFormats:
    - name: tensorflow
      version: "2"
      autoSelect: true      # eligible for automatic selection
    - name: onnx
      autoSelect: true
  containers:
    - name: kserve-container
      image: triton-image:latest   # placeholder image reference
      resources: ...
  protocolVersions:
    - v2

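Runtime selection algorithm:
  1. User submits InferenceService with modelFormat.name = "tensorflow"
  2. Controller lists ServingRuntimes in the InferenceService's namespace and all ClusterServingRuntimes
  3. Filter runtimes where supportedModelFormats includes "tensorflow" (and the requested version and protocol version, if specified)
  4. Further filter by autoSelect = true (autoSelect is ignored when the InferenceService names a runtime explicitly)
  5. Prefer a namespace-scoped ServingRuntime over a ClusterServingRuntime, then merge the selected runtime's container template into the pod spec
  6. Apply user overrides (resource limits, env vars, args)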

Multi-node variant:
  - workerSpec defined alongside main container
  - Controller creates LeaderWorkerSet instead of single Deployment
  - Workers share the same runtime but with different roles
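The selection steps above can be sketched in Python as follows. This is an illustrative model only, not the controller's implementation: the class and function names are hypothetical, the runtime `team-onnx-runtime` is invented for the example, and the rule that namespace-scoped runtimes are tried first is an assumption of the sketch.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ModelFormat:
    name: str
    version: Optional[str] = None
    auto_select: bool = False


@dataclass
class Runtime:
    name: str
    formats: List[ModelFormat]
    namespaced: bool = False  # True: ServingRuntime, False: ClusterServingRuntime


def supports(runtime, fmt_name, fmt_version=None, require_auto_select=True):
    """Return True if any declared format matches the request."""
    for f in runtime.formats:
        if f.name != fmt_name:
            continue
        # A version constraint applies only when both sides state a version.
        if fmt_version and f.version and f.version != fmt_version:
            continue
        if require_auto_select and not f.auto_select:
            continue
        return True
    return False


def select_runtime(runtimes, fmt_name, fmt_version=None, explicit_name=None):
    if explicit_name is not None:
        # An explicitly named runtime does not need autoSelect.
        for r in runtimes:
            if r.name == explicit_name and supports(r, fmt_name, fmt_version, False):
                return r
        return None
    candidates = [r for r in runtimes if supports(r, fmt_name, fmt_version)]
    # Stable sort: namespace-scoped runtimes first (sketch assumption).
    candidates.sort(key=lambda r: not r.namespaced)
    return candidates[0] if candidates else None


runtimes = [
    Runtime("kserve-tritonserver", [
        ModelFormat("tensorflow", "2", auto_select=True),
        ModelFormat("onnx", auto_select=True),
    ]),
    Runtime("team-onnx-runtime", [ModelFormat("onnx", auto_select=True)],
            namespaced=True),
    Runtime("manual-only", [ModelFormat("tensorflow", "2", auto_select=False)]),
]

chosen_tf = select_runtime(runtimes, "tensorflow")
chosen_onnx = select_runtime(runtimes, "onnx")
chosen_explicit = select_runtime(runtimes, "tensorflow", explicit_name="manual-only")
chosen_none = select_runtime(runtimes, "sklearn")
```

With these inputs, a tensorflow request resolves to the cluster-scoped Triton runtime, an onnx request resolves to the namespace-scoped runtime, and the runtime without autoSelect is only reachable by naming it.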

