Principle:Kserve Kserve Serving Runtime Configuration

From Leeroopedia
Knowledge Sources
Domains: MLOps, Model_Serving, Kubernetes
Last Updated: 2026-02-13 00:00 GMT

Overview

A pluggable model server abstraction that decouples model format support from inference service definitions through cluster-scoped or namespace-scoped runtime templates.

Description

Serving Runtime Configuration uses the ClusterServingRuntime (cluster-scoped) and ServingRuntime (namespace-scoped) custom resources to define how a particular model format is served. Each runtime declares the model formats it supports (e.g., sklearn, tensorflow, onnx), the container image to use, default resource requests, the protocol versions it supports (e.g., v1, v2, grpc-v2), and whether it supports multi-model serving.

When a user creates an InferenceService that specifies a model format but does not name a runtime, the KServe controller selects a matching ServingRuntime from the InferenceService's namespace or a matching ClusterServingRuntime, then merges that runtime's container template with the InferenceService spec. Fields set on the InferenceService (for example resources, environment variables, and arguments) override the runtime's defaults. This separation of concerns lets platform teams manage runtime images and versions centrally while data scientists focus on model deployment.
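The merge described above can be pictured with a small Python sketch. It is not KServe's controller code: the function name `merge_container`, the merge rules (nested mappings merge recursively, `env` entries merge by name, everything else is replaced), and the resource and environment values are assumptions chosen only for illustration.

```python
from copy import deepcopy


def merge_container(runtime_container: dict, user_overrides: dict) -> dict:
    """Overlay user-supplied fields on a runtime's container template.

    Assumed rules for this sketch: nested dicts merge recursively, `env`
    entries merge by name, and any other user value replaces the default.
    """
    merged = deepcopy(runtime_container)
    for key, value in user_overrides.items():
        if key == "env":
            # Index the template's env vars by name, then let user entries win.
            by_name = {e["name"]: e for e in merged.get("env", [])}
            for entry in value:
                by_name[entry["name"]] = deepcopy(entry)
            merged["env"] = list(by_name.values())
        elif isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_container(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Container template as a runtime might declare it (values are hypothetical).
runtime_container = {
    "name": "kserve-container",
    "image": "triton-image:latest",
    "resources": {"requests": {"cpu": "1", "memory": "2Gi"}},
    "env": [{"name": "LOG_LEVEL", "value": "info"}],
}

# Fields a user might set on the InferenceService (values are hypothetical).
user_overrides = {
    "resources": {"requests": {"memory": "4Gi"}, "limits": {"memory": "4Gi"}},
    "env": [{"name": "LOG_LEVEL", "value": "debug"}],
}

merged = merge_container(runtime_container, user_overrides)
```

In this sketch the image and CPU request come from the runtime, while the memory settings and `LOG_LEVEL` come from the user, which mirrors the split between platform-managed defaults and per-deployment overrides.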

Usage

Use this principle when:

  • Adding support for a new model server to a KServe cluster (e.g., Triton, MLServer, OpenVINO, TorchServe)
  • Standardizing container images and resource defaults across teams
  • Enabling multi-node inference with specialized runtimes (e.g., HuggingFace multi-node)
  • Overriding default runtimes at the namespace level

Theoretical Basis

# Serving runtime selection flow (illustrative, NOT implementation code)
apiVersion: serving.kserve.io/v1alpha1
kind: ClusterServingRuntime
metadata:
  name: "kserve-tritonserver"
spec:
  supportedModelFormats:
    - name: "tensorflow"
      version: "2"
      autoSelect: true
    - name: "onnx"
      autoSelect: true
  containers:
    - name: "kserve-container"
      image: "triton-image:latest"
      resources: ...   # elided in the source
  protocolVersions: ["v2"]

Runtime selection algorithm:
  1. User submits InferenceService with modelFormat.name = "tensorflow"
  2. Controller lists the ServingRuntimes in the InferenceService's namespace and all ClusterServingRuntimes
  3. Filter runtimes whose supportedModelFormats include "tensorflow"
  4. Keep only entries with autoSelect = true (skipped when the InferenceService names a runtime explicitly)
  5. Merge the selected runtime's container template into the pod spec
  6. Apply user overrides (resource limits, env vars, args)

Multi-node variant:
  - workerSpec defined alongside main container
  - Controller creates LeaderWorkerSet instead of single Deployment
  - Workers share the same runtime but with different roles
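
The selection steps above can be sketched in Python. This is a simplified model, not the controller's implementation: the `Runtime` and `SupportedFormat` classes, the `select_runtime` function, and the runtime names `team-triton` and `manual-only` are hypothetical. The sketch also assumes that a namespace-scoped runtime is preferred over a cluster-scoped one when both match, and it ignores version and protocol matching.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SupportedFormat:
    name: str
    version: Optional[str] = None
    auto_select: bool = False


@dataclass
class Runtime:
    name: str
    formats: List[SupportedFormat]
    namespace: Optional[str] = None  # None models a ClusterServingRuntime


def supports(runtime: Runtime, model_format: str, require_auto_select: bool) -> bool:
    """True if the runtime lists the format (and autoSelect, when required)."""
    return any(
        f.name == model_format and (f.auto_select or not require_auto_select)
        for f in runtime.formats
    )


def select_runtime(
    runtimes: List[Runtime],
    model_format: str,
    namespace: str,
    runtime_name: Optional[str] = None,
) -> Optional[Runtime]:
    # Only cluster-scoped runtimes and those in the caller's namespace are visible.
    visible = [r for r in runtimes if r.namespace in (None, namespace)]
    if runtime_name is not None:
        # An explicitly named runtime bypasses the autoSelect filter.
        candidates = [
            r for r in visible
            if r.name == runtime_name and supports(r, model_format, False)
        ]
    else:
        candidates = [r for r in visible if supports(r, model_format, True)]
    # Assumption of this sketch: namespace-scoped runtimes sort first.
    candidates.sort(key=lambda r: r.namespace is None)
    return candidates[0] if candidates else None


runtimes = [
    Runtime("kserve-tritonserver", [
        SupportedFormat("tensorflow", "2", auto_select=True),
        SupportedFormat("onnx", auto_select=True),
    ]),
    Runtime("team-triton", [SupportedFormat("tensorflow", "2", auto_select=True)],
            namespace="team-a"),
    Runtime("manual-only", [SupportedFormat("tensorflow", "2", auto_select=False)]),
]

in_team_a = select_runtime(runtimes, "tensorflow", "team-a")
in_team_b = select_runtime(runtimes, "tensorflow", "team-b")
explicit = select_runtime(runtimes, "tensorflow", "team-b", runtime_name="manual-only")
unsupported = select_runtime(runtimes, "sklearn", "team-a")
```

Here `in_team_a` resolves to the namespace-scoped `team-triton`, `in_team_b` falls back to the cluster-scoped `kserve-tritonserver`, `explicit` picks `manual-only` even though its autoSelect flag is off, and `unsupported` is `None` because no runtime lists sklearn.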
