Principle:Kserve Kserve Serving Runtime Configuration
| Knowledge Sources | |
|---|---|
| Domains | MLOps, Model_Serving, Kubernetes |
| Last Updated | 2026-02-13 00:00 GMT |
Overview
A pluggable model server abstraction that decouples model format support from inference service definitions through cluster-scoped or namespace-scoped runtime templates.
Description
Serving Runtime Configuration uses the ClusterServingRuntime (cluster-scoped) and ServingRuntime (namespace-scoped) custom resources to define how a particular model format is served. Each runtime declares which model formats it supports (e.g., sklearn, tensorflow, onnx), the container image to use, the default resource requests, the supported protocol versions (e.g., v1, v2, grpc-v2), and any multi-model serving capabilities.
When a user creates an InferenceService that specifies a model format but no explicit container, the KServe controller automatically selects the matching ClusterServingRuntime (or namespace-scoped ServingRuntime) and merges its container template with the InferenceService spec. This separation of concerns allows platform teams to manage runtime versions centrally while data scientists focus on model deployment.
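The merge described above can be pictured with a small Python sketch. It is not KServe's implementation: `merge_container` and the plain-dict representation are hypothetical, the resource values are made up for illustration, and the real controller works on typed Kubernetes objects with more detailed merge rules. The sketch only shows the idea that the runtime supplies defaults and the InferenceService supplies overrides.

```python
from copy import deepcopy


def merge_container(runtime_container: dict, isvc_overrides: dict) -> dict:
    """Overlay InferenceService fields on a runtime container template.

    Hypothetical helper: nested dicts are merged recursively; any other
    value from the InferenceService replaces the runtime default.
    """
    merged = deepcopy(runtime_container)
    for key, value in isvc_overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_container(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Runtime-side defaults (illustrative values, managed by the platform team).
runtime_container = {
    "name": "kserve-container",
    "image": "triton-image:latest",
    "resources": {"requests": {"cpu": "1", "memory": "2Gi"}},
}

# User-side overrides taken from the InferenceService (illustrative values).
isvc_overrides = {
    "resources": {"requests": {"memory": "4Gi"}, "limits": {"memory": "4Gi"}},
}

merged = merge_container(runtime_container, isvc_overrides)
print(merged["image"])      # image still comes from the runtime template
print(merged["resources"])  # memory request overridden, cpu request kept
```

Because the template is deep-copied, the runtime definition itself is never mutated, which mirrors the intent that one runtime can back many InferenceServices.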
Usage
Use this principle when:
- Adding support for a new model server to a KServe cluster (e.g., Triton, MLServer, OpenVINO, TorchServe)
- Standardizing container images and resource defaults across teams
- Enabling multi-node inference with specialized runtimes (e.g., HuggingFace multi-node)
- Overriding default runtimes at the namespace level
Theoretical Basis
# Serving runtime selection flow (NOT implementation code)
ClusterServingRuntime:
  metadata.name: "kserve-tritonserver"
  spec:
    supportedModelFormats:
      - name: "tensorflow"
        version: "2"
        autoSelect: true
      - name: "onnx"
        autoSelect: true
    containers:
      - name: "kserve-container"
        image: "triton-image:latest"
        resources: ...
    protocolVersions: ["v2"]
Runtime selection algorithm:
1. User submits InferenceService with modelFormat.name = "tensorflow"
2. Controller queries namespace-scoped ServingRuntimes and ClusterServingRuntimes
3. Filter runtimes where supportedModelFormats includes "tensorflow"
4. Further filter by autoSelect = true (unless runtime explicitly named)
5. Merge selected runtime container template into pod spec
6. Apply user overrides (resource limits, env vars, args)
Multi-node variant:
- workerSpec defined alongside main container
- Controller creates LeaderWorkerSet instead of single Deployment
- Workers share the same runtime but with different roles
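Steps 1 to 4 of the selection algorithm above can be sketched in Python. This is a simplified model, not the controller's code: the `SupportedFormat` and `Runtime` dataclasses and the `select_runtime` function are hypothetical names, the runtime entries other than "kserve-tritonserver" are invented for illustration, and the ordering rule (namespace-scoped runtimes considered before cluster-scoped ones) is an assumption of this sketch. Version matching, protocol matching, and the multi-node variant are left out.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SupportedFormat:
    name: str
    auto_select: bool = False


@dataclass
class Runtime:
    name: str
    formats: List[SupportedFormat] = field(default_factory=list)
    namespaced: bool = False  # True: ServingRuntime, False: ClusterServingRuntime

    def supports(self, model_format: str, require_auto_select: bool) -> bool:
        """True if any declared format matches (and is auto-selectable if required)."""
        return any(
            f.name == model_format and (f.auto_select or not require_auto_select)
            for f in self.formats
        )


def select_runtime(
    runtimes: List[Runtime],
    model_format: str,
    runtime_name: Optional[str] = None,
) -> Optional[Runtime]:
    if runtime_name is not None:
        # Runtime explicitly named: autoSelect is not required.
        for rt in runtimes:
            if rt.name == runtime_name and rt.supports(model_format, False):
                return rt
        return None
    # Automatic selection: keep only runtimes with autoSelect for this format.
    candidates = [rt for rt in runtimes if rt.supports(model_format, True)]
    # Assumption of this sketch: namespace-scoped runtimes win over cluster-scoped.
    candidates.sort(key=lambda rt: not rt.namespaced)  # stable sort keeps list order
    return candidates[0] if candidates else None


runtimes = [
    Runtime("kserve-tritonserver",
            [SupportedFormat("tensorflow", True), SupportedFormat("onnx", True)]),
    Runtime("example-mlserver",
            [SupportedFormat("sklearn", True), SupportedFormat("onnx", False)]),
    Runtime("team-triton", [SupportedFormat("tensorflow", True)], namespaced=True),
]

auto_tf = select_runtime(runtimes, "tensorflow")
auto_onnx = select_runtime(runtimes, "onnx")
named_onnx = select_runtime(runtimes, "onnx", runtime_name="example-mlserver")
no_match = select_runtime(runtimes, "pytorch")
```

In this example the namespace-scoped "team-triton" is picked for tensorflow, "example-mlserver" is skipped for onnx during automatic selection because its onnx entry has autoSelect disabled, and naming it explicitly still works because the format is declared as supported.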
Related Pages
Implemented By
- Implementation:Kserve_Kserve_HuggingFace_Multinode_Runtime
- Implementation:Kserve_Kserve_MLServer_Runtime
- Implementation:Kserve_Kserve_OpenVINO_Runtime
- Implementation:Kserve_Kserve_Triton_Runtime