Workflow:Kserve Kserve Multi Model Serving

Knowledge Sources	KServe KServe Website Multi-Model Serving Guide
Domains	ML_Serving, Kubernetes, Multi_Model, Cost_Optimization
Last Updated	2026-02-13 14:00 GMT

Overview

End-to-end process for deploying multiple machine learning models on a shared InferenceService using KServe TrainedModel resources to maximize resource efficiency.

Description

This workflow addresses the scalability challenge of deploying hundreds to thousands of models on a Kubernetes cluster. Instead of creating one InferenceService per model, multiple models are loaded into a single InferenceService using the TrainedModel custom resource. A model agent sidecar manages model download, loading, and unloading on the model server. Because the models share one model server and one set of pods, per-model compute overhead (CPU, memory, GPU), pod count, and IP address consumption all drop substantially. Multi-model serving works with model servers that implement the V2 inference protocol load/unload endpoints, including Triton, scikit-learn, XGBoost, and LightGBM.

Usage

Use this workflow when you need to deploy a large number of models (tens to thousands) and want to minimize infrastructure costs and cluster resource consumption. It is particularly valuable when models are small enough to coexist on a single model server, when GPU resources must be shared across models, or when cluster pod and IP address limits become bottlenecks.

Execution Steps

Step 1: Deploy a shared InferenceService without a model

Create an InferenceService with the desired framework (sklearn, xgboost, triton, etc.) but without specifying a storageUri. This creates an empty model server ready to accept dynamically loaded models. Enable the multiModelServer flag in the serving runtime configuration.

Key considerations:

Do not include storageUri in the predictor spec
Set appropriate resource limits to accommodate multiple models
The framework must support the V2 protocol load/unload endpoints
Set minReplicas to at least 1 so the server is not scaled to zero while models are loaded
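
The considerations above can be sketched as a small manifest builder. This is a minimal illustration, not a definitive template: the service name, the resource values, and the helper function are hypothetical, and the field layout assumes the serving.kserve.io/v1beta1 InferenceService schema with a framework-specific predictor block. The output is JSON, which kubectl apply -f accepts in place of YAML.

```python
import json


def shared_inference_service(name, framework="sklearn", cpu="100m", memory="512Mi"):
    """Build an InferenceService manifest for an empty, shared model server.

    The predictor deliberately carries no storageUri, so no model is loaded
    at startup. All names and resource values here are illustrative.
    """
    resources = {"cpu": cpu, "memory": memory}
    return {
        "apiVersion": "serving.kserve.io/v1beta1",
        "kind": "InferenceService",
        "metadata": {"name": name},
        "spec": {
            "predictor": {
                # Keep at least one replica so loaded models stay available.
                "minReplicas": 1,
                framework: {
                    "protocolVersion": "v1",
                    # Size limits for all models that will share this server.
                    "resources": {
                        "limits": dict(resources),
                        "requests": dict(resources),
                    },
                },
            }
        },
    }


if __name__ == "__main__":
    # "shared-sklearn" is a hypothetical service name.
    print(json.dumps(shared_inference_service("shared-sklearn"), indent=2))
```

The printed document can be saved to a file and applied with kubectl. Enabling multi-model mode in the serving runtime configuration is a separate cluster-level setting and is not covered by this sketch.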

Step 2: Create TrainedModel resources

Author TrainedModel YAML manifests for each model to be deployed. Each TrainedModel specifies the target InferenceService, model storageUri, framework, and memory requirements. Apply the TrainedModel resources to the cluster. The trained model controller writes model configurations to a ConfigMap mounted by the InferenceService pod.

Key considerations:

Each TrainedModel references its parent InferenceService by name
Specify the memory field to help the controller manage capacity
Multiple TrainedModels can be applied simultaneously
Models are downloaded in parallel by the model agent
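
As an illustration of the manifests described in this step, the sketch below generates several TrainedModel resources and wraps them in a single Kubernetes List so they can be applied together. The model names, the gs://example-bucket/... URIs, and the helper functions are hypothetical placeholders; the field names assume the serving.kserve.io/v1alpha1 TrainedModel schema.

```python
import json


def trained_model(name, inference_service, storage_uri, framework, memory):
    """Build one TrainedModel manifest that targets a shared InferenceService."""
    return {
        "apiVersion": "serving.kserve.io/v1alpha1",
        "kind": "TrainedModel",
        "metadata": {"name": name},
        "spec": {
            # Name of the parent InferenceService that will host the model.
            "inferenceService": inference_service,
            "model": {
                "storageUri": storage_uri,
                "framework": framework,
                # Declared memory footprint, used for capacity management.
                "memory": memory,
            },
        },
    }


def as_kubernetes_list(items):
    """Wrap manifests in a v1 List so one kubectl apply covers all of them."""
    return {"apiVersion": "v1", "kind": "List", "items": list(items)}


if __name__ == "__main__":
    # Hypothetical model names and storage locations.
    models = [
        trained_model(
            f"model{i}-sklearn",
            "shared-sklearn",
            f"gs://example-bucket/models/model{i}",
            "sklearn",
            "256Mi",
        )
        for i in (1, 2)
    ]
    print(json.dumps(as_kubernetes_list(models), indent=2))
```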

Step 3: Wait for model agent to load models

The model agent sidecar in the InferenceService pod detects the ConfigMap changes, downloads model artifacts from the specified storageUri, and sends load requests to the model server. Monitor the agent container logs to verify successful model loading.

What happens:

ConfigMap is updated with model configurations
Model agent downloads models in parallel using goroutines
Agent sends load requests to the model server via V2 protocol
Each model transitions to a loaded state on the server
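
The sequence above amounts to a reconciliation between the models listed in the ConfigMap and the models currently loaded on the server. The following sketch shows that diffing logic only; it is not the agent's actual implementation (the real agent is written in Go). It assumes the model configuration is a JSON array of entries with a modelName key, and that the server exposes the V2 repository paths /v2/repository/models/{name}/load and /unload. The function name is hypothetical.

```python
import json


def plan_actions(models_json, loaded_models):
    """Return the (method, path) calls needed to reach the desired state.

    models_json:   JSON array of model entries, each with a "modelName" key
                   (assumed layout of the mounted model configuration).
    loaded_models: iterable of model names currently loaded on the server.
    """
    desired = {entry["modelName"] for entry in json.loads(models_json)}
    loaded = set(loaded_models)

    # Models present in the config but not on the server must be loaded.
    to_load = sorted(desired - loaded)
    # Models on the server but no longer in the config must be unloaded.
    to_unload = sorted(loaded - desired)

    actions = [("POST", f"/v2/repository/models/{name}/load") for name in to_load]
    actions += [("POST", f"/v2/repository/models/{name}/unload") for name in to_unload]
    return actions


if __name__ == "__main__":
    config = json.dumps([{"modelName": "model1-sklearn"}, {"modelName": "model2-sklearn"}])
    for method, path in plan_actions(config, ["model2-sklearn", "old-model"]):
        print(method, path)
```

In this sketch the download step is omitted; in the workflow described here, the agent fetches the artifacts from storageUri before it sends each load request.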

Step 4: Send predictions to individual models

Send prediction requests to specific models using the model name in the URL path. Each TrainedModel is addressable by its name within the shared InferenceService endpoint. The model server routes the request to the correct loaded model based on the model name.

Key considerations:

V1 endpoint: /v1/models/{trained_model_name}:predict
Access via the shared InferenceService hostname
All models share the same ingress endpoint but are differentiated by name
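
A request against the V1 endpoint can be sketched as follows. The ingress URL, the hostname, the model name, and the input row are hypothetical, and the payload assumes the V1 protocol's {"instances": [...]} body. The Host header carries the shared InferenceService hostname, which is a common pattern when calling through an ingress gateway by IP address; the helper only builds the request and does not send it.

```python
import json
import urllib.request


def build_predict_request(ingress_url, service_hostname, model_name, instances):
    """Build a V1 predict request for one TrainedModel on the shared service."""
    # The TrainedModel name selects the model within the shared endpoint.
    url = f"{ingress_url.rstrip('/')}/v1/models/{model_name}:predict"
    body = json.dumps({"instances": instances}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            # Hostname of the shared InferenceService (hypothetical value below).
            "Host": service_hostname,
            "Content-Type": "application/json",
        },
    )


if __name__ == "__main__":
    request = build_predict_request(
        "http://localhost:8080",
        "shared-sklearn.default.example.com",
        "model1-sklearn",
        [[6.8, 2.8, 4.8, 1.4]],
    )
    print(request.get_method(), request.full_url)
    # To actually send it: urllib.request.urlopen(request).read()
```

Switching models only changes the model_name argument; the ingress URL and hostname stay the same for every model on the service.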

Step 5: Manage model lifecycle

Add new models by creating additional TrainedModel resources. Remove models by deleting their TrainedModel resource, which triggers the model agent to unload the model from the server and removes it from the ConfigMap. Deleting the parent InferenceService cascades deletion to all associated TrainedModels.

Key considerations:

Creating a TrainedModel triggers automatic download and loading
Deleting a TrainedModel triggers automatic unloading
Deleting the InferenceService deletes all associated TrainedModels
Models can be added and removed at runtime without service interruption
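
The lifecycle operations map onto ordinary kubectl calls against the TrainedModel resource. The sketch below only assembles the command lines and does not run them. The fully qualified resource name trainedmodels.serving.kserve.io is assumed from the CRD group, and the file name, model name, namespace, and helper functions are hypothetical.

```python
import shlex

# Fully qualified resource name, assumed from the serving.kserve.io CRD group.
TRAINED_MODEL_RESOURCE = "trainedmodels.serving.kserve.io"


def add_model_cmd(manifest_path, namespace="default"):
    """Command that creates (or updates) a TrainedModel from a manifest file."""
    return ["kubectl", "apply", "-n", namespace, "-f", manifest_path]


def remove_model_cmd(model_name, namespace="default"):
    """Command that deletes a TrainedModel, which triggers unloading."""
    return ["kubectl", "delete", TRAINED_MODEL_RESOURCE, model_name, "-n", namespace]


def list_models_cmd(namespace="default"):
    """Command that lists the TrainedModels in a namespace."""
    return ["kubectl", "get", TRAINED_MODEL_RESOURCE, "-n", namespace]


if __name__ == "__main__":
    for cmd in (
        add_model_cmd("model3.json"),
        remove_model_cmd("model1-sklearn"),
        list_models_cmd(),
    ):
        # shlex.join renders the argv list as a copy-pasteable shell line.
        print(shlex.join(cmd))
```

Each list can be passed to subprocess.run(cmd, check=True) on a machine where kubectl is configured for the target cluster.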
