Workflow:Kserve Kserve Multi Model Serving

From Leeroopedia
Knowledge Sources
Domains ML_Serving, Kubernetes, Multi_Model, Cost_Optimization
Last Updated 2026-02-13 14:00 GMT

Overview

End-to-end process for deploying multiple machine learning models on a shared InferenceService using KServe TrainedModel resources to maximize resource efficiency.

Description

This workflow addresses the scalability challenge of deploying hundreds to thousands of models on a Kubernetes cluster. Instead of creating one InferenceService per model, multiple models are loaded into a single InferenceService using the TrainedModel custom resource. A model agent sidecar manages model download, loading, and unloading on the model server. Because the models share one set of pods, this reduces per-model compute overhead (CPU, memory, GPU), the number of pods, and IP address consumption. Multi-model serving works with model servers that implement the V2 inference protocol load/unload endpoints, including the Triton, scikit-learn, XGBoost, and LightGBM servers.

Usage

Execute this workflow when you need to deploy a large number of models (tens to thousands) and want to minimize infrastructure costs and cluster resource consumption. This is particularly valuable when models are small enough to coexist on a single model server, when GPU resources must be shared across models, or when cluster pod and IP address limits become bottlenecks.

Execution Steps

Step 1: Deploy a shared InferenceService without a model

Create an InferenceService for the desired framework (sklearn, xgboost, triton, etc.) but without specifying a storageUri. This starts an empty model server that is ready to accept dynamically loaded models. Enable the multiModelServer flag in the serving runtime configuration.

Key considerations:

  • Do not include storageUri in the predictor spec
  • Set appropriate resource limits to accommodate multiple models
  • The framework must support the V2 protocol load/unload endpoints
  • Set minReplicas to at least 1 so the server keeps running and does not scale to zero
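
The considerations above can be sketched as a manifest built in Python. This is a minimal sketch, not a reference configuration: the service name `sklearn-mms-example` and the resource values are hypothetical, and the field layout assumes the `serving.kserve.io/v1beta1` InferenceService schema. The point it illustrates is that the predictor carries resources and minReplicas but no storageUri.

```python
import json


def build_shared_inference_service(name, framework="sklearn", memory="512Mi", cpu="100m"):
    """Return an InferenceService manifest whose predictor has no storageUri."""
    resources = {
        "requests": {"cpu": cpu, "memory": memory},
        "limits": {"cpu": cpu, "memory": memory},
    }
    return {
        "apiVersion": "serving.kserve.io/v1beta1",
        "kind": "InferenceService",
        "metadata": {"name": name},  # hypothetical name
        "spec": {
            "predictor": {
                # Keep at least one replica so the empty server stays up.
                "minReplicas": 1,
                # No storageUri: models arrive later through TrainedModel resources.
                framework: {"resources": resources},
            }
        },
    }


manifest = build_shared_inference_service("sklearn-mms-example")
print(json.dumps(manifest, indent=2))
```

The printed JSON can be saved to a file and applied with `kubectl apply -f`, since kubectl accepts JSON manifests as well as YAML.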

Step 2: Create TrainedModel resources

Author TrainedModel YAML manifests for each model to be deployed. Each TrainedModel specifies the target InferenceService, model storageUri, framework, and memory requirements. Apply the TrainedModel resources to the cluster. The trained model controller writes model configurations to a ConfigMap mounted by the InferenceService pod.

Key considerations:

  • Each TrainedModel references its parent InferenceService by name
  • Specify the memory field so the controller can account for each model against the server's capacity
  • Multiple TrainedModels can be applied simultaneously
  • Models are downloaded in parallel by the model agent
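
As an illustration of the fields described above, the following sketch generates TrainedModel manifests in Python. The model names, the `gs://example-bucket/...` URIs, and the memory values are placeholders, and the layout assumes the `serving.kserve.io/v1alpha1` TrainedModel schema.

```python
import json


def build_trained_model(name, inference_service, storage_uri, framework, memory):
    """Return a TrainedModel manifest that targets a shared InferenceService."""
    return {
        "apiVersion": "serving.kserve.io/v1alpha1",
        "kind": "TrainedModel",
        "metadata": {"name": name},
        "spec": {
            # Name of the parent InferenceService created in Step 1.
            "inferenceService": inference_service,
            "model": {
                "storageUri": storage_uri,
                "framework": framework,
                "memory": memory,
            },
        },
    }


# Hypothetical models; replace the names and URIs with real ones.
trained_models = [
    build_trained_model(
        name=f"model{i}-sklearn",
        inference_service="sklearn-mms-example",
        storage_uri=f"gs://example-bucket/models/model{i}",
        framework="sklearn",
        memory="256Mi",
    )
    for i in (1, 2)
]

for tm in trained_models:
    print(json.dumps(tm, indent=2))
```

Generating the manifests from one function keeps the parent reference consistent across many models, which matters once the count reaches the hundreds.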

Step 3: Wait for the model agent to load models

The model agent sidecar in the InferenceService pod detects the ConfigMap changes, downloads model artifacts from the specified storageUri, and sends load requests to the model server. Monitor the agent container logs to verify successful model loading.

What happens:

  • ConfigMap is updated with model configurations
  • Model agent downloads models in parallel using goroutines
  • Agent sends load requests to the model server via V2 protocol
  • Each model transitions to a loaded state on the server
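
The sequence above can be modeled roughly as follows. This is a Python analogue for illustration only, not the agent's actual code (the real agent is written in Go). It assumes the desired state is a mapping of model name to storageUri, uses a stub in place of the real download, uses `/mnt/models` as a hypothetical local directory, and assumes the load/unload paths follow the V2 model repository convention `/v2/repository/models/{name}/load`.

```python
from concurrent.futures import ThreadPoolExecutor


def plan_agent_actions(desired, loaded):
    """Compare desired models (name -> storageUri) with the loaded set."""
    to_load = sorted(name for name in desired if name not in loaded)
    to_unload = sorted(name for name in loaded if name not in desired)
    return to_load, to_unload


def repository_path(model_name, action):
    """Build the assumed V2 repository path for a load or unload call."""
    if action not in ("load", "unload"):
        raise ValueError(f"unsupported action: {action}")
    return f"/v2/repository/models/{model_name}/{action}"


def fake_download(item):
    """Stub download: ignores the URI and returns a hypothetical local path."""
    name, _storage_uri = item
    return name, f"/mnt/models/{name}"


desired = {
    "model1-sklearn": "gs://example-bucket/models/model1",
    "model2-sklearn": "gs://example-bucket/models/model2",
}
loaded = {"model1-sklearn", "old-model"}

to_load, to_unload = plan_agent_actions(desired, loaded)

# Downloads run concurrently, mirroring the agent's parallel behavior.
with ThreadPoolExecutor(max_workers=4) as pool:
    downloaded = dict(pool.map(fake_download, [(n, desired[n]) for n in to_load]))

load_calls = [repository_path(name, "load") for name in downloaded]
unload_calls = [repository_path(name, "unload") for name in to_unload]
print(load_calls, unload_calls)
```

In a real cluster the equivalent progress is visible in the agent container logs rather than in code like this.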

Step 4: Send predictions to individual models

Send prediction requests to specific models using the model name in the URL path. Each TrainedModel is addressable by its name within the shared InferenceService endpoint. The model server routes the request to the correct loaded model based on the model name.

Key considerations:

  • V1 endpoint: /v1/models/{trained_model_name}:predict
  • Access via the shared InferenceService hostname
  • All models share the same ingress endpoint but are differentiated by name
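
A request against the V1 endpoint above might be built like this. The sketch uses only the Python standard library. The base URL, Host header value, model name, and input row are hypothetical, and the `{"instances": [...]}` body assumes the V1 prediction protocol. The request is constructed but not sent, so the snippet runs without a cluster.

```python
import json
import urllib.request


def build_predict_request(base_url, model_name, instances, host_header=None):
    """Build a V1 predict request addressed to one TrainedModel by name."""
    url = f"{base_url.rstrip('/')}/v1/models/{model_name}:predict"
    body = json.dumps({"instances": instances}).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    if host_header:
        # Shared InferenceService hostname, needed when going through an ingress IP.
        headers["Host"] = host_header
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


request = build_predict_request(
    base_url="http://localhost:8080",  # hypothetical ingress address
    model_name="model1-sklearn",       # hypothetical TrainedModel name
    instances=[[6.8, 2.8, 4.8, 1.4]],
    host_header="sklearn-mms-example.default.example.com",  # hypothetical hostname
)
print(request.get_method(), request.full_url)

# To send it against a live endpoint:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```

Switching models only changes the `model_name` path segment; the hostname and ingress stay the same for every model on the shared service.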

Step 5: Manage model lifecycle

Add new models by creating additional TrainedModel resources. Remove models by deleting their TrainedModel resource, which triggers the model agent to unload the model from the server and removes it from the ConfigMap. Deleting the parent InferenceService cascades deletion to all associated TrainedModels.

Key considerations:

  • Creating a TrainedModel triggers automatic download and loading
  • Deleting a TrainedModel triggers automatic unloading
  • Deleting the InferenceService deletes all associated TrainedModels
  • Models can be added and removed at runtime without service interruption
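
The add and remove operations above map onto ordinary kubectl calls. The sketch below only builds the argument lists and does not run them. The file name, model name, and namespace are hypothetical, and it assumes the TrainedModel CRD is installed so that `trainedmodel` resolves as a resource type.

```python
def kubectl_apply(manifest_path, namespace="default"):
    """Argument list that creates or updates resources from a manifest file."""
    return ["kubectl", "apply", "-f", manifest_path, "-n", namespace]


def kubectl_delete_trained_model(name, namespace="default"):
    """Argument list that deletes one TrainedModel, which triggers its unload."""
    return ["kubectl", "delete", "trainedmodel", name, "-n", namespace]


add_cmd = kubectl_apply("model3-sklearn.json")
remove_cmd = kubectl_delete_trained_model("model1-sklearn")
print(" ".join(add_cmd))
print(" ".join(remove_cmd))

# To execute against a cluster:
# import subprocess
# subprocess.run(add_cmd, check=True)
```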
