Workflow: KServe Multi-Model Serving
| Knowledge Sources | |
|---|---|
| Domains | ML_Serving, Kubernetes, Multi_Model, Cost_Optimization |
| Last Updated | 2026-02-13 14:00 GMT |
Overview
End-to-end process for deploying multiple machine learning models on a shared InferenceService using KServe TrainedModel resources to maximize resource efficiency.
Description
This workflow addresses the scalability challenge of deploying hundreds to thousands of models on a Kubernetes cluster. Instead of creating one InferenceService per model, you load multiple models into a single InferenceService using the TrainedModel custom resource. A model agent sidecar manages model download, loading, and unloading on the model server. Because many models share one pod, this reduces per-model compute overhead (CPU, memory, GPU), pod count, and IP address consumption. Multi-model serving requires a model server that implements the V2 inference protocol load/unload endpoints; supported servers include Triton, scikit-learn, XGBoost, and LightGBM.
Usage
Execute this workflow when you need to deploy a large number of models (tens to thousands) and want to minimize infrastructure costs and cluster resource consumption. This is particularly valuable when models are small enough to coexist on a single model server, when GPU resources must be shared across models, or when cluster pod and IP address limits become bottlenecks.
Execution Steps
Step 1: Create a multi-model InferenceService
Create an InferenceService with the desired framework (sklearn, xgboost, triton, etc.) but without specifying a storageUri. This creates an empty model server ready to accept dynamically loaded models. Enable the multiModelServer flag in the serving runtime configuration.
Key considerations:
- Do not include storageUri in the predictor spec
- Set appropriate resource limits to accommodate multiple models
- The framework must support the V2 protocol load/unload endpoints
- Set minReplicas to ensure the server stays running
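The considerations above can be sketched as a small manifest generator. This is an illustrative sketch, not an official KServe tool: the helper name `build_inference_service`, the service name `sklearn-mms`, and the resource values are assumptions chosen for the example, and the field layout assumes the `serving.kserve.io/v1beta1` InferenceService schema. The output is JSON, which `kubectl apply -f` accepts in place of YAML.

```python
import json


def build_inference_service(name, framework, cpu_limit, memory_limit, min_replicas=1):
    """Build an InferenceService manifest for an empty multi-model server.

    No storageUri is set, so the model server starts without any model loaded.
    """
    return {
        "apiVersion": "serving.kserve.io/v1beta1",
        "kind": "InferenceService",
        "metadata": {"name": name},
        "spec": {
            "predictor": {
                # Keep at least one replica so the shared server stays running.
                "minReplicas": min_replicas,
                framework: {
                    # Limits must cover every model that will share this server.
                    "resources": {
                        "limits": {"cpu": cpu_limit, "memory": memory_limit},
                    },
                },
            }
        },
    }


# Hypothetical name and sizes, for illustration only.
manifest = build_inference_service("sklearn-mms", "sklearn", "1", "2Gi")
print(json.dumps(manifest, indent=2))
```

Saving the printed JSON to a file and applying it with `kubectl apply -f` would create the empty server, assuming the cluster's serving runtime has multi-model serving enabled.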
Step 2: Create TrainedModel resources
Author a TrainedModel YAML manifest for each model to be deployed. Each TrainedModel specifies the target InferenceService, model storageUri, framework, and memory requirements. Apply the TrainedModel resources to the cluster. The trained model controller writes model configurations to a ConfigMap mounted by the InferenceService pod.
Key considerations:
- Each TrainedModel references its parent InferenceService by name
- Specify the memory field to help the controller manage capacity
- Multiple TrainedModels can be applied simultaneously
- Models are downloaded in parallel by the model agent
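When many models follow the same pattern, the manifests can be generated rather than written by hand. The sketch below assumes the `serving.kserve.io/v1alpha1` TrainedModel schema; the helper name `build_trained_model`, the model names, the parent service `sklearn-mms`, and the `s3://example-bucket/...` storage URIs are all hypothetical placeholders. The resources are wrapped in a Kubernetes `List` object so they can be applied in one command.

```python
import json


def build_trained_model(name, inference_service, storage_uri, framework, memory):
    """Build one TrainedModel manifest that targets a parent InferenceService."""
    return {
        "apiVersion": "serving.kserve.io/v1alpha1",
        "kind": "TrainedModel",
        "metadata": {"name": name},
        "spec": {
            # Name of the parent InferenceService that will host this model.
            "inferenceService": inference_service,
            "model": {
                "storageUri": storage_uri,
                "framework": framework,
                # Declared memory footprint of this single model.
                "memory": memory,
            },
        },
    }


# Hypothetical models; replace names and URIs with real ones.
models = [
    build_trained_model(
        name=f"model-{i}",
        inference_service="sklearn-mms",
        storage_uri=f"s3://example-bucket/models/model-{i}",
        framework="sklearn",
        memory="256Mi",
    )
    for i in range(1, 4)
]

# A List object lets kubectl apply several resources from one document.
bundle = {"apiVersion": "v1", "kind": "List", "items": models}
print(json.dumps(bundle, indent=2))
```

Piping the output into `kubectl apply -f -` would submit all three TrainedModels at once.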
Step 3: Wait for model agent to load models
The model agent sidecar in the InferenceService pod detects the ConfigMap changes, downloads model artifacts from the specified storageUri, and sends load requests to the model server. Monitor the agent container logs to verify successful model loading.
What happens:
- ConfigMap is updated with model configurations
- Model agent downloads models in parallel using goroutines
- Agent sends load requests to the model server via V2 protocol
- Each model transitions to a loaded state on the server
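The reconciliation idea behind this step can be illustrated in a few lines. This is a simplified Python sketch, not the agent's actual code (the real agent is written in Go and also handles downloads, retries, and concurrency). It assumes the model server listens on `localhost:8080` and exposes the V2 repository extension paths `/v2/repository/models/{name}/load` and `/unload`; the function name `plan_actions` is hypothetical.

```python
def plan_actions(desired, loaded, base_url="http://localhost:8080"):
    """Compare desired models (from the ConfigMap) with models already loaded.

    Returns a list of (action, model_name, url) tuples: models missing from
    the server are loaded, and models no longer desired are unloaded.
    """
    to_load = sorted(set(desired) - set(loaded))
    to_unload = sorted(set(loaded) - set(desired))
    actions = [
        ("load", name, f"{base_url}/v2/repository/models/{name}/load")
        for name in to_load
    ]
    actions += [
        ("unload", name, f"{base_url}/v2/repository/models/{name}/unload")
        for name in to_unload
    ]
    return actions


# Hypothetical state: the ConfigMap lists three models, the server holds two.
desired = {"model-1", "model-2", "model-3"}
loaded = {"model-2", "model-4"}

for action, name, url in plan_actions(desired, loaded):
    print(action, name, url)
```

In this example `model-1` and `model-3` would be loaded and `model-4` unloaded, while `model-2` is left untouched. The same comparison also describes what happens when a TrainedModel is later created or deleted.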
Step 4: Send predictions to individual models
Send prediction requests to specific models using the model name in the URL path. Each TrainedModel is addressable by its name within the shared InferenceService endpoint. The model server routes the request to the correct loaded model based on the model name.
Key considerations:
- V1 endpoint: /v1/models/{trained_model_name}:predict
- Access via the shared InferenceService hostname
- All models share the same ingress endpoint but are differentiated by name
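A prediction call against one of the hosted models might be assembled as follows. This sketch uses only the Python standard library and builds the request without sending it. The ingress address `203.0.113.10`, the hostname `sklearn-mms.default.example.com`, the model name `model-1`, and the input row are placeholders, and the `{"instances": [...]}` body assumes the V1 prediction payload format.

```python
import json
import urllib.request


def build_predict_request(ingress_url, service_hostname, model_name, instances):
    """Build a V1 predict request for one model on a shared InferenceService."""
    # The model name in the path selects which loaded model handles the request.
    url = f"{ingress_url.rstrip('/')}/v1/models/{model_name}:predict"
    body = json.dumps({"instances": instances}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        # The Host header carries the shared InferenceService hostname.
        headers={"Host": service_hostname, "Content-Type": "application/json"},
        method="POST",
    )


# Placeholder ingress address, hostname, and input row.
req = build_predict_request(
    "http://203.0.113.10:80",
    "sklearn-mms.default.example.com",
    "model-1",
    [[6.8, 2.8, 4.8, 1.4]],
)
print(req.get_method(), req.full_url)
# To send it against a reachable ingress: urllib.request.urlopen(req)
```

Switching `model_name` to another TrainedModel name targets a different model through the same hostname and ingress.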
Step 5: Manage model lifecycle
Add new models by creating additional TrainedModel resources. Remove models by deleting their TrainedModel resource, which triggers the model agent to unload the model from the server and removes it from the ConfigMap. Deleting the parent InferenceService cascades deletion to all associated TrainedModels.
Key considerations:
- Creating a TrainedModel triggers automatic download and loading
- Deleting a TrainedModel triggers automatic unloading
- Deleting the InferenceService deletes all associated TrainedModels
- Models can be added and removed at runtime without service interruption