Principle:SeldonIO Seldon core Model Lifecycle Management

From Leeroopedia
Property Value
Principle Name Model_Lifecycle_Management
Overview Operational procedures for unloading, updating, and rolling over ML model versions in production.
Workflow Model_Deployment
Domains MLOps, Kubernetes
Related Implementation SeldonIO_Seldon_core_Seldon_Model_Unload
Last Updated 2026-02-13 00:00 GMT

Description

Model lifecycle management includes unloading unused models (freeing server resources), performing rolling updates by resubmitting Model CRDs with updated storageUri versions, and managing the model-to-server assignment during transitions. This principle covers the operational phase that follows initial deployment, addressing the ongoing need to maintain, update, and decommission models in a production Seldon Core 2 environment.

The key lifecycle operations are:

  • Unloading: Removing a model from an inference server to free memory and compute resources, using seldon model unload or kubectl delete model
  • Rolling update: Submitting an updated Model CRD with a new storageUri (pointing to a newer model version), which triggers the scheduler to gracefully transition from the old version to the new one
  • Capacity management: Leveraging server overcommit to host more models than can fit in memory simultaneously, with the scheduler evicting and reloading models on demand based on usage patterns
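All three operations act on Model custom resources. A minimal sketch of such a manifest and its lifecycle commands, assuming the `mlops.seldon.io/v1alpha1` API group used by Seldon Core 2; the bucket path is hypothetical:

```shell
# Write a minimal Model manifest (the storageUri path is a placeholder):
cat > iris-model.yaml <<'EOF'
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: iris
spec:
  storageUri: "gs://my-bucket/models/iris/v1"
  requirements:
  - sklearn
EOF

# Load it onto a matching server, and later unload it when decommissioned
# (commands shown for reference; they require a running cluster):
# seldon model load -f iris-model.yaml
# seldon model unload iris        # or: kubectl delete model iris
```

Resubmitting the same manifest with a changed storageUri is what distinguishes a rolling update from an unload: the model name stays stable while the artifact behind it changes.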

Theoretical Basis

Rolling updates in Seldon Core 2 follow a version progression pattern: submitting an updated Model CRD triggers the scheduler to load the new version while gradually draining the old one. This approach ensures zero-downtime model updates by maintaining at least one active version at all times during the transition.

The lifecycle management model is based on several key concepts:

Server Overcommit

Seldon Core 2 supports server overcommit, allowing more models to be assigned to a Server than can fit in memory simultaneously. The scheduler uses an LRU (Least Recently Used) eviction strategy to manage which models are actively loaded. This enables:

  • Cost efficiency: A single Server can serve hundreds of models, with only the frequently accessed ones kept in memory
  • Graceful degradation: Infrequently used models experience a cold-start delay when first accessed but remain available
  • Resource optimization: Memory is dynamically allocated to the models that need it most
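Overcommit accounting depends on each model declaring its expected footprint so the scheduler can pack models against a Server's capacity. A sketch, assuming the Model spec's memory field and illustrative sizes:

```shell
# Declare the model's memory footprint so the scheduler can pack and
# evict models against the Server's capacity (values are illustrative,
# and the storageUri path is a placeholder):
cat > tiny-model.yaml <<'EOF'
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: tiny-classifier
spec:
  storageUri: "gs://my-bucket/models/tiny/v1"
  requirements:
  - sklearn
  memory: 100Ki
EOF
# With many such models assigned to one Server, the scheduler keeps only
# the recently used ones resident and reloads evicted ones on demand.
```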

Version Transition

When a model is updated (new storageUri), the scheduler orchestrates a controlled transition:

  1. New version load: The updated model artifact is downloaded and loaded on the Server
  2. Traffic shift: Once the new version is ready, inference requests are routed to it
  3. Old version drain: The previous version is unloaded after all in-flight requests complete

This is analogous to Kubernetes Deployment rolling updates but operates at the model level within a shared inference server.
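The three transition steps can be driven entirely by editing and resubmitting the Model manifest; a scripted sketch, with a hypothetical bucket path:

```shell
# Start from the manifest of the currently deployed model version
# (storageUri path is a placeholder):
cat > iris-model.yaml <<'EOF'
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: iris
spec:
  storageUri: "gs://my-bucket/models/iris/v1"
EOF

# Point storageUri at the new artifact version:
sed -i.bak 's|/iris/v1|/iris/v2|' iris-model.yaml

# Resubmit; the scheduler loads v2, shifts traffic to it, then drains
# and unloads v1 (requires a running cluster):
# seldon model load -f iris-model.yaml
# seldon model status iris -w ModelAvailable
```

Because the resource name (iris) never changes, callers see a single stable endpoint throughout the rollover.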

Model Unloading

Unloading a model removes it entirely from the scheduler's assignment and frees all associated resources. Unlike eviction under overcommit, which temporarily swaps a model out of memory while keeping it assigned and routable, unloading is permanent. Unloading is appropriate when:

  • A model is being decommissioned and will no longer receive requests
  • A model needs to be replaced by a fundamentally different model (not just a version update)
  • Server capacity needs to be reclaimed for higher-priority models

Usage

This principle applies when decommissioning models, updating to new model versions, or managing server capacity in a Seldon Core 2 deployment.

Unloading a Model

# Unload via Seldon CLI
seldon model unload iris

# Unload via kubectl
kubectl delete model iris

Rolling Update to New Version

# Update the storageUri in the Model YAML to point to the new version
# Then resubmit:
seldon model load -f updated-model.yaml

# The scheduler will handle the transition from old to new version
seldon model status iris -w ModelAvailable

Monitoring Server Capacity

# Check which models are loaded on a server
seldon server status mlserver-0

# List all models and their states
kubectl get models -o wide

Related Pages

Implementation:SeldonIO_Seldon_core_Seldon_Model_Unload
