Principle: SeldonIO Seldon core Model Deployment Execution
| Property | Value |
|---|---|
| Principle Name | Model_Deployment_Execution |
| Overview | The process of registering and loading ML model artifacts onto inference servers. |
| Workflow | Model_Deployment |
| Domains | MLOps, Kubernetes |
| Related Implementation | SeldonIO_Seldon_core_Seldon_Model_Load |
| Last Updated | 2026-02-13 00:00 GMT |
Description
Deploying a model in Seldon Core 2 involves submitting the Model CRD to the scheduler (via the Seldon CLI or kubectl), which then assigns the model to a compatible Server, downloads the artifact from storage, and loads it into the runtime (MLServer or Triton). This is the pivotal step that transitions a model definition from a declarative specification into a live, serving endpoint.
The deployment process follows a well-defined sequence:
- Submission: The Model CRD is submitted to the control plane via `seldon model load` or `kubectl apply`
- Scheduling: The Seldon scheduler evaluates the model's requirements against available Server capabilities and capacity
- Assignment: The scheduler assigns the model to a compatible Server (or queues it if no capacity is available)
- Download: The Server uses rclone to download the model artifact from the specified `storageUri`
- Loading: The inference runtime (MLServer or Triton) loads the artifact into memory and initializes it for serving
- Ready: The model transitions to a ready state and becomes available for inference requests
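The sequence above can be sketched as a simple state progression. This is an illustrative model only (the state names and helper below are not Seldon's internal API), assuming a happy path where each step succeeds:

```python
# Illustrative sketch, not Seldon source: the deployment sequence modeled
# as an ordered state progression for a single model.
from enum import Enum, auto

class ModelState(Enum):
    SUBMITTED = auto()    # Model CRD accepted by the control plane
    SCHEDULED = auto()    # scheduler matched requirements to a Server
    ASSIGNED = auto()     # model assigned to a compatible Server
    DOWNLOADING = auto()  # artifact fetched from storageUri via rclone
    LOADING = auto()      # runtime (MLServer/Triton) initializing the model
    READY = auto()        # serving inference requests

# The happy-path order; a failure at any step (e.g. no capacity after
# SUBMITTED) halts or queues the progression rather than advancing it.
PROGRESSION = [ModelState.SUBMITTED, ModelState.SCHEDULED, ModelState.ASSIGNED,
               ModelState.DOWNLOADING, ModelState.LOADING, ModelState.READY]

def next_state(current: ModelState) -> ModelState:
    """Return the next state in the happy path; READY is terminal."""
    i = PROGRESSION.index(current)
    return PROGRESSION[min(i + 1, len(PROGRESSION) - 1)]
```

In the real system these transitions are driven asynchronously by the scheduler and the Server agents, and can move backwards (e.g. on Server loss) rather than only forwards.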
Theoretical Basis
Model deployment in Kubernetes follows the reconciliation loop pattern: desired state (the Model CRD) is compared against actual state (what is currently loaded on Servers), and controllers act to converge. This pattern provides:
- Eventual consistency: Even if a Server temporarily goes down, the controller will re-deploy the model when the Server recovers
- Self-healing: If a model fails to load, the controller can retry or reassign to a different Server
- Declarative intent: Users express what they want, not the procedural steps to achieve it
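The reconciliation loop can be sketched as a pure function that diffs desired state against actual state. The data shapes and action names below are illustrative assumptions, not Seldon's internal representation:

```python
# Minimal sketch of the reconciliation-loop pattern: compare desired state
# (what the Model CRDs declare) with actual state (what Servers report),
# and emit the actions needed to converge. Illustrative only.

def reconcile(desired: dict[str, str], actual: dict[str, str]) -> list[tuple[str, str]]:
    """Compute actions to converge actual state toward desired state.

    Both dicts map model name -> assigned server name.
    Returns (action, model) pairs; a controller would execute these and
    re-run the loop until no actions remain (eventual consistency).
    """
    actions = []
    for model, server in desired.items():
        if actual.get(model) != server:
            actions.append(("load", model))    # missing, or on the wrong server
    for model in actual:
        if model not in desired:
            actions.append(("unload", model))  # no longer declared
    return actions
```

For example, `reconcile({"iris": "mlserver-0"}, {})` yields `[("load", "iris")]`; running the loop again once the load succeeds yields no actions, which is the converged state.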
The scheduler optimizes model-to-server assignment based on:
- Requirements matching: Model requirements (e.g., `sklearn`) must be a subset of Server capabilities
- Capacity planning: Models declare memory requirements, and Servers have memory limits with configurable overcommit ratios
- Affinity rules: Optional server pinning allows operators to control placement for performance or isolation reasons
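The first two criteria can be sketched as a subset check plus a capacity check. This is a hedged simplification (the real scheduler also handles replicas, affinity, and rescheduling); all names here are assumptions for illustration:

```python
# Sketch of the placement criteria: requirements must be a subset of the
# Server's capabilities, and the model's declared memory must fit within
# the Server's limit scaled by its overcommit ratio. Illustrative only.
from dataclasses import dataclass

@dataclass
class Server:
    name: str
    capabilities: set[str]        # e.g. {"sklearn", "xgboost"}
    memory_limit: int             # bytes available for models
    overcommit_ratio: float = 1.0 # e.g. 1.2 allows 20% overcommit
    used_memory: int = 0          # bytes already consumed by loaded models

def can_place(requirements: set[str], memory: int, server: Server) -> bool:
    effective_limit = server.memory_limit * server.overcommit_ratio
    return (requirements <= server.capabilities
            and server.used_memory + memory <= effective_limit)

def schedule(requirements: set[str], memory: int, servers: list[Server]):
    """Assign to the first compatible Server, or None to queue the model."""
    return next((s for s in servers if can_place(requirements, memory, s)), None)
```

Returning `None` corresponds to the queueing behavior described in the deployment sequence: the model waits until a compatible Server gains capacity.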
The two-phase approach (CLI submission followed by scheduler assignment) decouples the user interface from the infrastructure management, allowing the scheduler to make intelligent placement decisions based on cluster-wide state.
Usage
This principle applies after defining a Model resource, when ready to make it available for inference. The deployment can be triggered in two ways:
Using the Seldon CLI:

```bash
seldon model load -f model.yaml
```

Using kubectl:

```bash
kubectl apply -f model.yaml
```
Both methods submit the Model CRD to the control plane, triggering the scheduling and loading process. The CLI method communicates directly with the Seldon scheduler, while the kubectl method goes through the Kubernetes API server and the Seldon operator reconciles the resource.
After submission, users should verify that the model has been successfully loaded using the readiness verification step before sending inference requests.
Related Pages
- SeldonIO_Seldon_core_Seldon_Model_Load implements SeldonIO_Seldon_core_Model_Deployment_Execution
- SeldonIO_Seldon_core_Model_Resource_Definition precedes SeldonIO_Seldon_core_Model_Deployment_Execution
- SeldonIO_Seldon_core_Model_Readiness_Verification follows SeldonIO_Seldon_core_Model_Deployment_Execution
- SeldonIO_Seldon_core_Model_Artifact_Preparation is required by SeldonIO_Seldon_core_Model_Deployment_Execution
- Heuristic:SeldonIO_Seldon_core_Model_Scheduling_Preference_Tip
- Heuristic:SeldonIO_Seldon_core_Autoscaling_Dual_Config_Tip