Principle: SeldonIO Seldon core Model Deployment Execution
| Property | Value |
|---|---|
| Principle Name | Model_Deployment_Execution |
| Overview | The process of registering and loading ML model artifacts onto inference servers. |
| Workflow | Model_Deployment |
| Domains | MLOps, Kubernetes |
| Related Implementation | SeldonIO_Seldon_core_Seldon_Model_Load |
| Last Updated | 2026-02-13 00:00 GMT |
Description
Deploying a model in Seldon Core 2 involves submitting the Model CRD to the scheduler (via the Seldon CLI or kubectl), which then assigns the model to a compatible Server, downloads the artifact from storage, and loads it into the runtime (MLServer or Triton). This is the pivotal step that transitions a model definition from a declarative specification into a live, serving endpoint.
The deployment process follows a well-defined sequence:
- Submission: The Model CRD is submitted to the control plane via `seldon model load` or `kubectl apply`
- Scheduling: The Seldon scheduler evaluates the model's requirements against available Server capabilities and capacity
- Assignment: The scheduler assigns the model to a compatible Server (or queues it if no capacity is available)
- Download: The Server uses rclone to download the model artifact from the specified `storageUri`
- Loading: The inference runtime (MLServer or Triton) loads the artifact into memory and initializes it for serving
- Ready: The model transitions to a ready state and becomes available for inference requests
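The sequence above can be sketched as a simple state progression. This is an illustrative model only (the state names and helper below are not Seldon's internal API), assuming a happy path where each step succeeds:

```python
# Illustrative sketch, not Seldon source: the deployment sequence modeled
# as an ordered state progression for a single model.
from enum import Enum, auto

class ModelState(Enum):
    SUBMITTED = auto()    # Model CRD accepted by the control plane
    SCHEDULED = auto()    # scheduler matched requirements to a Server
    ASSIGNED = auto()     # model assigned to a compatible Server
    DOWNLOADING = auto()  # artifact fetched from storageUri via rclone
    LOADING = auto()      # runtime (MLServer/Triton) initializing the model
    READY = auto()        # serving inference requests

# The happy-path order; a failure at any step (e.g. no capacity after
# SUBMITTED) halts or queues the progression rather than advancing it.
PROGRESSION = [ModelState.SUBMITTED, ModelState.SCHEDULED, ModelState.ASSIGNED,
               ModelState.DOWNLOADING, ModelState.LOADING, ModelState.READY]

def next_state(current: ModelState) -> ModelState:
    """Return the next state in the happy path; READY is terminal."""
    i = PROGRESSION.index(current)
    return PROGRESSION[min(i + 1, len(PROGRESSION) - 1)]
```

In the real system these transitions are driven asynchronously by the scheduler and the Server agents, and can move backwards (e.g. on Server loss) rather than only forwards.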
Theoretical Basis
Model deployment in Kubernetes follows the reconciliation loop pattern: desired state (the Model CRD) is compared against actual state (what is currently loaded on Servers), and controllers act to converge. This pattern provides:
- Eventual consistency: Even if a Server temporarily goes down, the controller will re-deploy the model when the Server recovers
- Self-healing: If a model fails to load, the controller can retry or reassign to a different Server
- Declarative intent: Users express what they want, not the procedural steps to achieve it
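The reconciliation loop can be sketched as a pure function that diffs desired state against actual state. The data shapes and action names below are illustrative assumptions, not Seldon's internal representation:

```python
# Minimal sketch of the reconciliation-loop pattern: compare desired state
# (what the Model CRDs declare) with actual state (what Servers report),
# and emit the actions needed to converge. Illustrative only.

def reconcile(desired: dict[str, str], actual: dict[str, str]) -> list[tuple[str, str]]:
    """Compute actions to converge actual state toward desired state.

    Both dicts map model name -> assigned server name.
    Returns (action, model) pairs; a controller would execute these and
    re-run the loop until no actions remain (eventual consistency).
    """
    actions = []
    for model, server in desired.items():
        if actual.get(model) != server:
            actions.append(("load", model))    # missing, or on the wrong server
    for model in actual:
        if model not in desired:
            actions.append(("unload", model))  # no longer declared
    return actions
```

For example, `reconcile({"iris": "mlserver-0"}, {})` yields `[("load", "iris")]`; running the loop again once the load succeeds yields no actions, which is the converged state.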
The scheduler optimizes model-to-server assignment based on:
- Requirements matching: Model requirements (e.g., `sklearn`) must be a subset of Server capabilities
- Capacity planning: Models declare memory requirements, and Servers have memory limits with configurable overcommit ratios
- Affinity rules: Optional server pinning allows operators to control placement for performance or isolation reasons
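The first two criteria can be sketched as a subset check plus a capacity check. This is a hedged simplification (the real scheduler also handles replicas, affinity, and rescheduling); all names here are assumptions for illustration:

```python
# Sketch of the placement criteria: requirements must be a subset of the
# Server's capabilities, and the model's declared memory must fit within
# the Server's limit scaled by its overcommit ratio. Illustrative only.
from dataclasses import dataclass

@dataclass
class Server:
    name: str
    capabilities: set[str]        # e.g. {"sklearn", "xgboost"}
    memory_limit: int             # bytes available for models
    overcommit_ratio: float = 1.0 # e.g. 1.2 allows 20% overcommit
    used_memory: int = 0          # bytes already consumed by loaded models

def can_place(requirements: set[str], memory: int, server: Server) -> bool:
    effective_limit = server.memory_limit * server.overcommit_ratio
    return (requirements <= server.capabilities
            and server.used_memory + memory <= effective_limit)

def schedule(requirements: set[str], memory: int, servers: list[Server]):
    """Assign to the first compatible Server, or None to queue the model."""
    return next((s for s in servers if can_place(requirements, memory, s)), None)
```

Returning `None` corresponds to the queueing behavior described in the deployment sequence: the model waits until a compatible Server gains capacity.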
The two-phase approach (CLI submission followed by scheduler assignment) decouples the user interface from the infrastructure management, allowing the scheduler to make intelligent placement decisions based on cluster-wide state.
Usage
This principle applies after defining a Model resource, when ready to make it available for inference. The deployment can be triggered in two ways:
Using the Seldon CLI:

```bash
seldon model load -f model.yaml
```

Using kubectl:

```bash
kubectl apply -f model.yaml
```
Both methods submit the Model CRD to the control plane, triggering the scheduling and loading process. The CLI method communicates directly with the Seldon scheduler, while the kubectl method goes through the Kubernetes API server and the Seldon operator reconciles the resource.
After submission, users should verify that the model has been successfully loaded using the readiness verification step before sending inference requests.
Related Pages
- SeldonIO_Seldon_core_Seldon_Model_Load implements SeldonIO_Seldon_core_Model_Deployment_Execution
- SeldonIO_Seldon_core_Model_Resource_Definition precedes SeldonIO_Seldon_core_Model_Deployment_Execution
- SeldonIO_Seldon_core_Model_Readiness_Verification follows SeldonIO_Seldon_core_Model_Deployment_Execution
- SeldonIO_Seldon_core_Model_Artifact_Preparation is required by SeldonIO_Seldon_core_Model_Deployment_Execution
- Heuristic:SeldonIO_Seldon_core_Model_Scheduling_Preference_Tip
- Heuristic:SeldonIO_Seldon_core_Autoscaling_Dual_Config_Tip