Principle:SeldonIO Seldon core Model Deployment Execution

From Leeroopedia
Property | Value
Principle Name | Model_Deployment_Execution
Overview | The process of registering and loading ML model artifacts onto inference servers.
Workflow | Model_Deployment
Domains | MLOps, Kubernetes
Related Implementation | SeldonIO_Seldon_core_Seldon_Model_Load
Last Updated | 2026-02-13 00:00 GMT

Description

Deploying a model in Seldon Core 2 involves submitting the Model CRD to the scheduler (via the Seldon CLI or kubectl), which then assigns the model to a compatible Server, downloads the artifact from storage, and loads it into the runtime (MLServer or Triton). This is the pivotal step that transitions a model from a declarative specification into a live serving endpoint.

The deployment process follows a well-defined sequence:

  1. Submission: The Model CRD is submitted to the Kubernetes API server via seldon model load or kubectl apply
  2. Scheduling: The Seldon scheduler evaluates the model's requirements against available Server capabilities and capacity
  3. Assignment: The scheduler assigns the model to a compatible Server (or queues it if no capacity is available)
  4. Download: The Server uses rclone to download the model artifact from the specified storageUri
  5. Loading: The inference runtime (MLServer or Triton) loads the artifact into memory and initializes it for serving
  6. Ready: The model transitions to a ready state and becomes available for inference requests
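The sequence above starts from a Model CRD. A minimal sketch is shown below; the model name, storage URI, and memory value are illustrative placeholders, so adjust them for your own artifact:

```yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: iris                  # illustrative model name
spec:
  # Artifact location the assigned Server downloads via rclone (step 4)
  storageUri: "gs://seldon-models/scv2/samples/mlserver_1.3.5/iris-sklearn"
  # Capabilities a candidate Server must offer (evaluated in step 2)
  requirements:
  - sklearn
  # Declared memory, used by the scheduler for capacity planning
  memory: 100Ki
```

The `requirements` and `memory` fields are what the scheduler evaluates against Server capabilities and capacity before assignment.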

Theoretical Basis

Model deployment in Kubernetes follows the reconciliation loop pattern: desired state (the Model CRD) is compared against actual state (what is currently loaded on Servers), and controllers act to converge. This pattern provides:

  • Eventual consistency: Even if a Server temporarily goes down, the controller will re-deploy the model when the Server recovers
  • Self-healing: If a model fails to load, the controller can retry or reassign to a different Server
  • Declarative intent: Users express what they want, not the procedural steps to achieve it

The scheduler optimizes model-to-server assignment based on:

  • Requirements matching: Model requirements (e.g., sklearn) must be a subset of Server capabilities
  • Capacity planning: Models declare memory requirements, and Servers have memory limits with configurable overcommit ratios
  • Affinity rules: Optional server pinning allows operators to control placement for performance or isolation reasons
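The first two criteria can be illustrated with a short Python sketch. This is not Seldon's actual scheduler code; the function and field names here are invented for illustration only:

```python
def can_host(model, server, overcommit_ratio=1.0):
    """Illustrative check: can this Server host this model?

    Mirrors the two scheduler criteria described above
    (requirements matching and capacity planning).
    Hypothetical code, not Seldon's implementation.
    """
    # Requirements matching: the model's requirements must be a
    # subset of the server's advertised capabilities
    if not set(model["requirements"]) <= set(server["capabilities"]):
        return False
    # Capacity planning: the declared memory must fit within the
    # server's limit scaled by the configurable overcommit ratio
    available = server["memory_limit"] * overcommit_ratio - server["memory_used"]
    return model["memory"] <= available

model = {"requirements": ["sklearn"], "memory": 100}
server = {"capabilities": ["sklearn", "xgboost"],
          "memory_limit": 1000, "memory_used": 950}

print(can_host(model, server))                        # False: only 50 free
print(can_host(model, server, overcommit_ratio=1.2))  # True: 250 free
```

With no overcommit the model is queued (50 units free against a 100-unit requirement); raising the overcommit ratio makes the same server eligible, which is the trade-off the configurable ratio exposes.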

The two-phase approach (CLI submission followed by scheduler assignment) decouples the user interface from the infrastructure management, allowing the scheduler to make intelligent placement decisions based on cluster-wide state.

Usage

This principle applies after defining a Model resource, when ready to make it available for inference. The deployment can be triggered in two ways:

Using the Seldon CLI

seldon model load -f model.yaml

Using kubectl

kubectl apply -f model.yaml

Both methods submit the Model CRD to the control plane, triggering the scheduling and loading process. The CLI method communicates directly with the Seldon scheduler, while the kubectl method goes through the Kubernetes API server, where the Seldon operator reconciles the resource.

After submission, users should verify that the model has been successfully loaded using the readiness verification step before sending inference requests.
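For example, readiness can be checked from either interface. The model name `iris` and the namespace `seldon-mesh` below are assumptions; substitute the values from your own deployment:

```shell
# Wait for the scheduler to report the model as available (Seldon CLI)
seldon model status iris -w ModelAvailable

# Or wait on the CRD's status condition via kubectl
kubectl wait --for condition=ready --timeout=300s model/iris -n seldon-mesh
```

Only once the model reports ready should inference traffic be sent; requests issued earlier may be rejected while the artifact is still downloading or loading.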

Related Pages

Implementation:SeldonIO_Seldon_core_Seldon_Model_Load
