
Workflow:SeldonIO Seldon core Model Deployment

From Leeroopedia
Knowledge Sources
Domains: MLOps, Model_Serving, Kubernetes
Last Updated: 2026-02-13 14:00 GMT

Overview

End-to-end process for deploying a pre-trained machine learning model on Seldon Core 2 and serving predictions via REST or gRPC.

Description

This workflow covers the foundational operation in Seldon Core 2: taking a trained model artifact stored in cloud or local storage and making it available as a production inference endpoint. The process involves defining a Model custom resource with its storage location, framework requirements, and memory allocation, then loading it onto an inference server (MLServer or Triton) where it becomes available for real-time predictions. Seldon Core 2 supports a wide range of ML frameworks including scikit-learn, TensorFlow, PyTorch, ONNX, XGBoost, LightGBM, MLflow, and HuggingFace transformers.

Usage

Execute this workflow when you have a trained model artifact (e.g., a serialized sklearn model, a TensorFlow SavedModel, or an ONNX file) stored in accessible storage (GCS, S3, MinIO, or local filesystem) and need to expose it as a scalable inference endpoint. This is the starting point for all Seldon Core 2 deployments, whether running locally via Docker Compose or on a Kubernetes cluster.

Execution Steps

Step 1: Prepare Model Artifact

Ensure the trained model artifact is stored in an accessible location with the correct directory structure expected by the inference server. For MLServer-based models, a model-settings.json file must accompany the model artifact, specifying the model name, implementation class, and any parameters. For Triton-based models, the standard Triton model repository layout with version directories is required.

Key considerations:

  • Model artifacts must be accompanied by server-specific metadata (model-settings.json for MLServer, config.pbtxt for Triton)
  • Supported storage backends include Google Cloud Storage (gs://), Amazon S3 (s3://), MinIO, and local filesystem paths
  • For authenticated storage, configure rclone secrets with the appropriate credentials
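
As an illustration of the MLServer case, the following is a minimal sketch that generates a model-settings.json file next to a model artifact. It assumes a scikit-learn model served by MLServer's SKLearnModel runtime; the model name "iris" and the artifact file name "model.joblib" are hypothetical placeholders, and the helper functions are not part of Seldon or MLServer.

```python
import json
from pathlib import Path


def build_model_settings(name: str, implementation: str, model_uri: str) -> dict:
    """Return the contents of an MLServer model-settings.json file."""
    return {
        "name": name,
        "implementation": implementation,
        # "uri" is resolved relative to the directory holding model-settings.json
        "parameters": {"uri": model_uri},
    }


def write_model_settings(artifact_dir: Path, settings: dict) -> Path:
    """Write model-settings.json into the artifact directory and return its path."""
    artifact_dir.mkdir(parents=True, exist_ok=True)
    target = artifact_dir / "model-settings.json"
    target.write_text(json.dumps(settings, indent=2) + "\n", encoding="utf-8")
    return target


if __name__ == "__main__":
    settings = build_model_settings(
        name="iris",  # hypothetical model name
        implementation="mlserver_sklearn.SKLearnModel",
        model_uri="./model.joblib",  # hypothetical artifact file name
    )
    print(json.dumps(settings, indent=2))
```

The resulting directory (artifact plus model-settings.json) is what the storage URI in the next step would point to.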

Step 2: Define Model Resource

Create a Seldon Model custom resource manifest (YAML) that declares the model's storage URI, framework requirements, and optional memory allocation. The requirements field determines which inference servers are eligible to host the model, and the memory field tells the scheduler how much server memory to budget for it.

Key considerations:

  • The storageUri must point to the directory containing the model artifact and settings
  • Requirements (e.g., sklearn, tensorflow, huggingface) must match available server capabilities
  • Memory allocation (e.g., 100Ki) influences multi-model serving scheduling decisions
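
The fields described above can be sketched as a small manifest builder. This is an illustrative helper, not a Seldon API: the function name is invented here, and the bucket path in the usage example is a hypothetical placeholder. The output is JSON, which kubectl accepts in place of YAML.

```python
import json
from typing import Iterable, Optional


def build_model_manifest(
    name: str,
    storage_uri: str,
    requirements: Iterable[str],
    memory: Optional[str] = None,
    namespace: str = "seldon-mesh",
) -> dict:
    """Build a Seldon Core 2 Model custom resource as a plain dict."""
    spec = {
        "storageUri": storage_uri,           # directory with artifact and settings
        "requirements": list(requirements),  # matched against server capabilities
    }
    if memory is not None:
        spec["memory"] = memory              # optional, e.g. "100Ki"
    return {
        "apiVersion": "mlops.seldon.io/v1alpha1",
        "kind": "Model",
        "metadata": {"name": name, "namespace": namespace},
        "spec": spec,
    }


if __name__ == "__main__":
    manifest = build_model_manifest(
        name="iris",
        storage_uri="gs://example-bucket/models/iris",  # hypothetical location
        requirements=["sklearn"],
        memory="100Ki",
    )
    print(json.dumps(manifest, indent=2))
```

Leaving memory unset omits the field entirely, which keeps the manifest valid for cases where no allocation hint is wanted.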

Step 3: Deploy Model

Apply the Model resource to the cluster (via kubectl or seldon CLI). The Seldon operator reconciles the resource, the scheduler assigns it to a compatible inference server, and the agent downloads the model artifact from storage to the server pod.

Key considerations:

  • In Kubernetes mode, apply to the namespace where SeldonRuntime is configured (typically seldon-mesh)
  • In local mode, use the seldon CLI directly against the scheduler endpoint
  • The model transitions through states: ScheduleRequested, Scheduled, Loading, Loaded, Available
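
The two deployment routes can be sketched as command builders. This assumes kubectl and the seldon CLI are installed and on the PATH; the manifest file name is a hypothetical placeholder, and the helper functions are illustrative only. Any scheduler endpoint options for the seldon CLI are left out and would need to be added for a non-default setup.

```python
import shlex
import subprocess
from typing import List


def kubectl_apply_command(manifest_path: str, namespace: str = "seldon-mesh") -> List[str]:
    """Kubernetes mode: apply the Model manifest to the SeldonRuntime namespace."""
    return ["kubectl", "apply", "-f", manifest_path, "-n", namespace]


def seldon_load_command(manifest_path: str) -> List[str]:
    """Local mode: load the model through the seldon CLI."""
    return ["seldon", "model", "load", "-f", manifest_path]


def run(command: List[str]) -> None:
    """Execute a command and raise CalledProcessError on a non-zero exit code."""
    subprocess.run(command, check=True)


if __name__ == "__main__":
    # Print the commands instead of running them, so the sketch is safe to try.
    print(shlex.join(kubectl_apply_command("model.json")))
    print(shlex.join(seldon_load_command("model.json")))
```

Passing the arguments as a list rather than a shell string avoids quoting problems with paths that contain spaces.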

Step 4: Verify Readiness

Wait for the model to reach the ModelAvailable condition, confirming that the model has been downloaded, loaded into the inference server, and is ready to serve predictions. Query the model status to verify successful deployment and check for any error conditions.

Key considerations:

  • Use blocking wait commands with timeout to avoid indefinite polling
  • Check model status for detailed error messages if the model fails to load
  • Common issues include incorrect storageUri paths, missing requirements, or insufficient server memory
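
A bounded wait can be sketched in two ways: delegating to kubectl wait with a timeout, or polling a status check against a deadline. The helper names below are invented for illustration, and the check callable is a stand-in for whatever status query the deployment uses (for example, a call to the seldon CLI or the Kubernetes API).

```python
import time
from typing import Callable, List


def kubectl_wait_command(name: str, timeout_s: int = 300, namespace: str = "seldon-mesh") -> List[str]:
    """Build a blocking kubectl wait on the Model resource's ready condition."""
    return [
        "kubectl", "wait", "--for", "condition=ready",
        f"--timeout={timeout_s}s", f"model/{name}", "-n", namespace,
    ]


def wait_until(check: Callable[[], bool], timeout_s: float, interval_s: float = 1.0) -> bool:
    """Poll check() until it returns True or the deadline passes.

    Returns True on success and False on timeout, so the caller decides
    how to report the failure.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if check():
            return True
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return False
        time.sleep(min(interval_s, remaining))


if __name__ == "__main__":
    # Simulated status check that succeeds on the third poll.
    polls = iter([False, False, True])
    print(wait_until(lambda: next(polls), timeout_s=5, interval_s=0.1))
```

Using a monotonic clock keeps the deadline correct even if the system clock is adjusted while waiting.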

Step 5: Run Inference

Send prediction requests to the deployed model using the V2 inference protocol via REST or gRPC. Requests follow the Open Inference Protocol format with named inputs containing typed tensor data.

Key considerations:

  • REST endpoint follows the pattern: /v2/models/{model_name}/infer
  • gRPC uses the standard V2 inference service definition
  • Both REST and gRPC are available simultaneously on the same model
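  • Batch inference is supported either by packing several instances into one request (the first tensor dimension is typically the batch size) or by sending multiple requests in sequence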
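
A REST request in this format can be sketched with the standard library alone. The base URL, model name, and input name below are hypothetical placeholders, and the helper is illustrative rather than part of any Seldon client. The request is built but not sent, so the sketch runs without a live endpoint.

```python
import json
import urllib.request
from typing import List, Sequence


def build_infer_request(
    base_url: str,
    model_name: str,
    input_name: str,
    rows: Sequence[Sequence[float]],
    datatype: str = "FP32",
) -> urllib.request.Request:
    """Build a V2 (Open Inference Protocol) REST request for a 2-D input tensor."""
    if not rows:
        raise ValueError("rows must contain at least one instance")
    n_cols = len(rows[0])
    if any(len(row) != n_cols for row in rows):
        raise ValueError("all rows must have the same length")
    flat: List[float] = [value for row in rows for value in row]  # row-major order
    payload = {
        "inputs": [
            {
                "name": input_name,
                "shape": [len(rows), n_cols],
                "datatype": datatype,
                "data": flat,
            }
        ]
    }
    url = f"{base_url.rstrip('/')}/v2/models/{model_name}/infer"
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    request = build_infer_request(
        "http://localhost:9000",  # hypothetical endpoint
        "iris",
        "predict",
        [[5.1, 3.5, 1.4, 0.2]],
    )
    print(request.full_url)
    print(request.data.decode("utf-8"))
```

Sending the request is then a matter of passing it to urllib.request.urlopen. Depending on how the endpoint is exposed, additional routing headers may be required; the deployment's ingress configuration is the place to check.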

Step 6: Manage Lifecycle

Unload the model when it is no longer needed to free server resources. Models can be updated by re-applying modified manifests (e.g., changing the storageUri for a new version), which triggers a rolling update.

Key considerations:

  • Unloading removes the model from the server but does not delete the artifact from storage
  • Rolling updates allow zero-downtime model version changes
  • Multiple model versions can coexist for gradual migration
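
The update and unload operations can be sketched as follows. The manifest layout mirrors the Model resource fields described in Step 2; the helper names and both storage paths are hypothetical. Re-applying the updated manifest is what would trigger the update, and deleting the Model resource is one way to unload it in Kubernetes mode.

```python
import copy
from typing import List


def with_new_storage_uri(manifest: dict, new_uri: str) -> dict:
    """Return a copy of the manifest pointing at a new artifact location.

    The original manifest is left untouched so the previous version can
    still be re-applied if a rollback is needed.
    """
    updated = copy.deepcopy(manifest)
    updated["spec"]["storageUri"] = new_uri
    return updated


def kubectl_delete_model_command(name: str, namespace: str = "seldon-mesh") -> List[str]:
    """Build the command that removes the Model resource (artifact storage is unaffected)."""
    return ["kubectl", "delete", "model", name, "-n", namespace]


if __name__ == "__main__":
    current = {
        "apiVersion": "mlops.seldon.io/v1alpha1",
        "kind": "Model",
        "metadata": {"name": "iris"},
        "spec": {
            "storageUri": "gs://example-bucket/models/iris-v1",  # hypothetical
            "requirements": ["sklearn"],
        },
    }
    updated = with_new_storage_uri(current, "gs://example-bucket/models/iris-v2")
    print(updated["spec"]["storageUri"])
    print(" ".join(kubectl_delete_model_command("iris")))
```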
