Workflow:SeldonIO Seldon core Model Deployment
| Knowledge Sources | |
|---|---|
| Domains | MLOps, Model_Serving, Kubernetes |
| Last Updated | 2026-02-13 14:00 GMT |
Overview
End-to-end process for deploying a pre-trained machine learning model on Seldon Core 2 and serving predictions via REST or gRPC.
Description
This workflow covers the foundational operation in Seldon Core 2: taking a trained model artifact stored in cloud or local storage and making it available as a production inference endpoint. The process involves defining a Model custom resource with its storage location, framework requirements, and memory allocation, then loading it onto an inference server (MLServer or Triton) where it becomes available for real-time predictions. Seldon Core 2 supports a wide range of ML frameworks including scikit-learn, TensorFlow, PyTorch, ONNX, XGBoost, LightGBM, MLflow, and HuggingFace transformers.
Usage
Execute this workflow when you have a trained model artifact (e.g., a serialized sklearn model, a TensorFlow SavedModel, or an ONNX file) stored in accessible storage (GCS, S3, MinIO, or local filesystem) and need to expose it as a scalable inference endpoint. This is the starting point for all Seldon Core 2 deployments, whether running locally via Docker Compose or on a Kubernetes cluster.
Execution Steps
Step 1: Prepare Model Artifact
Ensure the trained model artifact is stored in an accessible location with the correct directory structure expected by the inference server. For MLServer-based models, a model-settings.json file must accompany the model artifact, specifying the model name, implementation class, and any parameters. For Triton-based models, the standard Triton model repository layout with version directories is required.
Key considerations:
- Model artifacts must include framework-specific metadata (model-settings.json for MLServer, config.pbtxt for Triton)
- Supported storage backends include Google Cloud Storage (gs://), Amazon S3 (s3://), MinIO, and local filesystem paths
- For authenticated storage, configure rclone secrets with the appropriate credentials
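As a rough illustration of the MLServer metadata mentioned above, the sketch below writes a minimal model-settings.json next to a scikit-learn artifact. The model name `iris`, the artifact file name `model.joblib`, and the temporary output directory are assumptions for illustration. `mlserver_sklearn.SKLearnModel` is the implementation class shipped with MLServer's scikit-learn runtime, but the exact fields should be checked against the MLServer version in use.

```python
import json
import tempfile
from pathlib import Path


def write_model_settings(model_dir, name, artifact):
    """Write a minimal MLServer model-settings.json into model_dir.

    `name` and `artifact` are illustrative values chosen by the caller.
    """
    settings = {
        "name": name,
        # Implementation class for MLServer's scikit-learn runtime.
        "implementation": "mlserver_sklearn.SKLearnModel",
        "parameters": {
            # Path to the serialized model, relative to this file.
            "uri": "./" + artifact,
            "version": "v1",
        },
    }
    model_dir = Path(model_dir)
    model_dir.mkdir(parents=True, exist_ok=True)
    path = model_dir / "model-settings.json"
    path.write_text(json.dumps(settings, indent=2) + "\n")
    return path


if __name__ == "__main__":
    out = write_model_settings(Path(tempfile.mkdtemp()) / "iris", "iris", "model.joblib")
    print(out.read_text())
```

The resulting directory (settings file plus artifact) is what the storage URI in the next step would point to.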
Step 2: Define Model Resource
Create a Seldon Model custom resource manifest (YAML) that declares the model's storage URI, framework requirements, and optional memory allocation. The requirements field determines which inference servers the model can be assigned to, and the memory field declares how much server memory the model is expected to need, which the scheduler takes into account when placing it.
Key considerations:
- The storageUri must point to the directory containing the model artifact and settings
- Requirements (e.g., sklearn, tensorflow, huggingface) must match available server capabilities
- Memory allocation (e.g., 100Ki) influences multi-model serving scheduling decisions
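The fields described above can be sketched as a small manifest builder. This is a minimal example, assuming the `mlops.seldon.io/v1alpha1` API version used by Seldon Core 2 Model resources; the bucket path `gs://example-bucket/models/iris` and the model name are hypothetical placeholders. The manifest is emitted as JSON, which kubectl accepts alongside YAML.

```python
import json


def build_model_manifest(name, storage_uri, requirements, memory=None,
                         namespace="seldon-mesh"):
    """Return a Seldon Core 2 Model manifest as a plain dict."""
    spec = {
        "storageUri": storage_uri,           # directory with artifact + settings
        "requirements": list(requirements),  # must match server capabilities
    }
    if memory is not None:
        spec["memory"] = memory              # e.g. "100Ki"; optional
    return {
        "apiVersion": "mlops.seldon.io/v1alpha1",
        "kind": "Model",
        "metadata": {"name": name, "namespace": namespace},
        "spec": spec,
    }


if __name__ == "__main__":
    manifest = build_model_manifest(
        "iris",
        "gs://example-bucket/models/iris",  # hypothetical location
        ["sklearn"],
        memory="100Ki",
    )
    print(json.dumps(manifest, indent=2))
```

Keeping the manifest as data rather than a hand-edited file makes it easier to change the storage URI later when rolling out a new version.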
Step 3: Deploy Model
Apply the Model resource to the cluster (via kubectl or the seldon CLI). The Seldon operator reconciles the resource, the scheduler assigns it to a compatible inference server, and the agent downloads the model artifact from storage to the server pod.
Key considerations:
- In Kubernetes mode, apply to the namespace where SeldonRuntime is configured (typically seldon-mesh)
- In local mode, use the seldon CLI directly against the scheduler endpoint
- The model transitions through states: ScheduleRequested, Scheduled, Loading, Loaded, Available
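For the Kubernetes path, applying the resource can be scripted as in the sketch below. It assumes kubectl is on the PATH and already configured for the target cluster, and that `seldon-mesh` is the namespace in use. The manifest is piped over stdin with `kubectl apply -f -`, so no temporary file is needed. The helper names are illustrative, not part of any Seldon tooling.

```python
import json
import subprocess


def kubectl_apply_args(namespace="seldon-mesh"):
    """Argument list for applying a manifest read from stdin."""
    return ["kubectl", "apply", "-n", namespace, "-f", "-"]


def apply_manifest(manifest, namespace="seldon-mesh"):
    """Apply a manifest dict to the cluster and return kubectl's stdout.

    Raises subprocess.CalledProcessError if kubectl exits non-zero.
    """
    result = subprocess.run(
        kubectl_apply_args(namespace),
        input=json.dumps(manifest),  # JSON is accepted as a manifest format
        text=True,
        capture_output=True,
        check=True,
    )
    return result.stdout


if __name__ == "__main__":
    # Only print the command here; call apply_manifest() against a real cluster.
    print(" ".join(kubectl_apply_args()))
```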
Step 4: Verify Readiness
Wait for the model to reach the ModelAvailable condition, confirming that the model has been downloaded, loaded into the inference server, and is ready to serve predictions. Query the model status to verify successful deployment and check for any error conditions.
Key considerations:
- Use blocking wait commands with a timeout to avoid indefinite polling
- Check model status for detailed error messages if the model fails to load
- Common issues include incorrect storageUri paths, missing requirements, or insufficient server memory
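A blocking wait with a timeout might look like the following sketch, which wraps `kubectl wait`. It assumes the Model resource exposes a `ready` condition and that the resource can be addressed as `model/<name>`; both should be verified against the installed CRDs. The 300-second default is an arbitrary illustrative value.

```python
import subprocess


def kubectl_wait_args(model_name, namespace="seldon-mesh", timeout_s=300):
    """Argument list that blocks until the model is ready or the timeout expires."""
    return [
        "kubectl", "wait",
        "--for=condition=ready",
        "--timeout={}s".format(timeout_s),
        "model/{}".format(model_name),
        "-n", namespace,
    ]


def wait_for_model(model_name, namespace="seldon-mesh", timeout_s=300):
    """Return True if the model became ready, False on timeout or error."""
    proc = subprocess.run(
        kubectl_wait_args(model_name, namespace, timeout_s),
        capture_output=True,
        text=True,
    )
    if proc.returncode != 0:
        # kubectl reports the reason (e.g. a timeout) on stderr.
        print(proc.stderr.strip())
    return proc.returncode == 0


if __name__ == "__main__":
    # Only print the command here; call wait_for_model() against a real cluster.
    print(" ".join(kubectl_wait_args("iris")))
```

If the wait fails, inspecting the resource status (for example with `kubectl get model <name> -o yaml`) is a reasonable next step for finding the error message.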
Step 5: Run Inference
Send prediction requests to the deployed model using the V2 inference protocol via REST or gRPC. Requests follow the Open Inference Protocol format with named inputs containing typed tensor data.
Key considerations:
- REST endpoint follows the pattern: /v2/models/{model_name}/infer
- gRPC uses the standard V2 inference service definition
- Both REST and gRPC are available simultaneously on the same model
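- Multiple instances can be batched into a single request by stacking them along the first dimension of the input tensor; larger workloads can also be sent as multiple requests in sequence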
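The REST request format can be sketched with the standard library alone. The example below builds an Open Inference Protocol payload from a list of feature rows and posts it to the `/v2/models/{model_name}/infer` path. The input name `predict`, the `FP32` datatype, and the base URL are assumptions that depend on the deployed model and on how the inference endpoint is exposed.

```python
import json
import urllib.request


def build_infer_payload(rows, input_name="predict", datatype="FP32"):
    """Build a V2 inference payload with one named input tensor."""
    n_cols = len(rows[0]) if rows else 0
    if any(len(row) != n_cols for row in rows):
        raise ValueError("all rows must have the same number of features")
    return {
        "inputs": [
            {
                "name": input_name,
                "shape": [len(rows), n_cols],  # first dimension is the batch
                "datatype": datatype,
                "data": [value for row in rows for value in row],  # row-major
            }
        ]
    }


def infer_url(base_url, model_name):
    """Compose the REST inference URL for a model."""
    return "{}/v2/models/{}/infer".format(base_url.rstrip("/"), model_name)


def infer(base_url, model_name, rows, timeout_s=10):
    """POST the payload and return the decoded JSON response."""
    request = urllib.request.Request(
        infer_url(base_url, model_name),
        data=json.dumps(build_infer_payload(rows)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return json.loads(response.read())


if __name__ == "__main__":
    print(json.dumps(build_infer_payload([[1.0, 2.0, 3.0, 4.0]])))
```

A call such as `infer("http://localhost:9000", "iris", [[1, 2, 3, 4]])` would then return the model's outputs, where the host and port are placeholders for the actual endpoint.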
Step 6: Manage Lifecycle
Unload the model when it is no longer needed to free server resources. Models can be updated by re-applying modified manifests (e.g., changing the storageUri for a new version), which triggers a rolling update.
Key considerations:
- Unloading removes the model from the server but does not delete the artifact from storage
- Rolling updates allow zero-downtime model version changes
- Multiple model versions can coexist for gradual migration
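The two lifecycle operations above can be sketched as follows: producing an updated manifest that points at a new artifact location, and composing the kubectl command that unloads a model by deleting its resource. The helper names and the example URIs are hypothetical; re-applying the updated manifest is what would trigger the update.

```python
import copy


def with_storage_uri(manifest, new_uri):
    """Return a copy of the manifest pointing at a new artifact location."""
    updated = copy.deepcopy(manifest)  # leave the caller's manifest untouched
    updated["spec"]["storageUri"] = new_uri
    return updated


def kubectl_delete_args(model_name, namespace="seldon-mesh"):
    """Argument list that removes the Model resource (the artifact stays in storage)."""
    return ["kubectl", "delete", "model", model_name, "-n", namespace]


if __name__ == "__main__":
    current = {
        "apiVersion": "mlops.seldon.io/v1alpha1",
        "kind": "Model",
        "metadata": {"name": "iris"},
        "spec": {
            "storageUri": "gs://example-bucket/models/iris-v1",  # hypothetical
            "requirements": ["sklearn"],
        },
    }
    updated = with_storage_uri(current, "gs://example-bucket/models/iris-v2")
    print(updated["spec"]["storageUri"])
    print(" ".join(kubectl_delete_args("iris")))
```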