Workflow:Kserve Kserve Deploying InferenceService

Knowledge Sources	KServe, KServe Website, InferenceService API
Domains	ML_Serving, Kubernetes, Model_Deployment
Last Updated	2026-02-13 14:00 GMT

Overview

End-to-end process for deploying a machine learning model as a scalable inference endpoint on Kubernetes using KServe InferenceService.

Description

This workflow covers the standard procedure for deploying a trained ML model to production using KServe's InferenceService custom resource. It supports multiple model frameworks (TensorFlow, PyTorch, scikit-learn, XGBoost, LightGBM, PMML, PaddlePaddle, ONNX) and storage backends (S3, GCS, Azure Blob, PVC, HuggingFace Hub). The process runs from preparing storage credentials and writing the InferenceService YAML specification through to verifying the running prediction endpoint. KServe automatically handles model download (via the storage initializer), serving runtime selection, autoscaling, and ingress routing.

Usage

Execute this workflow when you have a trained model stored in a supported storage backend (S3, GCS, Azure, PVC, or HuggingFace Hub) and need to expose it as an HTTP or gRPC prediction endpoint on a Kubernetes cluster with KServe installed. This is the foundational workflow for all predictive AI serving on KServe.

Execution Steps

Step 1: Prepare storage credentials

Create Kubernetes Secrets containing the credentials required to access the model storage backend. Depending on the backend, these are S3 access keys, a GCS service account JSON key, Azure storage keys, or a HuggingFace token. Attach the secret to a ServiceAccount that the InferenceService will reference, or use the default storage config secret in the KServe namespace.

Key considerations:

Each storage backend has a specific secret format expected by KServe
The ServiceAccount must be in the same namespace as the InferenceService
For public models (e.g., public GCS buckets), credentials may not be required
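
As an illustration of the secret and ServiceAccount pairing for an S3 backend, the following is a minimal sketch rather than a definitive setup. The resource names, namespace, endpoint, region, and key values are placeholders invented for this example; the annotation keys and data keys follow the S3 secret format as documented by KServe, so check them against the version you run. The script only prints a JSON manifest, which kubectl accepts in place of YAML.

```python
import json


def s3_secret(name, namespace, access_key, secret_key, endpoint, region):
    """Build a Secret manifest holding S3 credentials for model download."""
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {
            "name": name,
            "namespace": namespace,
            # Annotations tell KServe which S3 endpoint and region to use.
            "annotations": {
                "serving.kserve.io/s3-endpoint": endpoint,
                "serving.kserve.io/s3-usehttps": "1",
                "serving.kserve.io/s3-region": region,
            },
        },
        "type": "Opaque",
        # stringData lets Kubernetes do the base64 encoding.
        "stringData": {
            "AWS_ACCESS_KEY_ID": access_key,
            "AWS_SECRET_ACCESS_KEY": secret_key,
        },
    }


def service_account(name, namespace, secret_name):
    """Build a ServiceAccount that references the credentials secret."""
    return {
        "apiVersion": "v1",
        "kind": "ServiceAccount",
        "metadata": {"name": name, "namespace": namespace},
        "secrets": [{"name": secret_name}],
    }


# All values below are placeholders (assumptions for illustration only).
secret = s3_secret("s3-creds", "models", "EXAMPLE_KEY_ID", "EXAMPLE_SECRET",
                   "s3.amazonaws.com", "us-east-1")
sa = service_account("kserve-sa", "models", secret["metadata"]["name"])

# A List wraps both objects so they can be applied in one kubectl call.
manifest = {"apiVersion": "v1", "kind": "List", "items": [secret, sa]}
print(json.dumps(manifest, indent=2))
```

Both objects are created in the same namespace, which is also the namespace the InferenceService must live in.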

Step 2: Write InferenceService specification

Author the InferenceService YAML manifest specifying the model framework, the storage URI pointing to the model artifacts, and any resource requirements. The spec includes the predictor component, which carries the storageUri field under either a framework-specific field (e.g., tensorflow, pytorch, sklearn) or the generic model field with a modelFormat name. Optionally configure transformer and explainer components.

Key considerations:

Choose the correct framework field matching your model type
The storageUri must point to the directory containing the model artifacts
Resource requests and limits determine GPU/CPU/memory allocation per replica
Protocol version (v1 or v2) determines the prediction API format
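
The considerations above can be sketched as a small manifest builder. This is an illustrative example under stated assumptions: the service name, namespace, bucket path, ServiceAccount name, and resource figures are hypothetical, and the helper function is not part of any KServe library. It uses the framework-specific predictor field (here sklearn) and emits JSON, which kubectl accepts in place of YAML.

```python
import json


def inference_service(name, namespace, framework, storage_uri,
                      service_account=None, protocol_version=None,
                      resources=None):
    """Build a v1beta1 InferenceService manifest with a single predictor."""
    framework_spec = {"storageUri": storage_uri}
    if protocol_version is not None:
        # v1 or v2; this decides the prediction API format.
        framework_spec["protocolVersion"] = protocol_version
    if resources is not None:
        # Requests and limits apply to each replica.
        framework_spec["resources"] = resources

    predictor = {framework: framework_spec}
    if service_account is not None:
        # ServiceAccount carrying the storage credentials.
        predictor["serviceAccountName"] = service_account

    return {
        "apiVersion": "serving.kserve.io/v1beta1",
        "kind": "InferenceService",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {"predictor": predictor},
    }


# Placeholder values (assumptions for illustration only).
isvc = inference_service(
    name="sklearn-iris",
    namespace="models",
    framework="sklearn",
    storage_uri="s3://example-bucket/models/sklearn/iris",
    service_account="kserve-sa",
    protocol_version="v2",
    resources={
        "requests": {"cpu": "500m", "memory": "1Gi"},
        "limits": {"cpu": "1", "memory": "2Gi"},
    },
)
print(json.dumps(isvc, indent=2))
```

Note that the storage URI names the directory holding the artifacts, not an individual model file.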

Step 3: Apply the InferenceService to Kubernetes

Submit the InferenceService manifest to the Kubernetes cluster using kubectl. The API server passes the resource through KServe's admission webhooks, which apply default values (serving runtime, resource limits, timeout) and validate the spec. Once the resource is admitted, the KServe controller manager begins its reconciliation loop.

What happens internally:

Mutating webhook injects defaults from the inferenceservice-config ConfigMap
Validating webhook checks the spec for correctness
Controller selects the appropriate ClusterServingRuntime or ServingRuntime
Storage initializer init-container is injected to download the model
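
Submitting the manifest can be scripted as well as typed by hand. The sketch below assumes kubectl is on the PATH and already configured for the target cluster; the function names are hypothetical helpers. It pipes a JSON manifest to kubectl apply on standard input, after which the webhook and controller steps listed above happen on the cluster side.

```python
import json
import subprocess


def kubectl_apply_command(namespace):
    """Return the kubectl argument list for applying a manifest from stdin."""
    return ["kubectl", "apply", "-n", namespace, "-f", "-"]


def apply_manifest(manifest, namespace):
    """Send a manifest (a dict) to the cluster and return kubectl's output.

    Raises subprocess.CalledProcessError if kubectl exits non-zero, for
    example when the validating webhook rejects the spec.
    """
    result = subprocess.run(
        kubectl_apply_command(namespace),
        input=json.dumps(manifest),
        text=True,
        capture_output=True,
        check=True,
    )
    return result.stdout


# Example call (needs a cluster, so it is left commented out):
# print(apply_manifest(my_inference_service_dict, "models"))
```

A rejection from the validating webhook surfaces as a non-zero exit, with the reason in the captured stderr of the raised exception.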

Step 4: Wait for model download and readiness

The storage initializer init-container downloads model artifacts from the specified storageUri to a local volume. Once the download completes, the model server container starts and loads the model into memory. The InferenceService transitions to Ready state when the model server passes health checks.

Key considerations:

Large models may require significant download time
Monitor pod events for download failures or storage credential issues
The READY column in kubectl get isvc shows True when the service is ready
Check init-container logs for download progress and errors
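
Readiness can also be polled programmatically instead of watching the READY column. The following is a rough sketch, assuming kubectl is configured for the cluster; the helper names, timeout, and polling interval are arbitrary choices for this example. It reads the Ready condition from the InferenceService status, which is the same information the READY column reports.

```python
import json
import subprocess
import time


def is_ready(isvc):
    """Return True if the InferenceService reports a Ready condition of True."""
    for condition in isvc.get("status", {}).get("conditions", []):
        if condition.get("type") == "Ready":
            return condition.get("status") == "True"
    # No Ready condition yet, e.g. the model is still downloading.
    return False


def wait_until_ready(name, namespace, timeout_s=600, interval_s=10):
    """Poll the InferenceService until it is Ready or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        out = subprocess.run(
            ["kubectl", "get", "inferenceservice", name,
             "-n", namespace, "-o", "json"],
            capture_output=True, text=True, check=True,
        ).stdout
        if is_ready(json.loads(out)):
            return True
        time.sleep(interval_s)
    return False


# Offline check against a hand-written status fragment:
sample = {"status": {"conditions": [{"type": "Ready", "status": "True"}]}}
print(is_ready(sample))
```

For large models, choose a timeout that comfortably exceeds the expected download time. If the wait expires, the init-container logs are the first place to look.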

Step 5: Determine the ingress endpoint

Retrieve the external URL for the deployed InferenceService. With Knative (serverless mode), the URL is assigned automatically and traffic is routed through the Istio ingress gateway. In raw deployment mode, a ClusterIP service is created instead. Extract the hostname from the URL and pass it as the Host header when sending requests through the ingress gateway.

What happens:

For Knative mode: URL available in the InferenceService status field
For raw mode: ClusterIP service created, optionally with Ingress
The service hostname must be passed as a Host header when accessing via ingress gateway
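
To make the hostname extraction concrete, here is a small sketch. It assumes the InferenceService object has already been fetched as a dict (for example from kubectl get with JSON output) and that the ingress gateway address is known separately; the sample URL, gateway IP, and port are made-up values.

```python
from urllib.parse import urlparse


def service_hostname(isvc):
    """Extract the hostname from the InferenceService status URL, if set."""
    url = isvc.get("status", {}).get("url")
    if not url:
        return None
    return urlparse(url).hostname


def ingress_target(isvc, ingress_host, ingress_port):
    """Return (base_url, headers) for reaching the service via the gateway."""
    hostname = service_hostname(isvc)
    if hostname is None:
        raise ValueError("InferenceService has no status.url yet")
    base_url = f"http://{ingress_host}:{ingress_port}"
    # The gateway routes on the Host header, not on the IP it was reached at.
    return base_url, {"Host": hostname}


# Hypothetical status fragment and gateway address for illustration.
sample = {"status": {"url": "http://sklearn-iris.models.example.com"}}
base_url, headers = ingress_target(sample, "203.0.113.10", 80)
print(base_url, headers)
```

When DNS for the service hostname resolves to the gateway, the status URL can be used directly and the explicit Host header becomes unnecessary.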

Step 6: Send prediction request and verify

Send a test prediction request to the inference endpoint using curl or a client library. The request format depends on the protocol version: v1 uses the TensorFlow Serving prediction format, v2 uses the Open Inference Protocol. Verify the response contains expected predictions.

Key considerations:

V1 protocol endpoint: /v1/models/{model_name}:predict
V2 protocol endpoint: /v2/models/{model_name}/infer
gRPC endpoints are available when the predictor container exposes a port named h2c
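
The two REST request shapes can be sketched with the standard library alone. This is an illustrative example, not a client implementation: the model name, input name, tensor shape, and feature values are assumptions, and the right input name and datatype depend on the model being served. The V1 body uses an instances list and the V2 body uses the Open Inference Protocol inputs list.

```python
import json
import urllib.request


def v1_request(base_url, model_name, instances):
    """Build the URL and JSON body for a V1 protocol prediction."""
    url = f"{base_url}/v1/models/{model_name}:predict"
    return url, {"instances": instances}


def v2_request(base_url, model_name, input_name, shape, datatype, data):
    """Build the URL and JSON body for a V2 (Open Inference Protocol) call."""
    url = f"{base_url}/v2/models/{model_name}/infer"
    body = {"inputs": [{"name": input_name, "shape": shape,
                        "datatype": datatype, "data": data}]}
    return url, body


def send(url, body, host_header=None, timeout=30):
    """POST the JSON body and return the decoded JSON response."""
    request = urllib.request.Request(
        url, data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"}, method="POST",
    )
    if host_header:
        # Needed when going through the ingress gateway by IP address.
        request.add_header("Host", host_header)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())


# Placeholder endpoint, model name, and input (assumptions for illustration).
rows = [[6.8, 2.8, 4.8, 1.4], [6.0, 3.4, 4.5, 1.6]]
v1_url, v1_body = v1_request("http://203.0.113.10:80", "sklearn-iris", rows)
v2_url, v2_body = v2_request("http://203.0.113.10:80", "sklearn-iris",
                             "input-0", [2, 4], "FP32", rows)
print(v1_url)
print(v2_url)
```

A V1 response is expected to carry a predictions field and a V2 response an outputs list; comparing these against known-good values for a few inputs is a reasonable verification step.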

GitHub URL

Workflow Repository