Implementation: Kubeflow KServe InferenceService CRD
| Knowledge Sources | |
|---|---|
| Domains | MLOps, Model Serving, Kubernetes |
| Last Updated | 2026-02-13 00:00 GMT |
Overview
A KServe-provided Kubernetes custom resource for deploying trained models as auto-scaling inference endpoints.
Description
The InferenceService CRD is the Kubernetes-native resource through which KServe manages model serving deployments. When an InferenceService resource is created, the KServe controller provisions the serving infrastructure, including the predictor pod (running the selected serving runtime), optional transformer and explainer components (deployed as separate services), Knative-based autoscaling (or a Kubernetes HPA in raw deployment mode), and Istio-based traffic routing.
The InferenceService supports multiple serving runtimes out of the box: TorchServe (PyTorch), TensorFlow Serving (TensorFlow/Keras), Triton Inference Server (multi-framework), MLServer (scikit-learn, XGBoost, LightGBM), HuggingFace (Transformers models), and custom container runtimes. All runtimes expose standardized REST and gRPC endpoints conforming to the Open Inference Protocol (V2).
Canary deployments are supported natively through the canaryTrafficPercent field, allowing gradual traffic migration between model versions. The CRD also supports model explainability via integrated explainer components (ART Explainer, Alibi Explainer).
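Because every runtime exposes the Open Inference Protocol (V2), request bodies share a common tensor envelope regardless of framework. A minimal sketch of building such a request body in Python; the tensor name and shape are illustrative placeholders, not values mandated by the protocol:

```python
import json

def build_infer_request(tensor_name, shape, datatype, data):
    """Build an Open Inference Protocol V2 inference request body.

    The tensor name and shape here are placeholders; real values depend
    on the deployed model's input signature.
    """
    return {
        "inputs": [
            {
                "name": tensor_name,
                "shape": shape,
                "datatype": datatype,  # e.g. FP32, INT64, BYTES
                "data": data,
            }
        ]
    }

# Serialize for a POST to /v2/models/<model-name>/infer
body = json.dumps(build_infer_request("input-0", [1, 3], "FP32", [0.1, 0.2, 0.3]))
```

The same body works against any of the runtimes listed above, which is what makes swapping serving runtimes behind an InferenceService transparent to clients.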
Usage
Use the InferenceService CRD when:
- A trained model must be deployed for real-time REST or gRPC inference.
- Auto-scaling (including scale-to-zero) is required for cost-efficient serving.
- Canary or blue-green deployment is needed for safe model version rollouts.
- Pre-processing or post-processing transformations must be co-deployed with the model.
- Model explainability must be available alongside prediction responses.
- The serving deployment must integrate with Kubeflow Model Registry for model versioning.
Code Reference
Source Location
- Repository: kserve/kserve
- File: config/crd/serving.kserve.io_inferenceservices.yaml (CRD schema)
Signature
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: <service-name>
  namespace: <namespace>
spec:
  predictor:
    model:
      modelFormat:
        name: <pytorch|tensorflow|sklearn|xgboost|onnx|triton|huggingface>
      storageUri: <model-storage-uri>
      resources:
        requests:
          cpu: "<cpu>"
          memory: "<memory>"
          nvidia.com/gpu: "<gpu-count>"
    minReplicas: <min-replicas>
    maxReplicas: <max-replicas>
    scaleTarget: <target-concurrency>
    canaryTrafficPercent: <0-100>
  transformer:
    containers:
    - name: <transformer-name>
      image: <transformer-image>
  explainer:
    alibi:
      type: <AnchorTabular|AnchorImages|AnchorText|...>
      storageUri: <explainer-model-uri>
Import
# Install KServe
kubectl apply -f https://github.com/kserve/kserve/releases/download/v0.14.0/kserve.yaml
kubectl apply -f https://github.com/kserve/kserve/releases/download/v0.14.0/kserve-cluster-resources.yaml
# Deploy an InferenceService
kubectl apply -f inferenceservice.yaml
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| metadata.name | string | Yes | Name of the InferenceService resource |
| metadata.namespace | string | Yes | Kubernetes namespace for the serving deployment |
| spec.predictor.model.modelFormat.name | string | Yes | Model framework format (pytorch, tensorflow, sklearn, etc.) |
| spec.predictor.model.storageUri | string | Yes | Storage URI pointing to the model artifacts |
| spec.predictor.model.resources | object | Yes | CPU, memory, and GPU resource requests and limits |
| spec.predictor.minReplicas | integer | No | Minimum number of replicas (0 enables scale-to-zero) |
| spec.predictor.maxReplicas | integer | No | Maximum number of replicas for autoscaling |
| spec.transformer | object | No | Pre/post-processing transformer container configuration |
| spec.explainer | object | No | Model explainability component configuration |
| spec.predictor.canaryTrafficPercent | integer | No | Percentage of traffic routed to the canary (latest) revision |
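The required fields in the table above can be checked mechanically before a manifest is applied. A hedged sketch, assuming the manifest has already been loaded into a plain Python dict (e.g. by a YAML parser); the helper name is illustrative, not part of KServe:

```python
def check_required_fields(manifest):
    """Return the required InferenceService field paths (per the
    I/O contract above) that are missing from a manifest dict."""
    required = [
        ("metadata", "name"),
        ("metadata", "namespace"),
        ("spec", "predictor", "model", "modelFormat", "name"),
        ("spec", "predictor", "model", "storageUri"),
        ("spec", "predictor", "model", "resources"),
    ]
    missing = []
    for path in required:
        node = manifest
        for key in path:
            if not isinstance(node, dict) or key not in node:
                missing.append(".".join(path))
                break
            node = node[key]
    return missing
```

A manifest missing its storage URI, for example, would be reported as `spec.predictor.model.storageUri`.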
Outputs
| Name | Type | Description |
|---|---|---|
| Inference endpoint URL | string | REST and gRPC endpoint URL for sending prediction requests |
| Prediction response | JSON | Model prediction output conforming to Open Inference Protocol |
| Auto-scaling deployment | Knative Service or Deployment | Managed replicas that scale based on request concurrency |
| Traffic routing | Istio VirtualService | Traffic split configuration between model versions |
| InferenceService status | InferenceService.status | Ready condition, URL, traffic split, and component statuses |
Usage Examples
Basic Usage
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: fraud-detector
  namespace: ml-serving
spec:
  predictor:
    model:
      modelFormat:
        name: pytorch
      storageUri: "s3://ml-models/fraud-detector/v2/model.pt"
      resources:
        requests:
          cpu: "2"
          memory: "4Gi"
        limits:
          cpu: "4"
          memory: "8Gi"
    minReplicas: 1
    maxReplicas: 10
    scaleTarget: 5
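The resources block uses Kubernetes quantity strings, so it can be worth sanity-checking that requests do not exceed limits before applying a manifest. A minimal sketch that handles only the notations used in this document (plain CPU counts and binary memory suffixes), not the full Kubernetes quantity grammar:

```python
# Binary suffixes as defined for Kubernetes resource quantities.
_BINARY = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}

def parse_quantity(q):
    """Parse a subset of Kubernetes quantities ("2", "4Gi") to a number."""
    for suffix, factor in _BINARY.items():
        if q.endswith(suffix):
            return float(q[:-len(suffix)]) * factor
    return float(q)

def requests_within_limits(resources):
    """True if every request is <= the matching limit (when a limit is set)."""
    limits = resources.get("limits", {})
    return all(
        parse_quantity(v) <= parse_quantity(limits[k])
        for k, v in resources.get("requests", {}).items()
        if k in limits
    )
```

For the basic example above, `requests_within_limits` confirms that the 2 CPU / 4Gi requests fit inside the 4 CPU / 8Gi limits.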
Canary Deployment with Transformer
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: recommendation-engine
  namespace: ml-serving
spec:
  predictor:
    canaryTrafficPercent: 20
    model:
      modelFormat:
        name: tensorflow
      storageUri: "gs://ml-models/recommendation/v3"
      resources:
        requests:
          cpu: "4"
          memory: "8Gi"
          nvidia.com/gpu: "1"
    minReplicas: 2
    maxReplicas: 20
  transformer:
    containers:
    - name: feature-transformer
      image: my-registry/feature-transformer:v2
      resources:
        requests:
          cpu: "1"
          memory: "2Gi"
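With canaryTrafficPercent: 20, roughly one request in five reaches the canary revision while the rest stay on the stable one. A seeded simulation of that split for intuition only; the real split is enforced by the Knative/Istio routing layer, not application code:

```python
import random

def route(canary_percent, rng):
    """Pick 'canary' with canary_percent probability, else 'stable'."""
    return "canary" if rng.random() * 100 < canary_percent else "stable"

rng = random.Random(42)  # fixed seed for a reproducible illustration
counts = {"canary": 0, "stable": 0}
for _ in range(10_000):
    counts[route(20, rng)] += 1
# counts['canary'] lands near 2000 of the 10000 simulated requests
```

Raising canaryTrafficPercent in steps (e.g. 20 → 50 → 100) while watching the canary's error rate is the usual progressive-rollout pattern.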
Sending a Prediction Request
# REST prediction request using the Open Inference Protocol V2
curl -X POST \
"https://fraud-detector.ml-serving.example.com/v2/models/fraud-detector/infer" \
-H "Content-Type: application/json" \
-d '{
"inputs": [
{
"name": "input-0",
"shape": [1, 30],
"datatype": "FP32",
"data": [0.1, 0.2, 0.3, ...]
}
]
}'
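The prediction response comes back in the same Open Inference Protocol V2 envelope, with an outputs array mirroring the inputs. A sketch of extracting an output tensor from such a response; the envelope fields follow the protocol, while the tensor name and values are made-up sample data:

```python
import json

def extract_output(response_body, name=None):
    """Return the 'data' list of the named output tensor (or the first one)
    from an Open Inference Protocol V2 response body."""
    outputs = json.loads(response_body)["outputs"]
    if name is None:
        return outputs[0]["data"]
    return next(o["data"] for o in outputs if o["name"] == name)

# Example response in the V2 envelope (values are illustrative)
sample = json.dumps({
    "model_name": "fraud-detector",
    "outputs": [
        {"name": "output-0", "shape": [1, 2], "datatype": "FP32",
         "data": [0.97, 0.03]},
    ],
})
scores = extract_output(sample, "output-0")
```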