
Implementation:Kubeflow KServe InferenceService CRD

From Leeroopedia
Knowledge Sources
Domains MLOps, Model Serving, Kubernetes
Last Updated 2026-02-13 00:00 GMT

Overview

The InferenceService CRD is KServe's concrete tool for deploying models as auto-scaling inference endpoints.

Description

The InferenceService CRD is the Kubernetes-native resource through which KServe manages model serving deployments. When an InferenceService resource is created, the KServe controller provisions the serving infrastructure, including the predictor pod (running the selected serving runtime), optional transformer and explainer components, Knative-based autoscaling (or a Kubernetes HPA in raw deployment mode), and Istio-based traffic routing.

The InferenceService supports multiple serving runtimes out of the box: TorchServe (PyTorch), TensorFlow Serving (TensorFlow/Keras), Triton Inference Server (multi-framework), MLServer (scikit-learn, XGBoost, LightGBM), HuggingFace (Transformers models), and custom container runtimes. All runtimes expose standardized REST and gRPC endpoints conforming to the Open Inference Protocol (V2).

Canary deployments are supported natively through the predictor's canaryTrafficPercent field, allowing gradual traffic migration between model versions. The CRD also supports model explainability via integrated explainer components (ART Explainer, Alibi Explainer).
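When an explainer component is configured, explanations are requested over KServe's v1 data-plane protocol via the :explain verb. A minimal sketch, assuming a deployed service (the host, model name, and feature values below are illustrative placeholders):

```shell
# Request an explanation from the integrated explainer component
# (host and model name are placeholders for a deployed InferenceService)
curl -X POST "https://<service-host>/v1/models/<model-name>:explain" \
  -H "Content-Type: application/json" \
  -d '{"instances": [[0.1, 0.2, 0.3]]}'
```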

External Reference

Usage

Use the InferenceService CRD when:

  • A trained model must be deployed for real-time REST or gRPC inference.
  • Auto-scaling (including scale-to-zero) is required for cost-efficient serving.
  • Canary or blue-green deployment is needed for safe model version rollouts.
  • Pre-processing or post-processing transformations must be co-deployed with the model.
  • Model explainability must be available alongside prediction responses.
  • The serving deployment must integrate with Kubeflow Model Registry for model versioning.

Code Reference

Source Location

  • Repository: kserve/kserve
  • File: config/crd/serving.kserve.io_inferenceservices.yaml (CRD schema)

Signature

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: <service-name>
  namespace: <namespace>
spec:
  predictor:
    model:
      modelFormat:
        name: <pytorch|tensorflow|sklearn|xgboost|onnx|triton|huggingface>
      storageUri: <model-storage-uri>
      resources:
        requests:
          cpu: "<cpu>"
          memory: "<memory>"
          nvidia.com/gpu: "<gpu-count>"
    minReplicas: <min-replicas>
    maxReplicas: <max-replicas>
    scaleTarget: <target-concurrency>
    canaryTrafficPercent: <0-100>
  transformer:
    containers:
      - name: <transformer-name>
        image: <transformer-image>
  explainer:
    alibi:
      type: <AnchorTabular|AnchorImages|AnchorText|...>
      storageUri: <explainer-model-uri>

Import

# Install KServe
kubectl apply -f https://github.com/kserve/kserve/releases/download/v0.14.0/kserve.yaml
kubectl apply -f https://github.com/kserve/kserve/releases/download/v0.14.0/kserve-cluster-resources.yaml

# Deploy an InferenceService
kubectl apply -f inferenceservice.yaml
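After applying a manifest, deployment status can be checked with standard kubectl commands. A sketch, assuming the default install namespace; service name and namespace are placeholders:

```shell
# Verify the KServe controller is running (default install namespace)
kubectl get pods -n kserve

# List InferenceServices with their readiness, URL, and traffic split
kubectl get inferenceservices -n <namespace>

# Block until the service reports the Ready condition
kubectl wait --for=condition=Ready \
  inferenceservice/<service-name> -n <namespace> --timeout=300s
```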

I/O Contract

Inputs

| Name | Type | Required | Description |
| --- | --- | --- | --- |
| metadata.name | string | Yes | Name of the InferenceService resource |
| metadata.namespace | string | Yes | Kubernetes namespace for the serving deployment |
| spec.predictor.model.modelFormat.name | string | Yes | Model framework format (pytorch, tensorflow, sklearn, etc.) |
| spec.predictor.model.storageUri | string | Yes | Storage URI pointing to the model artifacts |
| spec.predictor.model.resources | object | Yes | CPU, memory, and GPU resource requests and limits |
| spec.predictor.minReplicas | integer | No | Minimum number of replicas (0 enables scale-to-zero) |
| spec.predictor.maxReplicas | integer | No | Maximum number of replicas for autoscaling |
| spec.predictor.scaleTarget | integer | No | Target concurrency per replica used by the autoscaler |
| spec.transformer | object | No | Pre/post-processing transformer container configuration |
| spec.explainer | object | No | Model explainability component configuration |
| spec.predictor.canaryTrafficPercent | integer | No | Percentage of traffic routed to the canary (latest) revision |

Outputs

| Name | Type | Description |
| --- | --- | --- |
| Inference endpoint URL | string | REST and gRPC endpoint URL for sending prediction requests |
| Prediction response | JSON | Model prediction output conforming to the Open Inference Protocol |
| Auto-scaling deployment | Knative Service or Deployment | Managed replicas that scale based on request concurrency |
| Traffic routing | Istio VirtualService | Traffic split configuration between model versions |
| InferenceService status | InferenceService.status | Ready condition, URL, traffic split, and component statuses |

Usage Examples

Basic Usage

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: fraud-detector
  namespace: ml-serving
spec:
  predictor:
    model:
      modelFormat:
        name: pytorch
      storageUri: "s3://ml-models/fraud-detector/v2/model.pt"
      resources:
        requests:
          cpu: "2"
          memory: "4Gi"
        limits:
          cpu: "4"
          memory: "8Gi"
    minReplicas: 1
    maxReplicas: 10
    scaleTarget: 5

Canary Deployment with Transformer

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: recommendation-engine
  namespace: ml-serving
spec:
  predictor:
    model:
      modelFormat:
        name: tensorflow
      storageUri: "gs://ml-models/recommendation/v3"
      resources:
        requests:
          cpu: "4"
          memory: "8Gi"
          nvidia.com/gpu: "1"
    minReplicas: 2
    maxReplicas: 20
    canaryTrafficPercent: 20
  transformer:
    containers:
      - name: feature-transformer
        image: my-registry/feature-transformer:v2
        resources:
          requests:
            cpu: "1"
            memory: "2Gi"
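Once the canary revision looks healthy, traffic can be shifted fully to the latest revision. A sketch using kubectl, reusing the resource name and namespace from the example above (in the v1beta1 API, canaryTrafficPercent lives on the predictor component):

```shell
# Inspect the current traffic split reported in the status
kubectl get inferenceservice recommendation-engine -n ml-serving -o yaml

# Route 100% of traffic to the latest revision, completing the rollout
kubectl patch inferenceservice recommendation-engine -n ml-serving \
  --type merge -p '{"spec":{"predictor":{"canaryTrafficPercent":100}}}'
```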

Sending a Prediction Request

# REST prediction request using the Open Inference Protocol V2
curl -X POST \
  "https://fraud-detector.ml-serving.example.com/v2/models/fraud-detector/infer" \
  -H "Content-Type: application/json" \
  -d '{
    "inputs": [
      {
        "name": "input-0",
        "shape": [1, 30],
        "datatype": "FP32",
        "data": [0.1, 0.2, 0.3, ...]
      }
    ]
  }'
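Beyond the infer verb, the Open Inference Protocol V2 also defines per-model readiness and metadata endpoints, which are useful for smoke-testing a deployment. A sketch reusing the illustrative hostname from the example above:

```shell
# Per-model readiness probe (returns success when the model can serve)
curl "https://fraud-detector.ml-serving.example.com/v2/models/fraud-detector/ready"

# Model metadata: name, versions, and input/output tensor specs
curl "https://fraud-detector.ml-serving.example.com/v2/models/fraud-detector"
```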

Related Pages

Implements Principle

Requires Environment
