Implementation: Kubeflow KServe InferenceService CRD
| Knowledge Sources | |
|---|---|
| Domains | MLOps, Model Serving, Kubernetes |
| Last Updated | 2026-02-13 00:00 GMT |
Overview
A KServe-provided Kubernetes custom resource for deploying trained models as auto-scaling inference endpoints.
Description
The InferenceService CRD is the Kubernetes-native resource through which KServe manages model serving deployments. When an InferenceService resource is created, the KServe controller provisions the serving infrastructure, including the predictor pod (running the selected serving runtime), optional transformer and explainer components (deployed as separate services), Knative-based autoscaling (or a Kubernetes HPA in raw deployment mode), and Istio-based traffic routing.
The InferenceService supports multiple serving runtimes out of the box: TorchServe (PyTorch), TensorFlow Serving (TensorFlow/Keras), Triton Inference Server (multi-framework), MLServer (scikit-learn, XGBoost, LightGBM), HuggingFace (Transformers models), and custom container runtimes. All runtimes expose standardized REST and gRPC endpoints conforming to the Open Inference Protocol (V2).
Canary deployments are supported natively through the canaryTrafficPercent field, allowing gradual traffic migration between model versions. The CRD also supports model explainability via integrated explainer components (ART Explainer, Alibi Explainer).
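Because every runtime exposes the Open Inference Protocol (V2), request bodies share a common tensor envelope regardless of framework. A minimal sketch of building such a request body in Python; the tensor name and shape are illustrative placeholders, not values mandated by the protocol:

```python
import json

def build_infer_request(tensor_name, shape, datatype, data):
    """Build an Open Inference Protocol V2 inference request body.

    The tensor name and shape here are placeholders; real values depend
    on the deployed model's input signature.
    """
    return {
        "inputs": [
            {
                "name": tensor_name,
                "shape": shape,
                "datatype": datatype,  # e.g. FP32, INT64, BYTES
                "data": data,
            }
        ]
    }

# Serialize for a POST to /v2/models/<model-name>/infer
body = json.dumps(build_infer_request("input-0", [1, 3], "FP32", [0.1, 0.2, 0.3]))
```

The same body works against any of the runtimes listed above, which is what makes swapping serving runtimes behind an InferenceService transparent to clients.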
Usage
Use the InferenceService CRD when:
- A trained model must be deployed for real-time REST or gRPC inference.
- Auto-scaling (including scale-to-zero) is required for cost-efficient serving.
- Canary or blue-green deployment is needed for safe model version rollouts.
- Pre-processing or post-processing transformations must be co-deployed with the model.
- Model explainability must be available alongside prediction responses.
- The serving deployment must integrate with Kubeflow Model Registry for model versioning.
Code Reference
Source Location
- Repository: kserve/kserve
- File: config/crd/serving.kserve.io_inferenceservices.yaml (CRD schema)
Signature
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: <service-name>
  namespace: <namespace>
spec:
  predictor:
    model:
      modelFormat:
        name: <pytorch|tensorflow|sklearn|xgboost|onnx|triton|huggingface>
      storageUri: <model-storage-uri>
      resources:
        requests:
          cpu: "<cpu>"
          memory: "<memory>"
          nvidia.com/gpu: "<gpu-count>"
    minReplicas: <min-replicas>
    maxReplicas: <max-replicas>
    scaleTarget: <target-concurrency>
    canaryTrafficPercent: <0-100>
  transformer:
    containers:
    - name: <transformer-name>
      image: <transformer-image>
  explainer:
    alibi:
      type: <AnchorTabular|AnchorImages|AnchorText|...>
      storageUri: <explainer-model-uri>
Import
# Install KServe
kubectl apply -f https://github.com/kserve/kserve/releases/download/v0.14.0/kserve.yaml
kubectl apply -f https://github.com/kserve/kserve/releases/download/v0.14.0/kserve-cluster-resources.yaml
# Deploy an InferenceService
kubectl apply -f inferenceservice.yaml
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| metadata.name | string | Yes | Name of the InferenceService resource |
| metadata.namespace | string | Yes | Kubernetes namespace for the serving deployment |
| spec.predictor.model.modelFormat.name | string | Yes | Model framework format (pytorch, tensorflow, sklearn, etc.) |
| spec.predictor.model.storageUri | string | Yes | Storage URI pointing to the model artifacts |
| spec.predictor.model.resources | object | Yes | CPU, memory, and GPU resource requests and limits |
| spec.predictor.minReplicas | integer | No | Minimum number of replicas (0 enables scale-to-zero) |
| spec.predictor.maxReplicas | integer | No | Maximum number of replicas for autoscaling |
| spec.transformer | object | No | Pre/post-processing transformer container configuration |
| spec.explainer | object | No | Model explainability component configuration |
| spec.predictor.canaryTrafficPercent | integer | No | Percentage of traffic routed to the canary (latest) revision |
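The required fields in the table above can be checked mechanically before a manifest is applied. A hedged sketch, assuming the manifest has already been loaded into a plain Python dict (e.g. by a YAML parser); the helper name is illustrative, not part of KServe:

```python
def check_required_fields(manifest):
    """Return the required InferenceService field paths (per the
    I/O contract above) that are missing from a manifest dict."""
    required = [
        ("metadata", "name"),
        ("metadata", "namespace"),
        ("spec", "predictor", "model", "modelFormat", "name"),
        ("spec", "predictor", "model", "storageUri"),
        ("spec", "predictor", "model", "resources"),
    ]
    missing = []
    for path in required:
        node = manifest
        for key in path:
            if not isinstance(node, dict) or key not in node:
                missing.append(".".join(path))
                break
            node = node[key]
    return missing
```

A manifest missing its storage URI, for example, would be reported as `spec.predictor.model.storageUri`.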
Outputs
| Name | Type | Description |
|---|---|---|
| Inference endpoint URL | string | REST and gRPC endpoint URL for sending prediction requests |
| Prediction response | JSON | Model prediction output conforming to Open Inference Protocol |
| Auto-scaling deployment | Knative Service or Deployment | Managed replicas that scale based on request concurrency |
| Traffic routing | Istio VirtualService | Traffic split configuration between model versions |
| InferenceService status | InferenceService.status | Ready condition, URL, traffic split, and component statuses |
Usage Examples
Basic Usage
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: fraud-detector
  namespace: ml-serving
spec:
  predictor:
    model:
      modelFormat:
        name: pytorch
      storageUri: "s3://ml-models/fraud-detector/v2/model.pt"
      resources:
        requests:
          cpu: "2"
          memory: "4Gi"
        limits:
          cpu: "4"
          memory: "8Gi"
    minReplicas: 1
    maxReplicas: 10
    scaleTarget: 5
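The resources block uses Kubernetes quantity strings, so it can be worth sanity-checking that requests do not exceed limits before applying a manifest. A minimal sketch that handles only the notations used in this document (plain CPU counts and binary memory suffixes), not the full Kubernetes quantity grammar:

```python
# Binary suffixes as defined for Kubernetes resource quantities.
_BINARY = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}

def parse_quantity(q):
    """Parse a subset of Kubernetes quantities ("2", "4Gi") to a number."""
    for suffix, factor in _BINARY.items():
        if q.endswith(suffix):
            return float(q[:-len(suffix)]) * factor
    return float(q)

def requests_within_limits(resources):
    """True if every request is <= the matching limit (when a limit is set)."""
    limits = resources.get("limits", {})
    return all(
        parse_quantity(v) <= parse_quantity(limits[k])
        for k, v in resources.get("requests", {}).items()
        if k in limits
    )
```

For the basic example above, `requests_within_limits` confirms that the 2 CPU / 4Gi requests fit inside the 4 CPU / 8Gi limits.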
Canary Deployment with Transformer
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: recommendation-engine
  namespace: ml-serving
spec:
  predictor:
    canaryTrafficPercent: 20
    model:
      modelFormat:
        name: tensorflow
      storageUri: "gs://ml-models/recommendation/v3"
      resources:
        requests:
          cpu: "4"
          memory: "8Gi"
          nvidia.com/gpu: "1"
    minReplicas: 2
    maxReplicas: 20
  transformer:
    containers:
    - name: feature-transformer
      image: my-registry/feature-transformer:v2
      resources:
        requests:
          cpu: "1"
          memory: "2Gi"
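With canaryTrafficPercent: 20, roughly one request in five reaches the canary revision while the rest stay on the stable one. A seeded simulation of that split for intuition only; the real split is enforced by the Knative/Istio routing layer, not application code:

```python
import random

def route(canary_percent, rng):
    """Pick 'canary' with canary_percent probability, else 'stable'."""
    return "canary" if rng.random() * 100 < canary_percent else "stable"

rng = random.Random(42)  # fixed seed for a reproducible illustration
counts = {"canary": 0, "stable": 0}
for _ in range(10_000):
    counts[route(20, rng)] += 1
# counts['canary'] lands near 2000 of the 10000 simulated requests
```

Raising canaryTrafficPercent in steps (e.g. 20 → 50 → 100) while watching the canary's error rate is the usual progressive-rollout pattern.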
Sending a Prediction Request
# REST prediction request using the Open Inference Protocol V2
curl -X POST \
"https://fraud-detector.ml-serving.example.com/v2/models/fraud-detector/infer" \
-H "Content-Type: application/json" \
-d '{
"inputs": [
{
"name": "input-0",
"shape": [1, 30],
"datatype": "FP32",
"data": [0.1, 0.2, 0.3, ...]
}
]
}'
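The prediction response comes back in the same Open Inference Protocol V2 envelope, with an outputs array mirroring the inputs. A sketch of extracting an output tensor from such a response; the envelope fields follow the protocol, while the tensor name and values are made-up sample data:

```python
import json

def extract_output(response_body, name=None):
    """Return the 'data' list of the named output tensor (or the first one)
    from an Open Inference Protocol V2 response body."""
    outputs = json.loads(response_body)["outputs"]
    if name is None:
        return outputs[0]["data"]
    return next(o["data"] for o in outputs if o["name"] == name)

# Example response in the V2 envelope (values are illustrative)
sample = json.dumps({
    "model_name": "fraud-detector",
    "outputs": [
        {"name": "output-0", "shape": [1, 2], "datatype": "FP32",
         "data": [0.97, 0.03]},
    ],
})
scores = extract_output(sample, "output-0")
```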