Workflow:Kserve Kserve Canary Rollout Deployment

Knowledge Sources	KServe, KServe Website Canary Rollout Guide
Domains	ML_Serving, Kubernetes, MLOps, Traffic_Management
Last Updated	2026-02-13 14:00 GMT

Overview

End-to-end process for safely rolling out a new model version using canary traffic splitting, validation, and promotion on KServe.

Description

This workflow covers the progressive rollout of a new model version alongside an existing stable version. KServe tracks the last revision that was rolled out with 100% of traffic and splits traffic between that revision and the new canary revision according to a configurable percentage. The process consists of deploying the initial model, updating it with a canary percentage, validating the new model under partial traffic, and then promoting or rolling back based on observed behavior. Tag-based routing lets you send requests to a specific revision for explicit testing.
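The revision bookkeeping described above can be pictured with a small toy model. This is a simplified sketch, not KServe's controller: it assumes a new revision is created only when the model's storage URI changes (in a real cluster any change to the predictor spec produces one), and the revision names rev-1, rev-2 and the gs://models/... URIs are hypothetical.

```python
from typing import Dict, Optional


class ToyCanaryTracker:
    """Toy model of KServe's canary bookkeeping (not the real controller)."""

    def __init__(self) -> None:
        self.generation = 0
        self.stable: Optional[str] = None       # last revision rolled out at 100%
        self.latest: Optional[str] = None       # most recently created revision
        self.storage_uri: Optional[str] = None

    def apply(self, storage_uri: str,
              canary_percent: Optional[int] = None) -> Dict[str, int]:
        """Apply a spec and return the resulting traffic split per revision."""
        if storage_uri != self.storage_uri:
            # Simplification: only a new model location creates a new revision.
            self.generation += 1
            self.latest = f"rev-{self.generation}"
            self.storage_uri = storage_uri
        if canary_percent is None or self.stable is None:
            # No canary field, or nothing to fall back to: the latest revision
            # takes all traffic and becomes the new stable revision.
            self.stable = self.latest
            return {self.latest: 100}
        if self.latest == self.stable:
            return {self.stable: 100}
        return {self.latest: canary_percent, self.stable: 100 - canary_percent}


tracker = ToyCanaryTracker()
print(tracker.apply("gs://models/v1"))                     # {'rev-1': 100}
print(tracker.apply("gs://models/v2", canary_percent=10))  # {'rev-2': 10, 'rev-1': 90}
print(tracker.apply("gs://models/v2", canary_percent=0))   # {'rev-2': 0, 'rev-1': 100}
print(tracker.apply("gs://models/v2"))                     # {'rev-2': 100}
```

The last two calls mirror rollback (percentage set to 0) and promotion (field removed).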

Usage

Execute this workflow when you need to update a production InferenceService with a new model version while minimizing risk. Use canary rollout when you want to gradually shift traffic from the current model to a new version, validate performance under real traffic, and retain the ability to instantly roll back if issues are detected.

Execution Steps

Step 1: Deploy the initial model version

Create and apply an InferenceService that serves the initial model version. This establishes the baseline revision, which receives 100% of traffic; KServe records it as the stable revision.

Key considerations:

The initial deployment must reach Ready state before proceeding
KServe automatically tracks this as the "last rolled out" revision
Verify predictions are correct before proceeding to canary updates
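A minimal sketch of the Step 1 manifest, built as a Python dict and printed as JSON (which kubectl apply accepts as well as YAML). It assumes a scikit-learn model; the name sklearn-iris, the namespace kserve-test, and the storage URI follow KServe's public sklearn example and should be replaced with your own values. Note that the predictor carries no canaryTrafficPercent field yet.

```python
import json
from typing import Any, Dict


def initial_isvc(name: str, namespace: str, storage_uri: str,
                 model_format: str = "sklearn") -> Dict[str, Any]:
    """Build a minimal v1beta1 InferenceService manifest with no canary field."""
    return {
        "apiVersion": "serving.kserve.io/v1beta1",
        "kind": "InferenceService",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "predictor": {
                "model": {
                    "modelFormat": {"name": model_format},
                    "storageUri": storage_uri,
                }
            }
        },
    }


manifest = initial_isvc("sklearn-iris", "kserve-test",
                        "gs://kfserving-examples/models/sklearn/1.0/model")
print(json.dumps(manifest, indent=2))
```

Saving the output as isvc.json, running kubectl apply -f isvc.json, and then waiting with kubectl wait --for=condition=Ready isvc/sklearn-iris -n kserve-test is one way to confirm the Ready state before moving on.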

Step 2: Update InferenceService with canary traffic split

Modify the InferenceService spec so that the predictor points to the new model version, and set spec.predictor.canaryTrafficPercent to the desired initial share of traffic as an integer percentage (e.g., 10). Apply the updated manifest. KServe creates a new revision and splits traffic between the previous stable revision and the new canary revision.

Key considerations:

Set canaryTrafficPercent to a small value initially (5-20%)
The storageUri changes to point to the new model artifacts
KServe automatically manages two revisions and their traffic allocation
Both old and new model pods run simultaneously during canary
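The Step 2 update can be sketched as a pure function over the manifest: copy it, point storageUri at the new artifacts, and add canaryTrafficPercent. This is an illustrative helper (with_canary is a hypothetical name), and the model-2 URI is taken from KServe's public sklearn example as a stand-in for your own artifacts.

```python
import copy
from typing import Any, Dict


def with_canary(manifest: Dict[str, Any], new_storage_uri: str,
                percent: int) -> Dict[str, Any]:
    """Return a copy of the manifest that serves new model artifacts and sends
    `percent`% of traffic to the resulting canary revision."""
    if not 0 <= percent <= 100:
        raise ValueError("canaryTrafficPercent must be between 0 and 100")
    updated = copy.deepcopy(manifest)          # keep the stable manifest intact
    predictor = updated["spec"]["predictor"]
    predictor["model"]["storageUri"] = new_storage_uri
    predictor["canaryTrafficPercent"] = percent
    return updated


stable = {
    "apiVersion": "serving.kserve.io/v1beta1",
    "kind": "InferenceService",
    "metadata": {"name": "sklearn-iris", "namespace": "kserve-test"},
    "spec": {"predictor": {"model": {
        "modelFormat": {"name": "sklearn"},
        "storageUri": "gs://kfserving-examples/models/sklearn/1.0/model",
    }}},
}
canary = with_canary(stable,
                     "gs://kfserving-examples/models/sklearn/1.0/model-2", 10)
print(canary["spec"]["predictor"])
```

Working on a copy keeps the stable manifest around, which is convenient if you later want to compare or reapply it.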

Step 3: Validate canary model under partial traffic

Monitor the canary revision for correctness, latency, and error rate. Send test requests and observe that traffic is split according to the configured percentage. Use tag-based routing to send requests explicitly to the canary or previous revision for focused testing.

What happens:

Repeated requests are distributed between revisions based on weight
Tag-based routing is enabled by setting the annotation serving.kserve.io/enable-tag-routing: "true"
Latest revision accessible at latest-{service}-predictor-default.{namespace}
Previous revision accessible at prev-{service}-predictor-default.{namespace}
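To build intuition for what "distributed based on weight" looks like over many requests, here is a small simulation, plus a helper that assembles the tagged hostnames from the pattern listed above. Both are sketches: random weighted choice only approximates the ingress's behavior, and the example.com domain suffix is an assumption (it is the Knative default; clusters often configure their own).

```python
import random
from collections import Counter
from typing import Dict


def simulate_split(weights: Dict[str, int], n_requests: int,
                   seed: int = 0) -> Counter:
    """Pick a revision for each request in proportion to its traffic weight."""
    rng = random.Random(seed)            # seeded so the run is reproducible
    names = list(weights)
    picks = rng.choices(names, weights=[weights[n] for n in names], k=n_requests)
    return Counter(picks)


def tag_hosts(service: str, namespace: str,
              domain: str = "example.com") -> Dict[str, str]:
    """Hostnames for tag-based routing (domain suffix is an assumption)."""
    base = f"{service}-predictor-default.{namespace}.{domain}"
    return {"latest": f"latest-{base}", "prev": f"prev-{base}"}


counts = simulate_split({"canary": 10, "stable": 90}, n_requests=1000)
print(counts)  # roughly 100 canary / 900 stable
print(tag_hosts("sklearn-iris", "kserve-test")["latest"])
# latest-sklearn-iris-predictor-default.kserve-test.example.com
```

With tag routing enabled, sending a request through the ingress with its Host header set to the latest- hostname exercises only the canary, while the prev- hostname exercises only the previous revision.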

Step 4: Gradually increase canary traffic

If the canary model performs well, incrementally increase the canaryTrafficPercent value (e.g., 10% to 50% to 100%). Apply the updated manifest at each increment and monitor for regressions. This provides controlled exposure of the new model to production traffic.

Key considerations:

Increase traffic gradually so that issues that only appear at higher load surface before the canary takes all traffic
Monitor error rates and latency at each increment
Keep the rollback option available at every stage
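The increment-and-check loop above can be sketched as a gate that walks through a schedule of percentages and stops at the first unhealthy step. Everything here is hypothetical: the threshold values, the Metrics fields, and the observe callback, which in practice would apply the new canaryTrafficPercent, wait, and query whatever monitoring system you use.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional, Tuple


@dataclass
class Metrics:
    error_rate: float        # fraction of failed requests, e.g. 0.01 = 1%
    p99_latency_ms: float    # 99th-percentile latency in milliseconds


@dataclass
class Thresholds:
    max_error_rate: float = 0.01        # hypothetical budget
    max_p99_latency_ms: float = 500.0   # hypothetical budget


def staged_rollout(steps: Iterable[int],
                   observe: Callable[[int], Metrics],
                   limits: Optional[Thresholds] = None) -> Tuple[str, List[int]]:
    """Walk through increasing canary percentages and stop at the first step
    whose metrics exceed the limits. Returns the decision and the steps applied."""
    limits = limits or Thresholds()
    applied: List[int] = []
    for pct in steps:
        # In a real rollout: set canaryTrafficPercent to pct, apply, then wait.
        applied.append(pct)
        m = observe(pct)
        if (m.error_rate > limits.max_error_rate
                or m.p99_latency_ms > limits.max_p99_latency_ms):
            return "rollback", applied
    return "promote", applied


def fake_observe(pct: int) -> Metrics:
    # Hypothetical metrics: errors jump once the canary takes half the traffic.
    return Metrics(error_rate=0.002 if pct < 50 else 0.03, p99_latency_ms=120.0)


print(staged_rollout([10, 50, 100], fake_observe))  # ('rollback', [10, 50])
```

Returning the applied steps alongside the decision records how far the canary got before it was stopped.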

Step 5: Promote or rollback

If the canary model is validated, promote it by removing the canaryTrafficPercent field entirely. This directs 100% traffic to the new revision, which becomes the new stable revision. The old revision pods automatically scale to zero. If issues are detected, set canaryTrafficPercent to 0 to pin all traffic to the previous stable revision, effectively rolling back.

What happens on promotion:

Remove canaryTrafficPercent from the spec
All traffic shifts to the new revision
Old revision pods scale down to zero automatically

What happens on rollback:

Set canaryTrafficPercent to 0
All traffic reverts to the previous stable revision
The new revision pods scale down
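Promotion and rollback are both small edits to the predictor spec, sketched below as manifest transforms (promote and rollback are illustrative helper names, not KServe APIs). The sample manifest reuses the storage URI from KServe's public sklearn example as an assumption.

```python
import copy
from typing import Any, Dict


def promote(manifest: Dict[str, Any]) -> Dict[str, Any]:
    """Drop canaryTrafficPercent so the latest revision receives all traffic."""
    updated = copy.deepcopy(manifest)
    updated["spec"]["predictor"].pop("canaryTrafficPercent", None)
    return updated


def rollback(manifest: Dict[str, Any]) -> Dict[str, Any]:
    """Set canaryTrafficPercent to 0, pinning traffic to the previous revision."""
    updated = copy.deepcopy(manifest)
    updated["spec"]["predictor"]["canaryTrafficPercent"] = 0
    return updated


canary = {
    "apiVersion": "serving.kserve.io/v1beta1",
    "kind": "InferenceService",
    "metadata": {"name": "sklearn-iris", "namespace": "kserve-test"},
    "spec": {"predictor": {
        "canaryTrafficPercent": 50,
        "model": {"modelFormat": {"name": "sklearn"},
                  "storageUri": "gs://kfserving-examples/models/sklearn/1.0/model-2"},
    }},
}
print(promote(canary)["spec"]["predictor"])
print(rollback(canary)["spec"]["predictor"])
```

The same rollback can be done in place with a JSON merge patch, for example kubectl patch isvc sklearn-iris -n kserve-test --type merge -p '{"spec":{"predictor":{"canaryTrafficPercent":0}}}'.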
