Heuristic:Kubeflow Kubeflow Sequential Infrastructure Deployment
| Knowledge Sources | |
|---|---|
| Domains | Platform_Deployment, Infrastructure, Kubernetes |
| Last Updated | 2026-02-13 00:00 GMT |
Overview
Deploy Kubeflow infrastructure services in strict order (cert-manager, then Istio, then Dex) with health checks between each phase.
Description
Kubeflow infrastructure services have hard dependencies on each other. cert-manager must be fully ready before Istio can use it for certificate issuance. Istio must have its CRDs registered and its control plane running before Dex can create VirtualService resources. Dex must be running before any Kubeflow application component that requires OIDC authentication is deployed. Deploying these services out of order, or without waiting for readiness, causes intermittent failures that are difficult to diagnose.
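The CRD dependency can be checked explicitly rather than assumed. The following is a minimal sketch, assuming the standard Istio CRD name `virtualservices.networking.istio.io`; `wait_for_crds` is a hypothetical helper and is not part of the Kubeflow manifests. It relies on `kubectl wait --for=condition=established`, which blocks until the API server reports the CRD as established.

```bash
# Hypothetical helper (not part of the Kubeflow manifests).
# Blocks until every named CRD reports the Established condition.
# Usage: wait_for_crds <timeout-seconds> <crd-name>...
wait_for_crds() {
  local crd_timeout="$1"
  shift
  local crd
  for crd in "$@"; do
    kubectl wait --for=condition=established "crd/${crd}" \
      --timeout="${crd_timeout}s" || return 1
  done
}

# Example call, to run after the Istio CRDs are applied and before Dex:
#   wait_for_crds 60 virtualservices.networking.istio.io
```

The 60-second value in the example is an arbitrary illustration; CRDs usually become established quickly, so a short timeout is typically enough.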
Usage
Use this heuristic whenever deploying the Kubeflow AI Reference Platform from manifests. It applies to both fresh installations and upgrades. The most common deployment failure in Kubeflow is applying all manifests at once without waiting for dependencies, resulting in webhook errors, missing CRDs, or OIDC misconfiguration.
The Insight (Rule of Thumb)
- Action: Deploy in strict phases: (1) cert-manager → wait → (2) Istio CRDs → Istio namespace → Istio control plane → wait → (3) Dex → wait → (4) Kubeflow components.
- Value: Use `kubectl wait --for=condition=ready` between phases with appropriate timeouts (180-300 seconds).
- Trade-off: Sequential deployment is slower than parallel apply but eliminates race conditions and retry loops.
- Anti-pattern: Do not use a single `kustomize build . | kubectl apply -f -` for the entire manifests directory. This submits all resources in one pass, without waiting for any of them to become ready, and causes dependency failures.
Reasoning
Kubernetes controllers process resources asynchronously. When a resource references a CRD that hasn't been registered yet, or when a webhook references a pod that isn't ready, the API server returns errors. These errors are transient and may eventually resolve with retries, but they create an unreliable deployment experience and can leave the cluster in a partially-deployed state.
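The same asynchrony can affect the readiness check itself: when `kubectl wait` is given a label selector and no matching pod has been created yet, it can exit immediately with a "no matching resources found" error instead of waiting. A minimal sketch of one way to guard against this follows. `wait_for_pods` and the `POLL_INTERVAL` variable are hypothetical names introduced for illustration, not part of kubectl or the Kubeflow manifests.

```bash
# Hypothetical helper (not part of the Kubeflow manifests).
# First polls until at least one pod matches the selector, then hands the
# remaining time budget to `kubectl wait` for the readiness condition.
# Usage: wait_for_pods <namespace> <label-selector> <timeout-seconds>
wait_for_pods() {
  local ns="$1" selector="$2" timeout="$3"
  local deadline remaining
  deadline=$(( $(date +%s) + timeout ))

  # Poll until the controller has actually created a matching pod.
  until [ -n "$(kubectl get pods -n "$ns" -l "$selector" -o name 2>/dev/null)" ]; do
    if [ "$(date +%s)" -ge "$deadline" ]; then
      echo "timed out waiting for pods (${selector}) in ${ns} to be created" >&2
      return 1
    fi
    sleep "${POLL_INTERVAL:-2}"   # assumed default of 2 seconds
  done

  # Spend whatever is left of the budget waiting for readiness.
  remaining=$(( deadline - $(date +%s) ))
  [ "$remaining" -gt 0 ] || remaining=1
  kubectl wait --for=condition=ready pod -l "$selector" -n "$ns" \
    --timeout="${remaining}s"
}

# Example: wait_for_pods istio-system app=istiod 300
```

Splitting the budget this way keeps the total wait bounded by the single timeout passed in, rather than adding the polling time on top of it.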
The sequential approach ensures each prerequisite is fully operational before its dependents are created:
```bash
# Phase 1: cert-manager (no dependencies)
kustomize build common/cert-manager/cert-manager/base | kubectl apply -f -
kubectl wait --for=condition=ready pod -l app=cert-manager -n cert-manager --timeout=180s
kubectl wait --for=condition=ready pod -l app=webhook -n cert-manager --timeout=180s

# Phase 2: Istio (depends on cert-manager for TLS)
kustomize build common/istio-1-22/istio-crds/base | kubectl apply -f -
kustomize build common/istio-1-22/istio-namespace/base | kubectl apply -f -
kustomize build common/istio-1-22/istio-install/overlays/helm | kubectl apply -f -
kubectl wait --for=condition=ready pod -l app=istiod -n istio-system --timeout=300s

# Phase 3: Dex (depends on Istio for VirtualService)
kustomize build common/dex/overlays/istio | kubectl apply -f -
kubectl wait --for=condition=ready pod -l app=dex -n auth --timeout=180s
```
The 300-second timeout for Istio is intentionally longer than the 180-second timeout for cert-manager and Dex because the Istio control plane has more initialization work to do, including CRD registration and webhook setup.
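Unless the shell runs with `set -e`, a script like the one above moves on to the next command even when a `kubectl wait` times out, so a failed phase does not stop its dependents from being applied. A minimal sketch of a fail-fast wrapper follows; `run_phase` is a hypothetical helper and the phase names are illustrative.

```bash
# Hypothetical helper (not part of the Kubeflow manifests).
# Runs one command as a named phase and returns non-zero on failure,
# so that phases chained with && stop at the first failed step.
# Usage: run_phase <name> <command> [args...]
run_phase() {
  local name="$1"
  shift
  echo "==> ${name}"
  if ! "$@"; then
    echo "phase '${name}' failed; skipping dependent phases" >&2
    return 1
  fi
}

# Example chain using the timeouts from the script above:
#   run_phase "cert-manager ready" \
#     kubectl wait --for=condition=ready pod -l app=cert-manager -n cert-manager --timeout=180s &&
#   run_phase "istiod ready" \
#     kubectl wait --for=condition=ready pod -l app=istiod -n istio-system --timeout=300s
```

Chaining with `&&` keeps the ordering visible in the script itself and leaves the cluster at the last successful phase, which makes a failed run easier to resume.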