Heuristic:Kubeflow Kubeflow Sequential Infrastructure Deployment

From Leeroopedia
Knowledge Sources
Domains Platform_Deployment, Infrastructure, Kubernetes
Last Updated 2026-02-13 00:00 GMT

Overview

Deploy Kubeflow infrastructure services in strict order (cert-manager, then Istio, then Dex) with health checks between each phase.

Description

Kubeflow infrastructure services have hard dependencies on each other. cert-manager must be fully ready before Istio can use it for certificate issuance. Istio must have its CRDs registered and its control plane running before Dex can create VirtualService resources. Dex must be running before any Kubeflow application component that requires OIDC authentication is deployed. Deploying these services out of order, or without waiting for readiness, causes intermittent failures that are difficult to diagnose.

Usage

Use this heuristic whenever deploying the Kubeflow AI Reference Platform from manifests. It applies to both fresh installations and upgrades. The most common deployment failure in Kubeflow is applying all manifests at once without waiting for dependencies, resulting in webhook errors, missing CRDs, or OIDC misconfiguration.

The Insight (Rule of Thumb)

  • Action: Deploy in strict phases: (1) cert-manager → wait → (2) Istio CRDs → Istio namespace → Istio control plane → wait → (3) Dex → wait → (4) Kubeflow components.
  • Value: Use kubectl wait --for=condition=ready after each phase, with a timeout suited to the component (180-300 seconds).
  • Trade-off: Sequential deployment is slower than parallel apply but eliminates race conditions and retry loops.
  • Anti-pattern: Do not run a single kustomize build . | kubectl apply -f - against the entire manifests directory. That submits every resource in one pass without waiting for any of them to become ready, which causes dependency failures.

Reasoning

Kubernetes controllers process resources asynchronously. When a resource references a CRD that hasn't been registered yet, or when a webhook references a pod that isn't ready, the API server returns errors. These errors are transient and may eventually resolve with retries, but they create an unreliable deployment experience and can leave the cluster in a partially-deployed state.
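
One way to close the CRD race described above is to wait until each CRD reports the Established condition before applying resources of that kind. The following is a minimal sketch, assuming kubectl is on the PATH and pointed at the target cluster. wait_for_crds is a hypothetical helper name, not part of the Kubeflow manifests, and the 60s timeout in the example call is an arbitrary choice.

```shell
# Hypothetical helper: block until each named CRD is Established.
# Usage: wait_for_crds <timeout> <crd-name> [<crd-name> ...]
wait_for_crds() {
  crd_timeout="$1"
  shift
  for crd in "$@"; do
    # Fails (non-zero) if the CRD is missing or not Established in time.
    kubectl wait --for=condition=established "crd/${crd}" \
      --timeout="${crd_timeout}" || return 1
  done
}

# Example call, run after applying the Istio CRDs and before Dex:
# wait_for_crds 60s virtualservices.networking.istio.io
```

The function stops at the first CRD that fails the check, so a calling script can abort the deployment instead of continuing into a phase that would hit missing-kind errors.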

The sequential approach ensures each prerequisite is fully operational before its dependents are created:

# Phase 1: cert-manager (no dependencies)
kustomize build common/cert-manager/cert-manager/base | kubectl apply -f -
kubectl wait --for=condition=ready pod -l app=cert-manager -n cert-manager --timeout=180s
kubectl wait --for=condition=ready pod -l app=webhook -n cert-manager --timeout=180s

# Phase 2: Istio (depends on cert-manager for TLS)
kustomize build common/istio-1-22/istio-crds/base | kubectl apply -f -
kustomize build common/istio-1-22/istio-namespace/base | kubectl apply -f -
kustomize build common/istio-1-22/istio-install/overlays/helm | kubectl apply -f -
kubectl wait --for=condition=ready pod -l app=istiod -n istio-system --timeout=300s

# Phase 3: Dex (depends on Istio for VirtualService)
kustomize build common/dex/overlays/istio | kubectl apply -f -
kubectl wait --for=condition=ready pod -l app=dex -n auth --timeout=180s

The 300-second timeout for Istio is intentionally longer than the 180-second timeout used for cert-manager and Dex, because the Istio control plane has more initialization work to do, including CRD registration and webhook setup.
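
The apply-then-wait pattern in each phase can be factored into one function so that the per-phase timeout is an explicit argument. This is a sketch under stated assumptions, not the project's own installer: apply_and_wait, WAIT_ATTEMPTS, and RETRY_DELAY are hypothetical names introduced here. The retry loop exists because kubectl wait returns an error right away when no pod matches the selector yet, which can happen in the first seconds after an apply.

```shell
# Hypothetical helper: apply one kustomization, then wait for its pods.
# Usage: apply_and_wait <kustomize-dir> <namespace> <label-selector> <timeout>
apply_and_wait() {
  dir="$1"; ns="$2"; selector="$3"; timeout="$4"

  # Render and apply the manifests for this phase only.
  kustomize build "$dir" | kubectl apply -f - || return 1

  # Retry the wait a few times (the defaults are assumptions) to cover
  # the window before the pods exist. Worst case is roughly
  # WAIT_ATTEMPTS x (timeout + RETRY_DELAY).
  attempt=1
  while [ "$attempt" -le "${WAIT_ATTEMPTS:-3}" ]; do
    if kubectl wait --for=condition=ready pod -l "$selector" \
         -n "$ns" --timeout="$timeout"; then
      return 0
    fi
    attempt=$((attempt + 1))
    sleep "${RETRY_DELAY:-5}"
  done

  echo "phase failed: $dir" >&2
  return 1
}

# Example, mirroring Phase 3 above:
# apply_and_wait common/dex/overlays/istio auth app=dex 180s
```

Chaining calls with && (or running the script under set -e) keeps the strict ordering: a later phase is never applied if an earlier one did not become ready.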
