Principle:Kubeflow Kubeflow Post Deployment Verification
| Knowledge Sources | |
|---|---|
| Domains | Kubeflow, Platform Deployment, Verification, Operations |
| Last Updated | 2026-02-13 00:00 GMT |
Overview
Post-deployment verification is the systematic process of confirming that all Kubeflow components, infrastructure services, and user-facing endpoints are healthy and functioning correctly after installation.
Description
After completing all deployment steps (prerequisites validation, installation method selection, core infrastructure deployment, component deployment, and multi-user configuration), operators must verify that the entire platform works end to end. This goes beyond checking pod status: it means validating that services are routable, authentication flows work, and ML workloads can be submitted.
Post-deployment verification catches issues that may not surface during individual component deployment, such as:
- Misconfigured Istio VirtualService routing preventing Dashboard access
- Dex OIDC misconfiguration causing authentication failures
- Missing RBAC permissions in Profile namespaces blocking workload submission
- cert-manager certificate issuance failures affecting TLS termination
- Resource exhaustion on cluster nodes preventing some pods from being scheduled
This principle applies to both initial deployments and upgrades. After any change to the Kubeflow platform, a full verification pass should be performed to confirm nothing has regressed.
Usage
Perform post-deployment verification in the following scenarios:
- After completing a fresh Kubeflow installation
- After upgrading any Kubeflow component or infrastructure service
- After modifying cluster-level configuration (network policies, resource quotas, node pools)
- As part of a regular operational health check cadence
- When users report issues accessing the platform or submitting workloads
Theoretical Basis
The verification process follows a layered approach, checking from infrastructure up to user experience:
Layer 1: Pod Health Check
- Query all pods across all Kubeflow-related namespaces (kubeflow, istio-system, cert-manager, auth, knative-serving, user profile namespaces)
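- Verify every pod is either Running with all containers ready, or Completed (the status kubectl shows for pods in the Succeeded phase, such as finished Jobs)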
- Investigate any pod in CrashLoopBackOff, Error, Pending, or ImagePullBackOff state
- Check pod restart counts; high restart counts indicate instability even if the pod is currently Running
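The pod health checks above can be sketched as a small filter over the parsed output of `kubectl get pods -A -o json`. This is a minimal illustration, not an official Kubeflow tool: the function name `find_unhealthy_pods`, the set of flagged waiting reasons, and the restart threshold of 5 are assumptions chosen for the example.

```python
# Waiting reasons treated as failures in this sketch (an assumed, non-exhaustive set).
BAD_WAITING_REASONS = {"CrashLoopBackOff", "ImagePullBackOff", "ErrImagePull"}
RESTART_THRESHOLD = 5  # assumed cutoff; tune for your cluster


def find_unhealthy_pods(pod_list, restart_threshold=RESTART_THRESHOLD):
    """Return (namespace, name, reason) tuples for pods that need investigation.

    pod_list is the dict parsed from `kubectl get pods -A -o json`.
    """
    problems = []
    for pod in pod_list.get("items", []):
        meta = pod.get("metadata", {})
        status = pod.get("status", {})
        ns, name = meta.get("namespace", ""), meta.get("name", "")
        phase = status.get("phase", "Unknown")

        # Succeeded pods (displayed as Completed by kubectl) are finished work.
        if phase == "Succeeded":
            continue
        # Pending, Failed, and Unknown phases are all reported.
        if phase != "Running":
            problems.append((ns, name, f"phase={phase}"))
            continue

        # A Running pod can still have containers that are waiting or flapping.
        for cs in status.get("containerStatuses", []):
            waiting = cs.get("state", {}).get("waiting") or {}
            reason = waiting.get("reason")
            if reason in BAD_WAITING_REASONS:
                problems.append((ns, name, reason))
            elif not cs.get("ready", False):
                problems.append((ns, name, "container not ready"))
            elif cs.get("restartCount", 0) >= restart_threshold:
                problems.append((ns, name, f"restarts={cs['restartCount']}"))
    return problems
```

One way to use it is to pipe the kubectl output into a script that calls `find_unhealthy_pods(json.load(sys.stdin))` and prints each tuple. An empty result means Layer 1 passes under these assumptions.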
Layer 2: Service Endpoint Verification
- Verify the Istio ingress gateway service has an external IP or hostname assigned, or is accessible via NodePort
- Verify the Central Dashboard is routable through the ingress gateway
- Verify Dex is responding to OIDC discovery requests
- Verify Kubeflow Pipelines API is accessible
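Two of the endpoint checks above lend themselves to a short sketch: reading the ingress gateway's address from its Service object, and sanity-checking the OIDC discovery document returned by Dex. The helper names below are hypothetical. The discovery document is assumed to have been fetched separately from the `.well-known/openid-configuration` path under the Dex issuer URL, and the fields checked are only a subset of those the OpenID Connect Discovery specification marks as required.

```python
# Subset of the fields required by the OpenID Connect Discovery specification.
REQUIRED_OIDC_FIELDS = (
    "issuer",
    "authorization_endpoint",
    "token_endpoint",
    "jwks_uri",
)


def gateway_addresses(service):
    """Return address hints for a Service parsed from `kubectl get svc ... -o json`.

    An empty list means no external address is available yet.
    """
    spec = service.get("spec", {})
    svc_type = spec.get("type", "ClusterIP")
    if svc_type == "LoadBalancer":
        lb = service.get("status", {}).get("loadBalancer", {})
        ingress = lb.get("ingress") or []
        # Each entry carries an IP or a hostname, depending on the provider.
        addrs = [e.get("ip") or e.get("hostname") for e in ingress]
        return [a for a in addrs if a]
    if svc_type == "NodePort":
        # The node IP is not part of the Service object, so a placeholder is used.
        return [f"<node-ip>:{p['nodePort']}"
                for p in spec.get("ports", []) if "nodePort" in p]
    return []  # ClusterIP: reachable only via port-forward or another proxy


def missing_oidc_fields(discovery_doc):
    """Return the required discovery fields that are absent or empty."""
    return [f for f in REQUIRED_OIDC_FIELDS if not discovery_doc.get(f)]
```

A LoadBalancer service that returns an empty list is typically still waiting for the cloud provider to assign an address. A non-empty result from `missing_oidc_fields` suggests Dex is reachable but not serving a usable OIDC configuration.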
Layer 3: Authentication Flow Verification
- Access the Central Dashboard URL
- Confirm redirect to Dex login page
- Authenticate with test credentials
- Confirm redirect back to Dashboard with a valid session
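The first half of this flow (an unauthenticated request ends up at the Dex login page) can be checked from a recorded redirect chain. The sketch below is a rough illustration with loud assumptions: the function name `reaches_login` is hypothetical, the chain is assumed to have been captured beforehand as `(status_code, Location header)` pairs, and the `/dex/` path prefix is an assumed default that depends on how Dex is exposed in your deployment.

```python
from urllib.parse import urlparse

REDIRECT_CODES = {301, 302, 303, 307, 308}


def reaches_login(hops, login_prefix="/dex/"):
    """Return True if a redirect chain leads to the login path.

    hops is a list of (status_code, location_header_or_None) tuples in
    request order. login_prefix is an assumed default, not a fixed
    Kubeflow contract.
    """
    for status, location in hops:
        # A non-redirect response before the login page ends the chain.
        if status not in REDIRECT_CODES or not location:
            return False
        if urlparse(location).path.startswith(login_prefix):
            return True
    return False
```

A `200` on the very first request without any redirect usually means the Dashboard is being served without authentication, which is worth investigating in a multi-user deployment.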
Layer 4: Functional Smoke Test
- Select a user namespace from the Dashboard
- Verify Notebooks, Pipelines, and other component UIs are accessible
- Optionally submit a simple test pipeline or create a test notebook server
- Verify the workload completes successfully
Verification Outcome:
- If all layers pass, the deployment is confirmed healthy
- If any layer fails, investigate the specific failure before declaring the deployment complete
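The layered process and its outcome rule can be sketched as a small runner that executes checks in order. Stopping at the first failing layer is a design assumption of this example, on the reasoning that higher layers depend on lower ones. The name `run_layers` and the check callables are hypothetical.

```python
from typing import Callable, List, Optional, Tuple

# A layer is a display name plus a zero-argument callable returning True on success.
Check = Tuple[str, Callable[[], bool]]


def run_layers(layers: List[Check]) -> Optional[str]:
    """Run checks in order and return the name of the first failing layer.

    Returns None when every layer passes. An exception raised by a check
    is treated as a failure of that layer.
    """
    for name, check in layers:
        try:
            ok = check()
        except Exception:
            ok = False
        if not ok:
            return name  # later layers are skipped in this sketch
    return None
```

A caller would pass the four layers in order, for example `[("pod health", check_pods), ("service endpoints", check_endpoints), ...]`, where each callable is supplied by the operator. A `None` result corresponds to a deployment that is confirmed healthy.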