Workflow:Kubeflow Kubeflow Platform Deployment

From Leeroopedia
Knowledge Sources
Domains: MLOps, Kubernetes, Platform Engineering
Last Updated: 2026-02-13 14:00 GMT

Overview

End-to-end process for deploying the Kubeflow AI reference platform on a Kubernetes cluster using either Kubeflow Manifests or Packaged Distributions.

Description

This workflow covers the complete deployment of Kubeflow onto a Kubernetes cluster. Kubeflow can be installed in two ways: via Kubeflow Manifests (the community-maintained kustomize overlays in the kubeflow/manifests repository) or via Packaged Distributions (vendor-specific installers from providers like AWS, Google Cloud, or other certified distributors). The workflow addresses prerequisites, cluster preparation, installation method selection, component deployment, multi-user configuration, and post-deployment verification.

Usage

Execute this workflow when you need to provision a Kubeflow AI platform on Kubernetes for your organization. This applies when you have an existing Kubernetes cluster (v1.29+) and want to deploy the full Kubeflow suite (Pipelines, Notebooks, Training, Serving, Katib, Model Registry) or a subset of components for your AI/ML teams.

Execution Steps

Step 1: Prerequisites Validation

Verify that all infrastructure prerequisites are met before beginning installation. This includes confirming Kubernetes version compatibility, ensuring sufficient cluster resources (CPU, memory, storage), validating network policies, and checking that the required tools are installed (kubectl, plus kustomize or the distribution-specific CLI).

Key considerations:

  • Kubernetes version must be 1.29 or later for Kubeflow 1.10+
  • A default StorageClass must be configured for persistent volumes
  • The Istio service mesh is required for the full platform
  • Sufficient node capacity for all Kubeflow control plane components
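
The checks above can be sketched in Python. This is a minimal sketch under stated assumptions, not an official preflight tool. The function names are hypothetical. It assumes the server version string has the usual `v<major>.<minor>.<patch>` shape with an optional vendor suffix (as reported in the `serverVersion.gitVersion` field of `kubectl version -o json`), and that the default StorageClass carries the standard `storageclass.kubernetes.io/is-default-class` annotation.

```python
import re
import shutil

MIN_K8S = (1, 29)  # minimum taken from the list above (Kubeflow 1.10+)


def parse_k8s_version(git_version: str) -> tuple[int, int]:
    """Extract (major, minor) from a string such as 'v1.29.3-gke.100'."""
    match = re.match(r"v?(\d+)\.(\d+)", git_version)
    if match is None:
        raise ValueError(f"unrecognized Kubernetes version: {git_version!r}")
    return int(match.group(1)), int(match.group(2))


def meets_minimum(git_version: str, minimum: tuple[int, int] = MIN_K8S) -> bool:
    """True when the cluster version is at or above the required minimum."""
    return parse_k8s_version(git_version) >= minimum


def missing_tools(tools: tuple[str, ...] = ("kubectl", "kustomize")) -> list[str]:
    """Return the required CLI tools that are not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


def has_default_storage_class(storage_classes: dict) -> bool:
    """Inspect parsed `kubectl get storageclass -o json` output for a default class."""
    key = "storageclass.kubernetes.io/is-default-class"
    return any(
        item.get("metadata", {}).get("annotations", {}).get(key) == "true"
        for item in storage_classes.get("items", [])
    )
```

In practice the inputs would come from `kubectl version -o json` and `kubectl get storageclass -o json`, parsed with `json.loads`. Node capacity and network policies are not covered by this sketch and still need a manual review.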

Step 2: Installation Method Selection

Choose between Kubeflow Manifests (community upstream) and a Packaged Distribution (vendor-managed). The manifests approach gives full control and customization but requires more operational expertise. Packaged Distributions offer simplified installation, vendor support, and cloud-native integrations at the cost of some flexibility.

Key considerations:

  • Manifests are maintained by the community in the kubeflow/manifests repository
  • Packaged Distributions are available from AWS, Google Cloud, and other vendors
  • For production environments, a Packaged Distribution may reduce operational burden
  • For development or customization, Manifests provide the most flexibility

Step 3: Core Infrastructure Deployment

Deploy the foundational infrastructure components that Kubeflow depends on. This includes the Istio service mesh, cert-manager for TLS certificates, Dex for authentication (or the cloud provider's identity service), and the central dashboard with profile controller.

Key considerations:

  • Istio must be deployed before Kubeflow components
  • cert-manager handles automatic TLS certificate provisioning
  • Dex provides a reference OpenID Connect implementation for AuthN
  • The profile controller manages per-user namespace isolation
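
One way to keep the install order explicit is to model it as a dependency graph and sort it topologically. The sketch below uses Python's standard `graphlib` module. Only the rule that Istio precedes Kubeflow components comes from the list above. The other edges, and the component keys themselves, are illustrative assumptions and should be adjusted to the chosen installation method.

```python
from graphlib import TopologicalSorter

# Maps each component to the components it must wait for.
# Only "Istio before Kubeflow components" is stated in this workflow;
# edges marked "assumed" are illustrative.
DEPENDENCIES: dict[str, set[str]] = {
    "cert-manager": set(),
    "istio": {"cert-manager"},                              # assumed
    "dex": {"istio"},                                       # assumed
    "profile-controller": {"istio"},
    "central-dashboard": {"istio", "profile-controller"},   # second edge assumed
}


def deployment_order(dependencies: dict[str, set[str]]) -> list[str]:
    """Return an install order in which every dependency precedes its dependents."""
    # static_order() raises graphlib.CycleError if the graph contains a cycle.
    return list(TopologicalSorter(dependencies).static_order())


order = deployment_order(DEPENDENCIES)
```

The resulting list can drive a loop that applies each component and waits for it to become ready before moving on.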

Step 4: Component Deployment

Install the desired Kubeflow sub-projects. Each component can be deployed independently or as part of the full platform. The core components include Kubeflow Pipelines, Notebooks (Workbenches), Trainer, KServe, Katib, Model Registry, and Spark Operator.

What is deployed:

  • Kubeflow Pipelines for workflow orchestration
  • Notebooks for interactive development environments
  • Trainer for distributed model training jobs
  • KServe for model serving and inference
  • Katib for hyperparameter tuning and neural architecture search
  • Model Registry for model versioning and tracking
  • Spark Operator for large-scale data processing

Step 5: Multi-user Configuration

Configure multi-user isolation and access controls. This involves setting up user profiles (namespaces), configuring RBAC policies, integrating with the organization's identity provider, and establishing resource quotas per team or user.

Key considerations:

  • Each user or team gets an isolated Kubernetes namespace
  • Pod Security Standards are enforced (restricted for system namespaces, baseline for user namespaces)
  • RBAC policies control access to shared resources like pipelines and models
  • Integration with external identity providers via OIDC
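
As an illustration of per-user namespaces with quotas, the sketch below builds a Kubeflow Profile manifest as a Python dictionary and serializes it to JSON, which `kubectl apply -f` accepts alongside YAML. The field layout follows the `kubeflow.org/v1` Profile resource as commonly documented, so verify it against the CRD installed in your cluster. The helper `build_profile`, the profile name, the email address, and the quota values are all hypothetical placeholders.

```python
import json


def build_profile(name: str, owner_email: str, hard_quota: dict[str, str]) -> dict:
    """Build a Kubeflow Profile manifest as a plain dict."""
    return {
        "apiVersion": "kubeflow.org/v1",
        "kind": "Profile",
        # The profile controller creates a namespace with the same name.
        "metadata": {"name": name},
        "spec": {
            "owner": {"kind": "User", "name": owner_email},
            # Rendered into a ResourceQuota inside the profile namespace.
            "resourceQuotaSpec": {"hard": hard_quota},
        },
    }


profile = build_profile(
    name="team-a",                               # hypothetical team namespace
    owner_email="alice@example.com",             # hypothetical user identity
    hard_quota={"cpu": "8", "memory": "32Gi"},   # placeholder limits
)
manifest_json = json.dumps(profile, indent=2)
```

Writing `manifest_json` to a file and applying it with `kubectl apply -f` would create the profile. The owner name must match the identity that the configured OIDC provider reports for that user.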

Step 6: Post-deployment Verification

Validate that all deployed components are healthy and accessible. This includes checking pod status across all Kubeflow namespaces, verifying the Central Dashboard is reachable, testing authentication flows, confirming pipeline execution works, and running a sample notebook and training job.

Key considerations:

  • All pods in the kubeflow namespace should be Running or Completed
  • Central Dashboard should be accessible via the configured ingress
  • Authentication and authorization should work for test users
  • Run a sample pipeline to validate end-to-end functionality
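
The pod status check can be automated against the JSON that `kubectl get pods -n kubeflow -o json` returns. The sketch below is a hedged example rather than a complete health check. The function name `unhealthy_pods` is hypothetical. It treats the `Succeeded` phase as the equivalent of the "Completed" status that kubectl typically displays, and additionally flags Running pods whose containers are not yet ready.

```python
HEALTHY_PHASES = {"Running", "Succeeded"}  # kubectl typically shows Succeeded as "Completed"


def unhealthy_pods(pod_list: dict) -> list[str]:
    """Return names of pods that are neither Running-and-ready nor Succeeded."""
    names = []
    for pod in pod_list.get("items", []):
        status = pod.get("status", {})
        phase = status.get("phase")
        containers = status.get("containerStatuses", [])
        # A Running pod can still have containers that fail readiness checks.
        not_ready = phase == "Running" and not all(c.get("ready") for c in containers)
        if phase not in HEALTHY_PHASES or not_ready:
            names.append(pod["metadata"]["name"])
    return names


# Hand-written sample data standing in for real kubectl output.
sample = {
    "items": [
        {"metadata": {"name": "ok-pod"},
         "status": {"phase": "Running", "containerStatuses": [{"ready": True}]}},
        {"metadata": {"name": "finished-job"},
         "status": {"phase": "Succeeded", "containerStatuses": [{"ready": False}]}},
        {"metadata": {"name": "pending-pod"}, "status": {"phase": "Pending"}},
        {"metadata": {"name": "crashing-pod"},
         "status": {"phase": "Running", "containerStatuses": [{"ready": False}]}},
    ]
}
problem_pods = unhealthy_pods(sample)
```

An empty result for every Kubeflow-related namespace is a reasonable gate before moving on to the dashboard, authentication, and sample pipeline checks.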
