Workflow:DistrictDataLabs Yellowbrick Cluster Analysis

From Leeroopedia



Knowledge Sources
Domains: Machine_Learning, Clustering, Unsupervised_Learning
Last Updated: 2026-02-08 12:00 GMT

Overview

End-to-end process for selecting the optimal number of clusters and evaluating cluster quality using Yellowbrick's clustering visualizers.

Description

This workflow covers a standard procedure for unsupervised cluster analysis using Yellowbrick's visual diagnostic tools. The primary challenge in clustering is selecting the number of clusters (K). The workflow first applies the K-Elbow method, which plots a scoring metric across a range of cluster counts and automatically detects the "elbow", the point of maximum curvature beyond which adding clusters yields diminishing returns. It then validates the candidate K using Silhouette analysis, which shows how well each sample fits its assigned cluster. Finally, it visualizes intercluster relationships using the Intercluster Distance Map.

Key outputs:

  • K-Elbow plot with automatic elbow detection for optimal K selection
  • Silhouette plot showing per-sample cluster membership quality
  • Intercluster distance map showing relative cluster sizes and separation

Usage

Execute this workflow when you have an unlabeled dataset and need to discover natural groupings, select the number of clusters for a K-based algorithm (KMeans, MiniBatchKMeans), or evaluate cluster quality. This is useful for customer segmentation, document grouping, anomaly detection preprocessing, or any unsupervised pattern discovery task.

Execution Steps

Step 1: Load and Preprocess Data

Load the dataset and prepare features for clustering. Clustering algorithms are sensitive to feature scaling, so standardization or normalization is typically required. Remove or encode categorical features as needed.

Key considerations:

  • Use Yellowbrick's built-in loaders (e.g., load_nfl) for experimentation
  • Standardize features to zero mean and unit variance
  • Clustering uses only X (feature matrix); y is optional for coloring/validation
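The preprocessing step can be sketched as follows. This is a minimal illustration, not the workflow's own code: it assumes scikit-learn is installed and uses a synthetic make_blobs dataset as a stand-in for load_nfl or a real feature matrix. The blob parameters and the artificially inflated first feature are assumptions chosen only to show the effect of scaling.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a real dataset (assumption: 4 blobs, 6 features).
X_raw, y = make_blobs(n_samples=500, n_features=6, centers=4, random_state=42)

# Inflate one feature so it would dominate Euclidean distances if left unscaled.
X_raw[:, 0] *= 1000.0

# Standardize every feature to zero mean and unit variance.
X = StandardScaler().fit_transform(X_raw)

print("feature means:", np.round(X.mean(axis=0), 6))
print("feature stds: ", np.round(X.std(axis=0), 6))
```

Only X is passed on to the clustering steps; y is kept here purely so it could be used later for coloring or validation.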

Step 2: Find Optimal K with Elbow Method

Use the KElbowVisualizer to evaluate a range of K values. The visualizer fits the clustering model once per K, computes a scoring metric (distortion, silhouette, or Calinski-Harabasz), and plots the score against K. Yellowbrick's built-in KneeLocator, an implementation of the "kneedle" knee-detection algorithm, then identifies the elbow point where adding more clusters yields diminishing returns.

Key considerations:

  • Default metric is distortion (sum of squared distances to nearest cluster center)
  • Silhouette score and Calinski-Harabasz index are alternative metrics
  • A timing curve showing the fit time at each K is overlaid by default (timings=True)
  • When an elbow is detected, it is marked with a dashed vertical line on the plot

Step 3: Validate with Silhouette Analysis

Use the SilhouetteVisualizer to assess cluster quality at the chosen K. The silhouette coefficient measures how similar each sample is to its own cluster compared with the nearest neighboring cluster. It ranges from -1 (likely assigned to the wrong cluster) through 0 (on the boundary between two clusters) to +1 (well matched to its own cluster).
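For reference, the per-sample coefficient is commonly defined as below, where a(i) is the mean distance from sample i to the other members of its own cluster and b(i) is the mean distance from sample i to the members of the nearest other cluster. The notation is introduced here for illustration and does not come from the workflow itself.

```latex
s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}, \qquad -1 \le s(i) \le 1
```

The average of s(i) over all samples is the overall silhouette score.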

What to look for:

  • Silhouettes of similar thickness across clusters indicate balanced cluster sizes
  • Clusters with many negative coefficients suggest misassigned points
  • The vertical dashed line shows the average silhouette score across all samples
  • Wide variation in silhouette thickness or score between clusters may indicate a suboptimal K

Step 4: Visualize Intercluster Distance

Use the InterclusterDistance visualizer to produce a 2D embedding that shows the relative size and distance between clusters. Cluster centers are projected via MDS (Multidimensional Scaling) and rendered as circles proportional to cluster membership count.

What to look for:

  • Well-separated circles indicate distinct clusters
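  • Overlapping circles suggest clusters that may need to be merged, although overlap in the 2D embedding does not necessarily mean overlap in the original feature space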
  • Very small circles may indicate noise clusters or outliers

Step 5: Render and Interpret

Render all three visualizations together to form a complete picture of cluster quality. Use the elbow K as a starting point, validate with silhouette analysis, and confirm spatial separation with the intercluster distance map.

Key considerations:

  • Quick methods (kelbow_visualizer, silhouette_visualizer, intercluster_distance) enable one-liner analysis
  • Iterate by trying different K values near the elbow if silhouette analysis suggests adjustments
  • Save visualizations for reporting via show(outpath="filename.png")
