Implementation:DistrictDataLabs Yellowbrick InterclusterDistance Visualizer
| Knowledge Sources | |
|---|---|
| Domains | Machine_Learning, Clustering, Visualization |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
Concrete tool for visualizing the relative positions and sizes of clusters via 2D embedding of cluster centroids, provided by the Yellowbrick library.
Description
InterclusterDistance is a Yellowbrick visualizer that creates an intercluster distance map by embedding high-dimensional cluster centers into a two-dimensional space and rendering each cluster as a circle whose size reflects a scoring metric (by default, cluster membership count). The embedding aims to preserve the relative distances between cluster centers, so the spatial layout of circles in the plot approximates the relationships between clusters in the original feature space. Because the projection is lossy, circles that overlap in the plot do not necessarily overlap in the original space.
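To illustrate the embedding idea outside of Yellowbrick, the following minimal sketch projects a handful of made-up cluster centers into 2D with scikit-learn's MDS and compares pairwise distances before and after. The `centers` array and the `pairwise_distances` helper are hypothetical and exist only for this example; it is not the visualizer's own code.

```python
import numpy as np
from sklearn.manifold import MDS

# Hypothetical 5-dimensional cluster centers (illustrative values only).
centers = np.array([
    [0.0, 0.0, 0.0, 0.0, 0.0],
    [4.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 3.0, 0.0, 0.0, 0.0],
    [8.0, 8.0, 8.0, 8.0, 8.0],
])


def pairwise_distances(points):
    """Euclidean distance between every pair of rows (hypothetical helper)."""
    diff = points[:, None, :] - points[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


# Embed the centers into two dimensions; random_state makes the layout repeatable.
embedded = MDS(n_components=2, random_state=0).fit_transform(centers)

original = pairwise_distances(centers)
projected = pairwise_distances(embedded)

# Index of the center farthest from center 0, before and after embedding.
farthest_before = int(np.argmax(original[0]))
farthest_after = int(np.argmax(projected[0]))
print(embedded.shape, farthest_before, farthest_after)
```

The comparison only checks that the most distant center stays the most distant; exact distances generally change under the projection.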
The visualizer supports two embedding algorithms: MDS (Multidimensional Scaling) and t-SNE (t-distributed Stochastic Neighbor Embedding). The only currently supported scoring metric is membership (the number of data points assigned to each cluster). Each cluster is drawn as a scatter marker whose size increases with its score, bounded by min_size and max_size, and a numeric label is placed at its center. The size legend, enabled by default, displays reference circles at the 25th, 50th, and 75th percentiles of the scores.
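As a rough illustration of how scores can be turned into marker sizes, the sketch below maps membership counts onto the range between min_size and max_size with plain linear min-max scaling. The function `scale_to_marker_sizes` is hypothetical; Yellowbrick's own helper may use a different formula (for example, a power transform), so treat this as an assumption rather than the library's behavior.

```python
import numpy as np


def scale_to_marker_sizes(scores, min_size=400, max_size=25000):
    """Map scores onto [min_size, max_size] by linear min-max scaling (assumed)."""
    scores = np.asarray(scores, dtype=float)
    span = scores.max() - scores.min()
    if span == 0:
        # All clusters share the same score: give each the midpoint size.
        return np.full_like(scores, (min_size + max_size) / 2.0)
    normalized = (scores - scores.min()) / span
    return min_size + (max_size - min_size) * normalized


membership = [120, 480, 300]  # hypothetical number of points per cluster
sizes = scale_to_marker_sizes(membership)
print(sizes)
```

Under this assumed scaling the smallest cluster always receives min_size and the largest always receives max_size, regardless of the absolute counts.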
The class extends ClusteringScoreVisualizer and follows Yellowbrick's standard fit() / draw() / finalize() / show() API pattern. It is also aliased as ICDM.
Usage
Use InterclusterDistance after you have chosen a value of k to visualize the resulting cluster structure. It is especially helpful for understanding whether clusters are well-separated and for identifying the relative population sizes of clusters. Import it, wrap your scikit-learn clusterer, call fit(X), and then show().
Code Reference
Source Location
- Repository: yellowbrick
- File: yellowbrick/cluster/icdm.py
- Class Definition: Lines 61-425
- Key Methods: __init__ (L164-206), fit (L279-299), draw (L301-326)
- Quick Method: intercluster_distance() (L469-599)
Signature
class InterclusterDistance(ClusteringScoreVisualizer):
    def __init__(
        self,
        estimator,
        ax=None,
        min_size=400,
        max_size=25000,
        embedding="mds",
        scoring="membership",
        legend=True,
        legend_loc="lower left",
        legend_size=1.5,
        random_state=None,
        is_fitted="auto",
        **kwargs
    ):
Import
from yellowbrick.cluster import InterclusterDistance
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| estimator | scikit-learn clusterer | Yes | A centroidal clustering estimator with cluster_centers_ and labels_ attributes (e.g., KMeans, MiniBatchKMeans). |
| ax | matplotlib Axes | No | The axes to plot the figure on. If None, the current axes are used or generated. |
| min_size | int | No | Minimum marker size in points for the smallest cluster. Default: 400. |
| max_size | int | No | Maximum marker size in points for the largest cluster. Default: 25000. |
| embedding | str | No | Dimensionality reduction algorithm for embedding cluster centers: "mds" or "tsne". Default: "mds". |
| scoring | str | No | Scoring metric for cluster sizes: "membership" (count of assigned points). Default: "membership". |
| legend | bool | No | Whether to draw a size legend showing reference cluster sizes. Default: True. |
| legend_loc | str | No | Location of the size legend (any valid matplotlib legend location string). Default: "lower left". |
| legend_size | float | No | Size of the inset legend axes in inches. Default: 1.5. |
| random_state | int or RandomState | No | Random state for reproducibility of the embedding algorithm. Default: None. |
| is_fitted | bool or str | No | Whether the estimator is already fitted. "auto" checks automatically. Default: "auto". |
The fit() method accepts:
| Name | Type | Required | Description |
|---|---|---|---|
| X | array-like of shape (n_samples, n_features) | Yes | Feature matrix to cluster and visualize. |
| y | array-like of shape (n_samples,) | No | Ignored. Present for API consistency. |
Outputs
| Name | Type | Description |
|---|---|---|
| cluster_centers_ | array of shape (n_clusters, n_features) | The cluster centers retrieved from the fitted estimator. |
| embedded_centers_ | array of shape (n_clusters, 2) | The 2D positions of cluster centers after embedding. |
| scores_ | array of shape (n_clusters,) | The scoring metric values (e.g., membership counts) for each cluster. |
| fit_time_ | Timer | The elapsed time for fitting the clustering model and performing the embedding. |
Usage Examples
Basic Usage
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from yellowbrick.cluster import InterclusterDistance
# Generate synthetic data
X, y = make_blobs(n_samples=1000, n_features=12, centers=6, random_state=42)
# Instantiate the clustering model and visualizer
model = KMeans(n_clusters=6, random_state=42)
visualizer = InterclusterDistance(model)
# Fit and show the intercluster distance map
visualizer.fit(X)
visualizer.show()
Customizing the Visualization
from sklearn.cluster import KMeans
from yellowbrick.cluster import InterclusterDistance
# X is the feature matrix created in the Basic Usage example
model = KMeans(n_clusters=8, random_state=42)
visualizer = InterclusterDistance(
    model,
    embedding="tsne",
    min_size=500,
    max_size=20000,
    legend_loc="upper right",
    random_state=42,
)
visualizer.fit(X)
visualizer.show()
Quick Method
from sklearn.cluster import KMeans
from yellowbrick.cluster.icdm import intercluster_distance
# One-liner: creates, fits, and shows the visualizer
# (X is the feature matrix created in the Basic Usage example)
viz = intercluster_distance(KMeans(n_clusters=6, random_state=42), X)
Internal Workflow
The fit() method executes the following steps:
- Checks whether the wrapped estimator is already fitted (controlled by is_fitted). If not fitted, calls estimator.fit(X, y) within a timer.
- Retrieves the cluster centers from the estimator's cluster_centers_ attribute.
- Applies the embedding algorithm (MDS or t-SNE) via fit_transform() on the cluster centers to obtain 2D coordinates (embedded_centers_).
- Computes the cluster scores using the specified scoring method (e.g., np.bincount(labels_) for membership).
- Calls draw(), which computes marker sizes from scores using prop_to_size(), draws scatter points at the embedded coordinates, and annotates each cluster with its numeric index.
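The non-drawing steps above can be approximated directly with scikit-learn and NumPy. This is a minimal sketch of the same sequence (fit, read centers, embed, count members), not Yellowbrick's actual implementation; the variable names are chosen for this example, and the timer and the is_fitted check are omitted.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.manifold import MDS

# Synthetic data matching the Basic Usage example.
X, _ = make_blobs(n_samples=1000, n_features=12, centers=6, random_state=42)

# Step 1: fit the clusterer.
model = KMeans(n_clusters=6, n_init=10, random_state=42).fit(X)

# Step 2: read the cluster centers from the fitted estimator.
centers = model.cluster_centers_

# Step 3: embed the centers into two dimensions.
embedded_centers = MDS(n_components=2, random_state=42).fit_transform(centers)

# Step 4: membership score = number of samples assigned to each cluster.
scores = np.bincount(model.labels_, minlength=6)

print(embedded_centers.shape, scores)
```

Only the cluster centers are embedded, not the full dataset, so the embedding step stays cheap even for large X.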
The finalize() method sets the title, configures an origin-centered grid, and optionally draws an inset size legend showing reference circles at the 25th, 50th, and 75th percentiles of the scores.
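To make the legend's reference points concrete, the sketch below computes the 25th, 50th, and 75th percentiles of a set of hypothetical membership counts with NumPy. How Yellowbrick selects the exact reference values is not described here, so the use of np.percentile with its default linear interpolation is an assumption.

```python
import numpy as np

# Hypothetical membership counts for six clusters.
scores = np.array([95, 140, 160, 185, 200, 220])

# Reference scores at the 25th, 50th, and 75th percentiles
# (np.percentile interpolates linearly between neighbors by default).
reference_scores = np.percentile(scores, [25, 50, 75])
print(reference_scores)
```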