Principle:DistrictDataLabs Yellowbrick Intercluster Distance Mapping
| Knowledge Sources | |
|---|---|
| Domains | Machine_Learning, Clustering, Model_Evaluation |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
Intercluster Distance Mapping is a technique for visualizing the relative positions and sizes of clusters by embedding their high-dimensional centroids into a two-dimensional space while preserving inter-centroid distances.
Description
When working with clustering algorithms in high-dimensional feature spaces, it is difficult to assess whether clusters are well-separated or overlapping. Intercluster Distance Mapping addresses this by projecting cluster centers from their original high-dimensional space into two dimensions using a dimensionality reduction technique. The projection preserves the relative distances between cluster centers: clusters that are close together in the original feature space appear close in the 2D embedding, and those that are far apart appear distant.
In the resulting visualization, each cluster is represented as a circle positioned at its embedded center. The size of each circle encodes a scoring metric, typically membership (the count of data points assigned to that cluster). This gives an immediate sense of both the spatial relationships between clusters and their relative importance. Large, well-separated circles indicate a healthy clustering with distinct, well-populated groups. Overlapping circles may suggest that those clusters are difficult to distinguish in the feature space. However, because dimensionality reduction loses information, overlap in the 2D embedding does not necessarily imply overlap in the original feature space.
Two embedding algorithms are commonly used: Multidimensional Scaling (MDS), which minimizes the stress (the mismatch between pairwise distances in the original and embedded spaces), and t-SNE, which preserves local neighborhood structure through a probabilistic approach. MDS is the default choice because it emphasizes global distance preservation and gives reproducible results when the random state is fixed, making it well-suited for showing how cluster centers relate to one another.
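The embedding step can be sketched with scikit-learn directly. This is a minimal illustration, not Yellowbrick's own implementation: the synthetic dataset, the choice of 4 clusters, and all variable names are assumptions made for the example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.manifold import MDS

# Synthetic data (assumed for illustration): 500 points in 10 dimensions.
X, _ = make_blobs(n_samples=500, n_features=10, centers=4, random_state=0)

# Fit k-means; cluster_centers_ has one 10-dimensional row per cluster.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
centers = kmeans.cluster_centers_

# Embed the cluster centers into 2D with MDS. Fixing random_state makes
# the layout reproducible from run to run.
mds = MDS(n_components=2, random_state=0)
embedded = mds.fit_transform(centers)

print(embedded.shape)  # one 2D coordinate per cluster center
```

Each row of `embedded` is the position at which the corresponding cluster's circle would be drawn. Only the cluster centers are embedded, not the full dataset, so the projection stays cheap even for large inputs.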
Usage
Use Intercluster Distance Mapping when:
- You want to understand the spatial relationships between cluster centers in a high-dimensional space.
- You need to identify clusters that may be too close together and potentially redundant or poorly separated.
- You want to visualize the relative sizes (memberships) of clusters to assess cluster balance.
- You have already determined a value of k and want to evaluate the resulting clustering structure.
Limitations:
- The 2D embedding inevitably loses information from the original high-dimensional space. Overlap in the visualization does not prove overlap in feature space.
- Requires the clustering algorithm to produce explicit `cluster_centers_` (e.g., k-means, mini-batch k-means). Hierarchical or density-based methods may not be directly supported.
- The embedding can be sensitive to the random state, especially with t-SNE.
Theoretical Basis
Dimensionality Reduction of Cluster Centers
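Given $k$ cluster centers $c_1, \dots, c_k$ in $\mathbb{R}^d$, the goal is to find a mapping $f: \mathbb{R}^d \to \mathbb{R}^2$ such that pairwise distances are approximately preserved:

$$\lVert f(c_i) - f(c_j) \rVert \approx \lVert c_i - c_j \rVert \quad \text{for all } i, j$$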
Multidimensional Scaling (MDS)
MDS achieves this by minimizing a stress function. Metric MDS minimizes a stress of the form:

$$\sigma = \sum_{i < j} \left( d_{ij} - \hat{d}_{ij} \right)^2$$

where $d_{ij}$ is the distance between centers $i$ and $j$ in the original space and $\hat{d}_{ij}$ is the distance between them in the embedded 2D space. The result is a set of 2D coordinates that best preserves the original inter-centroid distances.
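To make the stress concrete, the sketch below computes a raw (unnormalized) stress for a given embedding using only NumPy. The helper names `pairwise_distances` and `raw_stress` are hypothetical, and the three example centers are made up. Library implementations may normalize or weight the stress differently.

```python
import numpy as np

def pairwise_distances(points):
    """Return the matrix of Euclidean distances between the rows of `points`."""
    diff = points[:, None, :] - points[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def raw_stress(centers, embedded):
    """Sum of squared differences between original and embedded distances,
    counting each unordered pair (i < j) once."""
    d_orig = pairwise_distances(np.asarray(centers, dtype=float))
    d_emb = pairwise_distances(np.asarray(embedded, dtype=float))
    upper = np.triu_indices(len(d_orig), k=1)
    return float(((d_orig[upper] - d_emb[upper]) ** 2).sum())

# Three illustrative centers in 3D that happen to lie in the z = 0 plane.
centers = np.array([[0.0, 0.0, 0.0], [3.0, 0.0, 0.0], [0.0, 4.0, 0.0]])

perfect = centers[:, :2]          # dropping z preserves every distance
squashed = perfect * [1.0, 0.5]   # halving y distorts two of the three distances

print(raw_stress(centers, perfect))   # zero: all distances preserved
print(raw_stress(centers, squashed))  # positive: distances were distorted
```

A lower stress means the 2D layout is a more faithful picture of the original inter-centroid distances. A stress of zero is only attainable when the centers already lie in a 2D subspace, as in this toy case.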
Cluster Sizing by Membership
Each cluster's visual size is determined by its membership count (the number of data points assigned to it):

$$s_j = \left| \{ x_i : \operatorname{label}(x_i) = j \} \right|$$
These counts are scaled to marker areas using a proportional sizing function that maps raw scores to a range between a minimum and maximum marker size, providing an intuitive representation of relative cluster populations.
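The counting and scaling steps can be sketched as follows. The linear min-max mapping and the helper names `membership_counts` and `scale_to_marker_sizes` are assumptions for illustration; the sizing function in an actual library may use a different curve (for example, a power or log transform) and different default sizes.

```python
import numpy as np

def membership_counts(labels, n_clusters):
    """Count how many points were assigned to each cluster index."""
    return np.bincount(labels, minlength=n_clusters)

def scale_to_marker_sizes(scores, min_size=100.0, max_size=500.0):
    """Linearly map scores onto [min_size, max_size] (hypothetical helper)."""
    scores = np.asarray(scores, dtype=float)
    span = scores.max() - scores.min()
    if span == 0:
        # All clusters have the same score: draw them at the midpoint size.
        return np.full_like(scores, (min_size + max_size) / 2.0)
    return min_size + (scores - scores.min()) / span * (max_size - min_size)

# Made-up cluster assignments for six points and three clusters.
labels = np.array([0, 0, 0, 1, 2, 2])
counts = membership_counts(labels, n_clusters=3)
sizes = scale_to_marker_sizes(counts)

print(counts)  # [3 1 2]
print(sizes)   # [500. 100. 300.]
```

With this mapping the smallest cluster always gets the minimum marker size and the largest gets the maximum, so the circles show relative rather than absolute populations.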