
Principle:Scikit learn Scikit learn Clustering

From Leeroopedia


Knowledge Sources
Domains Unsupervised Learning, Pattern Recognition
Last Updated 2026-02-08 15:00 GMT

Overview

Clustering is the task of grouping a set of objects such that objects within the same group are more similar to each other than to those in other groups.

Description

Clustering algorithms partition data into meaningful subgroups (clusters) without requiring labeled examples, making them a core unsupervised learning technique. They address the fundamental problem of discovering hidden structure in unlabeled datasets. Different algorithms make different assumptions about cluster shape, density, and connectivity, leading to a rich family of methods suited to different data characteristics. Clustering sits at the intersection of exploratory data analysis, pattern recognition, and data compression.

Usage

Use clustering when you need to discover natural groupings in data without prior labels. Common applications include customer segmentation, document grouping, image segmentation, anomaly detection (via small or singleton clusters), and as a preprocessing step for supervised learning. Choose centroid-based methods (e.g., K-Means) for globular clusters of roughly equal size, density-based methods (e.g., DBSCAN, OPTICS) when clusters have irregular shapes or noise is present, and hierarchical methods (e.g., Agglomerative Clustering) when a nested cluster hierarchy is meaningful.

Theoretical Basis

Centroid-based clustering (K-Means) minimizes within-cluster sum of squares:

$$J = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2$$

where $\mu_k$ is the centroid of cluster $C_k$. The algorithm alternates between assigning points to the nearest centroid and recomputing centroids until convergence.
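As a minimal sketch of this objective, the following fits scikit-learn's KMeans to a small made-up dataset and recomputes J by hand from the fitted labels and centroids. The data, the choice of K=2, and the variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data (assumed for illustration): two visibly separated groups.
X = np.array([[1.0, 1.0], [1.5, 2.0], [1.0, 0.5],
              [8.0, 8.0], [8.5, 9.0], [9.0, 8.0]])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Recompute J = sum_k sum_{x_i in C_k} ||x_i - mu_k||^2 from the fitted model.
J = sum(
    np.sum((X[km.labels_ == k] - km.cluster_centers_[k]) ** 2)
    for k in range(km.n_clusters)
)

# The hand-computed value should agree with the estimator's inertia_ attribute.
print(J, km.inertia_)
```

The `inertia_` attribute is scikit-learn's name for the within-cluster sum of squares, so comparing it with the manual sum is a quick sanity check on the formula.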

Density-based clustering (DBSCAN) defines clusters as contiguous regions of high density separated by regions of low density. A point is a core point if at least min_samples points lie within distance ε. Core points that are within ε of each other are grouped together, and border points are assigned to nearby core points.
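The core, border, and noise roles can be sketched on a tiny hand-made dataset. The coordinates and the values eps=0.5 and min_samples=3 below are assumptions chosen so the neighborhoods are easy to check by eye. Note that in scikit-learn the min_samples count includes the point itself.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two tight 2x2 squares of points plus one isolated point (assumed toy data).
X = np.array([
    [0.0, 0.0], [0.0, 0.3], [0.3, 0.0], [0.3, 0.3],
    [5.0, 5.0], [5.0, 5.3], [5.3, 5.0], [5.3, 5.3],
    [10.0, 0.0],
])

# Within each square every pairwise distance is below 0.5, so each of those
# points has 4 neighbors (itself included) and qualifies as a core point.
db = DBSCAN(eps=0.5, min_samples=3).fit(X)

labels = db.labels_                 # cluster index per point, -1 marks noise
core_idx = db.core_sample_indices_  # indices of the core points

print(labels)
print(core_idx)
```

The isolated point has no neighbor within ε other than itself, so it is neither a core point nor a border point and receives the noise label -1.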

Hierarchical clustering (Agglomerative) builds a dendrogram by iteratively merging the two closest clusters according to a linkage criterion:

  • Single linkage: $d(A,B) = \min_{a \in A,\, b \in B} d(a,b)$
  • Complete linkage: $d(A,B) = \max_{a \in A,\, b \in B} d(a,b)$
  • Ward linkage: minimizes the increase in total within-cluster variance upon merging.
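The single and complete linkage criteria can be written out directly. This is a plain-Python sketch under the assumption of Euclidean point-to-point distance. The helper names `single_linkage` and `complete_linkage` and the two point sets are hypothetical, not library functions.

```python
from math import dist  # Euclidean distance between two points (Python 3.8+)

def single_linkage(A, B):
    """Smallest pairwise distance between a point of A and a point of B."""
    return min(dist(a, b) for a in A for b in B)

def complete_linkage(A, B):
    """Largest pairwise distance between a point of A and a point of B."""
    return max(dist(a, b) for a in A for b in B)

# Two small clusters on the x-axis (assumed toy data).
A = [(0.0, 0.0), (1.0, 0.0)]
B = [(3.0, 0.0), (5.0, 0.0)]

single = single_linkage(A, B)      # closest pair: (1, 0) and (3, 0)
complete = complete_linkage(A, B)  # farthest pair: (0, 0) and (5, 0)
print(single, complete)
```

In scikit-learn these criteria are selected through the `linkage` parameter of `AgglomerativeClustering` (for example `"single"`, `"complete"`, or `"ward"`), so the helpers above are only meant to make the definitions concrete.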

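Spectral clustering uses the leading eigenvectors of the Laplacian of a similarity graph to embed the data in a lower-dimensional space, then applies K-Means (or another assignment step) in that space. Because cluster membership is driven by graph connectivity rather than distance to a centroid, it can recover non-convex clusters.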
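A hedged sketch of this idea on two concentric rings, a shape that centroid-based methods typically split incorrectly. The dataset parameters, the nearest-neighbors affinity, and n_neighbors=10 are assumptions picked for illustration, and scikit-learn may warn that the neighbor graph is not fully connected.

```python
from sklearn.cluster import SpectralClustering
from sklearn.datasets import make_circles

# Two concentric rings: a non-convex cluster structure (assumed toy data).
X, y_true = make_circles(n_samples=200, factor=0.4, noise=0.03, random_state=0)

sc = SpectralClustering(
    n_clusters=2,
    affinity="nearest_neighbors",  # build the similarity graph from k-NN links
    n_neighbors=10,
    assign_labels="kmeans",        # K-Means on the spectral embedding
    random_state=0,
)
labels = sc.fit_predict(X)

# One label per sample, drawn from the two requested clusters.
print(labels[:10])
```

How cleanly the rings separate depends on the graph construction, so the affinity and neighbor count are worth tuning on real data.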

Mean Shift iteratively shifts each data point toward the mode of the local density estimated via a kernel, converging points to cluster centers without specifying the number of clusters.
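As a rough illustration, the sketch below runs scikit-learn's MeanShift on two synthetic blobs without passing a cluster count. The data and the fixed bandwidth of 2.0 are assumptions; in practice the bandwidth can be estimated, for example with `sklearn.cluster.estimate_bandwidth`.

```python
import numpy as np
from sklearn.cluster import MeanShift

# Two well-separated Gaussian blobs (assumed toy data).
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(50, 2)),
    rng.normal(loc=10.0, scale=0.5, size=(50, 2)),
])

# The bandwidth sets the kernel window used to estimate local density.
# No number of clusters is supplied; it follows from the modes that are found.
ms = MeanShift(bandwidth=2.0).fit(X)

n_found = len(ms.cluster_centers_)
print(n_found, ms.cluster_centers_)
```

The number of clusters is an output here rather than an input, but it is sensitive to the bandwidth: a smaller window tends to produce more modes, a larger one fewer.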

BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) incrementally builds a compact summary tree (CF tree) enabling efficient clustering of very large datasets in a single pass.
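The incremental behavior can be sketched by feeding data to scikit-learn's Birch in chunks through `partial_fit`. The synthetic data, the threshold of 0.5, the chunk count, and the shuffling step are all assumptions made for this example.

```python
import numpy as np
from sklearn.cluster import Birch

# Two Gaussian blobs standing in for a larger dataset (assumed toy data).
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0.0, 0.5, size=(100, 2)),
    rng.normal(10.0, 0.5, size=(100, 2)),
])

# Shuffle so that every chunk contains points from both blobs.
X_shuffled = X[rng.permutation(len(X))]

# threshold bounds the radius of each CF-tree subcluster; n_clusters sets the
# final number of clusters produced from those subclusters.
birch = Birch(threshold=0.5, n_clusters=2)
for chunk in np.array_split(X_shuffled, 4):
    birch.partial_fit(chunk)  # each chunk updates the CF tree once

labels = birch.predict(X)
print(np.bincount(labels))
```

Each chunk is seen once and only the tree summary is kept between calls, which is what makes the approach attractive when the full dataset does not fit in memory.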

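Bisecting K-Means starts with all points in one cluster and repeatedly splits a selected cluster (for example the largest one, or the one with the highest within-cluster sum of squares) using K-Means with K=2 until the requested number of clusters is reached, producing a divisive hierarchical clustering.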

OPTICS (Ordering Points To Identify the Clustering Structure) generalizes DBSCAN by producing an ordering of points annotated with reachability distances, enabling extraction of clusters at varying density levels.
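The ordering and reachability distances can be inspected directly, and a DBSCAN-style labeling can be extracted from them for a chosen ε without refitting. The sketch below assumes two synthetic blobs, min_samples=5, and an extraction threshold of eps=2.0.

```python
import numpy as np
from sklearn.cluster import OPTICS, cluster_optics_dbscan

# Two compact Gaussian blobs (assumed toy data).
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0.0, 0.3, size=(30, 2)),
    rng.normal(10.0, 0.3, size=(30, 2)),
])

opt = OPTICS(min_samples=5).fit(X)

# ordering_ is a permutation of the sample indices; reachability_ holds the
# reachability distance of each sample (indexed by sample, not by order).
print(opt.ordering_[:5])
print(opt.reachability_[opt.ordering_][:5])

# Extract a DBSCAN-like clustering at eps=2.0 from the same fitted ordering.
labels_eps2 = cluster_optics_dbscan(
    reachability=opt.reachability_,
    core_distances=opt.core_distances_,
    ordering=opt.ordering_,
    eps=2.0,
)
print(set(labels_eps2.tolist()))
```

Calling `cluster_optics_dbscan` again with a different `eps` reuses the same ordering, which is how one fit can yield clusterings at several density levels.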

Affinity Propagation exchanges messages between data points to simultaneously identify exemplars (cluster centers) and assign points to them, without requiring the number of clusters as input.
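A small sketch with scikit-learn's AffinityPropagation shows that exemplars are actual data points and that no cluster count is passed in. The six-point dataset and the random_state value are assumptions; the number of exemplars found depends on the data and on the `preference` and `damping` settings.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation

# Six points in two columns (assumed toy data).
X = np.array([[1, 2], [1, 4], [1, 0],
              [4, 2], [4, 4], [4, 0]], dtype=float)

# No n_clusters argument: exemplars emerge from the message passing.
ap = AffinityPropagation(random_state=5).fit(X)

exemplar_idx = ap.cluster_centers_indices_  # row indices of the exemplars in X
labels = ap.labels_                         # exemplar assignment per point

print(exemplar_idx)
print(labels)
```

Because the cluster centers are rows of X, they can be read back with `X[exemplar_idx]`, which is useful when a representative real sample is wanted for each group.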
