Principle:Scikit learn Scikit learn Clustering

Knowledge Sources	Scikit_learn Scikit-learn Docs
Domains	Unsupervised Learning, Pattern Recognition
Last Updated	2026-02-08 15:00 GMT

Overview

Clustering is the task of grouping a set of objects such that objects within the same group are more similar to each other than to those in other groups.

Description

Clustering algorithms partition data into meaningful subgroups (clusters) without requiring labeled examples, making them a core unsupervised learning technique. They address the fundamental problem of discovering hidden structure in unlabeled datasets. Different algorithms make different assumptions about cluster shape, density, and connectivity, leading to a rich family of methods suited to different data characteristics. Clustering sits at the intersection of exploratory data analysis, pattern recognition, and data compression.

Usage

Use clustering when you need to discover natural groupings in data without prior labels. Common applications include customer segmentation, document grouping, image segmentation, anomaly detection (via small or singleton clusters), and as a preprocessing step for supervised learning. Choose centroid-based methods (e.g., K-Means) for globular clusters of roughly equal size, density-based methods (e.g., DBSCAN, OPTICS) when clusters have irregular shapes or noise is present, and hierarchical methods (e.g., Agglomerative Clustering) when a nested cluster hierarchy is meaningful.
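The three families named above can be compared on the same data with a short sketch. The synthetic dataset and every hyperparameter value here (`eps=1.5`, `min_samples=5`, `n_clusters=3`) are illustrative assumptions, not recommended defaults.

```python
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.datasets import make_blobs

# Synthetic data: three well-separated globular groups (illustrative only).
X, _ = make_blobs(n_samples=150, centers=[[0, 0], [10, 10], [-10, 10]],
                  cluster_std=0.5, random_state=0)

# Centroid-based: the number of clusters must be given up front.
kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Density-based: no cluster count is given; points labeled -1 are noise.
dbscan_labels = DBSCAN(eps=1.5, min_samples=5).fit_predict(X)

# Hierarchical: merges clusters bottom-up until n_clusters remain.
agglo_labels = AgglomerativeClustering(n_clusters=3, linkage="ward").fit_predict(X)

n_dbscan_clusters = len(set(dbscan_labels) - {-1})
```

On data this clean all three approaches should find the same groups. The differences show up when clusters are elongated, unevenly dense, or mixed with outliers.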

Theoretical Basis

Centroid-based clustering (K-Means) minimizes within-cluster sum of squares:

$J = \sum_{k = 1}^{K} \sum_{x_{i} \in C_{k}} ‖ x_{i} - μ_{k} ‖^{2}$

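where $μ_{k}$ is the centroid of cluster $C_{k}$. The algorithm alternates between assigning points to the nearest centroid and recomputing centroids until the assignments stop changing. It converges to a local minimum of $J$, so the result depends on the initial centroids.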
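The alternation between assignment and update can be sketched in plain Python. This is a minimal illustration of the objective $J$, not scikit-learn's implementation. The helper names (`sq_dist`, `assign`, `kmeans`) and the handling of empty clusters are assumptions made for the example.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two coordinate tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def assign(points, centroids):
    """Assignment step: index of the nearest centroid for every point."""
    return [min(range(len(centroids)), key=lambda k: sq_dist(p, centroids[k]))
            for p in points]


def kmeans(points, init_centroids, max_iter=100):
    centroids = [tuple(float(c) for c in mu) for mu in init_centroids]
    for _ in range(max_iter):
        labels = assign(points, centroids)
        # Update step: each centroid becomes the mean of its members.
        new_centroids = []
        for k, mu in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == k]
            # An empty cluster keeps its previous centroid (a simplification).
            new_centroids.append(
                tuple(sum(c) / len(members) for c in zip(*members)) if members else mu)
        if new_centroids == centroids:
            break  # centroids stopped moving
        centroids = new_centroids
    labels = assign(points, centroids)
    # Within-cluster sum of squares for the final assignment.
    J = sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))
    return labels, centroids, J


points = [(0, 0), (0, 2), (10, 0), (10, 2)]
labels, centroids, J = kmeans(points, init_centroids=[(0, 0), (10, 0)])
```

Here each point ends up at squared distance 1 from its centroid, so $J = 4$.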

Density-based clustering (DBSCAN) defines clusters as contiguous regions of high density separated by regions of low density. A point is a core point if at least $\text{min\_samples}$ points lie within distance $ε$ of it (in scikit-learn, the count includes the point itself). Core points that are within $ε$ of each other are grouped into the same cluster, border points (non-core points within $ε$ of a core point) are assigned to the cluster of a nearby core point, and all remaining points are labeled as noise.
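The core and border definitions can be made concrete with a small plain-Python sketch. The function name `classify_points` and the sample data are hypothetical. The neighborhood is assumed to include the point itself, and points that are neither core nor within $ε$ of a core point are reported as noise.

```python
import math


def classify_points(points, eps, min_samples):
    """Label each point 'core', 'border', or 'noise' under DBSCAN's definitions."""
    # Neighborhood of i: indices of all points within eps, including i itself.
    neigh = [[j for j, q in enumerate(points) if math.dist(p, q) <= eps]
             for p in points]
    is_core = [len(n) >= min_samples for n in neigh]
    kinds = []
    for i in range(len(points)):
        if is_core[i]:
            kinds.append("core")
        elif any(is_core[j] for j in neigh[i]):
            kinds.append("border")  # not dense itself, but reachable from a core point
        else:
            kinds.append("noise")
    return kinds


points = [(0, 0), (0, 1), (1, 0), (1, 1), (2.4, 1), (10, 10)]
kinds = classify_points(points, eps=1.5, min_samples=4)
```

The four corners of the unit square are mutually within $ε$, so each is a core point. The point at `(2.4, 1)` has only one neighbor besides itself but that neighbor is a core point, and `(10, 10)` has no neighbors at all.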

Hierarchical clustering (Agglomerative) builds a dendrogram by iteratively merging the two closest clusters according to a linkage criterion:

Single linkage: $d (A, B) = \min_{a \in A, b \in B} d (a, b)$
Complete linkage: $d (A, B) = \max_{a \in A, b \in B} d (a, b)$
Ward linkage: minimizes the increase in total within-cluster variance upon merging.
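The linkage criteria above can be computed directly for two small clusters. This is a sketch with hypothetical helper names. The Ward cost is measured here as the increase in the sum of squared distances to the centroid, which is an assumption about the exact normalization of "variance".

```python
import math
from itertools import product


def single_linkage(A, B):
    """Smallest pairwise distance between the two clusters."""
    return min(math.dist(a, b) for a, b in product(A, B))


def complete_linkage(A, B):
    """Largest pairwise distance between the two clusters."""
    return max(math.dist(a, b) for a, b in product(A, B))


def wcss(C):
    """Sum of squared distances from each point in C to the centroid of C."""
    centroid = [sum(c) / len(C) for c in zip(*C)]
    return sum(sum((x - m) ** 2 for x, m in zip(p, centroid)) for p in C)


def ward_increase(A, B):
    """Increase in within-cluster sum of squares if A and B were merged."""
    return wcss(A + B) - wcss(A) - wcss(B)


A = [(0, 0), (0, 1)]
B = [(3, 0), (3, 1)]
d_single = single_linkage(A, B)      # closest pair: (0, 0) and (3, 0)
d_complete = complete_linkage(A, B)  # farthest pair: (0, 0) and (3, 1)
d_ward = ward_increase(A, B)
```

At each step, an agglomerative procedure would merge the pair of clusters with the smallest value of the chosen criterion.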

Spectral clustering uses the eigenvectors of a similarity graph Laplacian (those associated with its smallest eigenvalues) to embed the data in a lower-dimensional space before applying K-Means, enabling discovery of non-convex clusters.
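A minimal NumPy sketch of the embedding step for two clusters follows. The Gaussian similarity, the value `sigma = 2.0`, the unnormalized Laplacian, and the toy data are all assumptions for illustration. For brevity, the final K-Means step is replaced by thresholding the sign of the second eigenvector (the Fiedler vector), which is enough when only two clusters are sought.

```python
import numpy as np

# Toy data: two small groups (illustrative only).
X = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]], dtype=float)

# Similarity graph: Gaussian kernel on pairwise squared distances.
sigma = 2.0
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
W = np.exp(-sq_dists / (2 * sigma ** 2))
np.fill_diagonal(W, 0.0)  # no self-loops

# Unnormalized graph Laplacian L = D - W, with D the diagonal degree matrix.
L = np.diag(W.sum(axis=1)) - W

# eigh returns eigenvalues in ascending order; column 1 is the Fiedler vector.
eigvals, eigvecs = np.linalg.eigh(L)
fiedler = eigvecs[:, 1]

# Two-way split by sign, standing in for K-Means on the embedding.
labels = (fiedler > 0).astype(int)
```

Which group receives label 0 or 1 is arbitrary, because an eigenvector is only defined up to sign.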

Mean Shift iteratively shifts each data point toward the mode of the local density estimated via a kernel, converging points to cluster centers without specifying the number of clusters.
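The shifting procedure can be sketched in one dimension with a flat kernel, where each step moves a point to the mean of the data within a fixed window. The function name `mean_shift_1d`, the bandwidth, and the data are hypothetical, and the flat kernel is an assumption of this sketch.

```python
def mean_shift_1d(xs, bandwidth, max_iter=100, tol=1e-6):
    """Shift every point to the local mean until it stops moving."""
    modes = []
    for x in xs:
        m = x
        for _ in range(max_iter):
            # Flat kernel: all data points within the bandwidth weigh equally.
            window = [v for v in xs if abs(v - m) <= bandwidth]
            new_m = sum(window) / len(window)
            if abs(new_m - m) < tol:
                break
            m = new_m
        modes.append(m)
    return modes


xs = [0.8, 1.0, 1.2, 4.8, 5.0, 5.2]
modes = mean_shift_1d(xs, bandwidth=1.0)

# Points that converge to (numerically) the same mode share a cluster.
centers = sorted(set(round(m, 3) for m in modes))
```

The number of clusters is never supplied. It falls out of how many distinct modes the points converge to, which in turn depends on the bandwidth.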

BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) incrementally builds a compact summary tree (CF tree) enabling efficient clustering of very large datasets in a single pass.

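Bisecting K-Means starts from a single cluster containing all points and repeatedly splits one cluster in two using K-Means with $K = 2$ until the requested number of clusters is reached, producing a divisive hierarchical clustering. The cluster to split is chosen by a criterion such as largest inertia or largest number of points.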

OPTICS (Ordering Points To Identify the Clustering Structure) generalizes DBSCAN by producing an ordering of points annotated with reachability distances, enabling extraction of clusters at varying density levels.

Affinity Propagation exchanges messages between data points to simultaneously identify exemplars (cluster centers) and assign points to them, without requiring the number of clusters as input.
