Implementation:Scikit learn Scikit learn KMeans
| Knowledge Sources | |
|---|---|
| Domains | Clustering, Centroid-Based Clustering |
| Last Updated | 2026-02-08 15:00 GMT |
Overview
Concrete tool for performing K-Means clustering provided by scikit-learn.
Description
KMeans is one of the most widely used clustering algorithms. It partitions n samples into k clusters by iteratively assigning each sample to the nearest cluster center and recomputing cluster centers as the mean of assigned samples. The implementation supports both Lloyd's and Elkan's algorithms and provides smart initialization via k-means++ for faster convergence. It inherits from _BaseKMeans and also implements the TransformerMixin interface for transforming data to cluster-distance space.
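The assign-then-recompute loop described above can be sketched in plain NumPy. This is a minimal illustration of Lloyd's iteration, not scikit-learn's implementation: the function name `lloyd_kmeans` is hypothetical, the starting centers are passed in by hand rather than chosen by k-means++, and the stopping rule compares the squared center shift to `tol` directly, which is a simplification.

```python
import numpy as np

def lloyd_kmeans(X, init_centers, max_iter=300, tol=1e-4):
    """Illustrative Lloyd iteration (hypothetical helper, not sklearn code)."""
    X = np.asarray(X, dtype=float)
    centers = np.asarray(init_centers, dtype=float).copy()
    for n_iter in range(1, max_iter + 1):
        # Assignment step: squared distance from every sample to every center.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Update step: move each center to the mean of its assigned samples.
        new_centers = centers.copy()
        for k in range(len(centers)):
            members = X[labels == k]
            if len(members):  # an empty cluster keeps its previous center
                new_centers[k] = members.mean(axis=0)
        shift = ((new_centers - centers) ** 2).sum()
        centers = new_centers
        if shift <= tol:  # simplified absolute stopping rule (assumption)
            break
    # Final assignment against the converged centers.
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    inertia = d2.min(axis=1).sum()
    return centers, labels, inertia, n_iter

X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])
centers, labels, inertia, n_iter = lloyd_kmeans(X, init_centers=[[1, 4], [10, 0]])
print(centers, labels, inertia, n_iter)
```

In practice the estimator should be preferred; the sketch only makes the two alternating steps explicit.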
Usage
Use KMeans when you need a fast, general-purpose clustering algorithm and the number of clusters is known in advance. It works best when clusters are roughly spherical and of similar size. It is commonly used as a baseline clustering method and scales well to large datasets. When the dataset is so large that full-batch updates become too slow, consider MiniBatchKMeans, which updates the centers from small random batches and trades some cluster quality for speed.
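As a rough sketch of the MiniBatchKMeans alternative, the snippet below clusters synthetic data drawn from two well-separated Gaussian blobs. The data, the `batch_size` of 256, and `n_init=3` are illustrative assumptions rather than recommended settings.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

# Synthetic data (assumption): two blobs centered near (0, 0) and (5, 5).
rng = np.random.default_rng(0)
X_large = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(500, 2)),
    rng.normal(loc=5.0, scale=0.5, size=(500, 2)),
])

# Centers are updated from mini-batches instead of the full dataset.
mbk = MiniBatchKMeans(n_clusters=2, batch_size=256, n_init=3, random_state=0)
mbk.fit(X_large)

print(mbk.cluster_centers_)
```

MiniBatchKMeans exposes the same fitted attributes as KMeans (`cluster_centers_`, `labels_`, `inertia_`), so it can usually be swapped in without changing downstream code.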
Code Reference
Source Location
- Repository: scikit-learn
- File: sklearn/cluster/_kmeans.py
Signature
class KMeans(_BaseKMeans):
    def __init__(
        self,
        n_clusters=8,
        *,
        init="k-means++",
        n_init="auto",
        max_iter=300,
        tol=1e-4,
        verbose=0,
        random_state=None,
        copy_x=True,
        algorithm="lloyd",
    ):
Import
from sklearn.cluster import KMeans
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| n_clusters | int | No | Number of clusters to form and centroids to generate. Default is 8. |
| init | str, callable, or array-like | No | Initialization method: "k-means++", "random", array of shape (n_clusters, n_features), or callable. Default is "k-means++". |
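| n_init | "auto" or int | No | Number of times k-means is run with different centroid seeds; the run with the lowest inertia is kept. With "auto", the number of runs depends on init: 1 for "k-means++" or an array, 10 for "random" or a callable. Default is "auto". |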
| max_iter | int | No | Maximum iterations per single run. Default is 300. |
| tol | float | No | Relative tolerance for convergence based on Frobenius norm of center changes. Default is 1e-4. |
| verbose | int | No | Verbosity mode. Default is 0. |
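| random_state | int, RandomState instance, or None | No | Controls random number generation for centroid initialization. Pass an int for reproducible results. Default is None. |
| copy_x | bool | No | Whether to copy the input data before it is centered for distance computation. If False, the data is centered in place and restored afterward, which may introduce small numerical differences. Default is True. |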
| algorithm | str | No | K-Means algorithm to use: "lloyd" or "elkan". Default is "lloyd". |
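The following sketch shows several of these parameters set explicitly. The starting centers in `init_centers` are an arbitrary choice for illustration; passing an array for `init` makes the run deterministic, so `n_init=1` is used.

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]], dtype=float)

# Explicit starting centers: shape (n_clusters, n_features).
init_centers = np.array([[1.0, 4.0], [10.0, 0.0]])

km = KMeans(
    n_clusters=2,
    init=init_centers,   # array init instead of "k-means++"
    n_init=1,            # a single run; the start is fixed anyway
    max_iter=100,
    tol=1e-4,
    algorithm="elkan",   # alternative to the default "lloyd"
)
km.fit(X)

print(km.cluster_centers_)
print(km.n_iter_)
```

Both algorithm choices optimize the same objective; Elkan's variant uses the triangle inequality to skip some distance computations, at the cost of extra memory.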
Outputs
| Name | Type | Description |
|---|---|---|
| cluster_centers_ | ndarray of shape (n_clusters, n_features) | Coordinates of cluster centers. |
| labels_ | ndarray of shape (n_samples,) | Label of each sample (index of closest center). |
| inertia_ | float | Sum of squared distances of samples to their closest cluster center. |
| n_iter_ | int | Number of iterations run. |
| n_features_in_ | int | Number of features seen during fit. |
Usage Examples
Basic Usage
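from sklearn.cluster import KMeans
import numpy as np

# Six 2-D samples forming two visually separated groups.
X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])

# Fit two clusters; random_state fixes the k-means++ seeding.
kmeans = KMeans(n_clusters=2, random_state=0, n_init="auto").fit(X)

print(kmeans.labels_)                       # cluster index of each training sample
print(kmeans.cluster_centers_)              # one row per cluster center
print(kmeans.predict([[0, 0], [12, 3]]))    # nearest center for new samples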