Heuristic:Scikit learn contrib Imbalanced learn KNeighbors Selection Tips
| Knowledge Sources | |
|---|---|
| Domains | Imbalanced_Classification, Over_Sampling |
| Last Updated | 2026-02-09 03:00 GMT |
Overview
SMOTE internally adds 1 to `k_neighbors` to account for self-inclusion in the nearest-neighbor query; KMeansSMOTE silently skips clusters whose anticipated sample count is below the neighbor count.
Description
The `k_neighbors` parameter in SMOTE-based algorithms controls how many nearest neighbors are used to generate synthetic samples. Two non-obvious behaviors affect users: (1) BaseSMOTE adds 1 to the user-specified `k_neighbors` internally because the k-nearest-neighbors search includes the query sample itself, meaning `k_neighbors=5` actually queries 6 neighbors; (2) KMeansSMOTE silently skips clusters where the anticipated number of samples is fewer than `nn_k_.n_neighbors`, producing no synthetic samples for those clusters without any warning.
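The self-inclusion behavior is easy to see with a plain nearest-neighbor search. The sketch below is my own illustration (not imblearn code): when the query set equals the training set, the closest "neighbor" of every sample is the sample itself at distance 0, so querying `k + 1` points yields exactly `k` real neighbors.

```python
# Sketch: why a k-NN query must ask for k+1 points when querying the
# training set itself -- the nearest "neighbor" is the query point.

def knn_indices(points, query_idx, n_neighbors):
    """Return indices of the n_neighbors points closest to points[query_idx]."""
    q = points[query_idx]
    ranked = sorted(range(len(points)),
                    key=lambda i: sum((a - b) ** 2 for a, b in zip(points[i], q)))
    return ranked[:n_neighbors]

minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (2.0, 2.0)]

k = 2
raw = knn_indices(minority, 0, k + 1)   # query k+1, mirroring BaseSMOTE's +1
assert raw[0] == 0                      # self is always the first hit
true_neighbors = raw[1:]                # exactly k other neighbors remain
assert len(true_neighbors) == k
```

This is why `k_neighbors=5` effectively queries 6 points, and why the minority class needs at least 6 samples for the default setting to work.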
Usage
Apply this heuristic when configuring SMOTE, BorderlineSMOTE, SVMSMOTE, or KMeansSMOTE. Pay particular attention when working with small minority classes where the number of samples may be close to `k_neighbors`. For KMeansSMOTE, monitor whether clusters are being silently skipped by checking the output sample counts.
The Insight (Rule of Thumb)
- Action: Ensure the minority class has at least `k_neighbors + 1` samples. For KMeansSMOTE, ensure each relevant cluster has enough samples relative to `k_neighbors`.
- Value: Default `k_neighbors=5` requires at least 6 minority samples. Reduce `k_neighbors` for very small minority classes.
- Trade-off: Lower `k_neighbors` produces less diverse synthetic samples (closer to existing points). Higher `k_neighbors` requires more minority samples and produces more varied synthetic data.
- Silent failure: KMeansSMOTE skips clusters with insufficient samples without warning. If all clusters are skipped, a `RuntimeError` is raised: "No clusters found with sufficient samples."
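The minority-size rule above can be checked before fitting. This helper is a minimal pre-flight sketch of my own (not part of imblearn's API): with `k_neighbors=k`, the smallest class needs at least `k + 1` samples, because each sample must have `k` neighbors besides itself.

```python
# Sketch: largest k_neighbors value that the smallest class in y can support.
from collections import Counter

def max_safe_k_neighbors(y):
    """Return the largest usable k_neighbors given class counts in y."""
    smallest = min(Counter(y).values())
    return smallest - 1  # each sample needs k neighbors besides itself

y = ["maj"] * 50 + ["min"] * 4        # only 4 minority samples
assert max_safe_k_neighbors(y) == 3   # the default k_neighbors=5 would fail here
```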
Reasoning
The +1 neighbor addition is by design: SMOTE's algorithm selects a random neighbor from the k-nearest-neighbors of each minority sample, but the kNN search includes the sample itself in the result set. By querying k+1 neighbors, the algorithm retrieves exactly k other neighbors after excluding self.
For KMeansSMOTE, the cluster-level check prevents generating degenerate synthetic samples in sparse clusters. However, the silent skip can be surprising when users expect all clusters to contribute samples.
Code Evidence
Internal +1 neighbor addition from `imblearn/over_sampling/_smote/base.py:61-66`:
```python
def _validate_estimator(self):
    """Check the NN estimators shared across the different SMOTE
    algorithms.
    """
    self.nn_k_ = check_neighbors_object(
        "k_neighbors", self.k_neighbors, additional_neighbor=1
    )
```
KMeansSMOTE silent cluster skip from `imblearn/over_sampling/_smote/cluster.py:254-257`:
```python
# not enough samples to apply SMOTE
anticipated_samples = cluster_class_mean * X_cluster.shape[0]
if anticipated_samples < self.nn_k_.n_neighbors:
    continue
```
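To make the skip condition concrete, here is a hedged re-implementation of just that decision (names mirror the snippet; this is a sketch, not imblearn's code path): a cluster contributes only when `cluster_class_mean * n_cluster_samples` reaches the internal `k_neighbors + 1` count.

```python
# Sketch of KMeansSMOTE's skip decision for a single cluster.

def cluster_is_used(cluster_class_mean, n_cluster_samples, k_neighbors):
    nn_k = k_neighbors + 1                            # BaseSMOTE's internal +1
    anticipated = cluster_class_mean * n_cluster_samples
    return anticipated >= nn_k                        # below this, the cluster is skipped

# With default k_neighbors=5 (so nn_k = 6), a cluster of 8 samples that is
# 60% minority anticipates 4.8 minority samples and is silently skipped:
assert not cluster_is_used(0.6, 8, 5)
assert cluster_is_used(0.9, 10, 5)   # 9.0 >= 6: this cluster contributes
```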
Cluster balance threshold check from `imblearn/over_sampling/_smote/cluster.py:245-252`:
```python
if self.cluster_balance_threshold_ == "auto":
    balance_threshold = n_samples / total_inp_samples / 2
else:
    balance_threshold = self.cluster_balance_threshold_

# the cluster is already considered balanced
if cluster_class_mean < balance_threshold:
    continue
```
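A quick worked example of the `"auto"` formula above (a sketch only; I am assuming `n_samples` is the count for the class being resampled and `total_inp_samples` the total input size, per the snippet's names): clusters whose minority fraction falls below half the overall class ratio are treated as already balanced and skipped.

```python
# Sketch of the "auto" balance threshold arithmetic shown above.

def auto_balance_threshold(n_samples, total_inp_samples):
    return n_samples / total_inp_samples / 2

# Assumed example: 100 class samples out of 1000 total inputs gives a
# threshold of 0.05, so clusters under 5% minority are left alone.
assert auto_balance_threshold(100, 1000) == 0.05
```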
k_neighbors parameter constraint from `imblearn/over_sampling/_smote/base.py:43-48`:
```python
_parameter_constraints: dict = {
    **BaseOverSampler._parameter_constraints,
    "k_neighbors": [
        Interval(numbers.Integral, 1, None, closed="left"),
        HasMethods(["kneighbors", "kneighbors_graph"]),
    ],
}
```
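The constraint means `k_neighbors` accepts either an integer of at least 1 or any estimator-like object exposing both `kneighbors` and `kneighbors_graph`. The duck-typing check below is my own re-implementation of that rule for illustration, not imblearn's validator, and `FakeNN` is a hypothetical stand-in estimator.

```python
# Sketch: what the _parameter_constraints entry above permits.

def is_valid_k_neighbors(value):
    if isinstance(value, int) and not isinstance(value, bool):
        return value >= 1                          # Interval(..., 1, None, closed="left")
    return all(callable(getattr(value, m, None))   # HasMethods([...])
               for m in ("kneighbors", "kneighbors_graph"))

class FakeNN:                                      # hypothetical duck-typed estimator
    def kneighbors(self): ...
    def kneighbors_graph(self): ...

assert is_valid_k_neighbors(5)
assert not is_valid_k_neighbors(0)
assert is_valid_k_neighbors(FakeNN())
```

Passing a custom estimator this way lets users control the neighbor search (e.g. metric or algorithm choice) instead of the integer shorthand.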
Related Pages
- Implementation:Scikit_learn_contrib_Imbalanced_learn_SMOTE
- Implementation:Scikit_learn_contrib_Imbalanced_learn_BorderlineSMOTE
- Implementation:Scikit_learn_contrib_Imbalanced_learn_SVMSMOTE
- Implementation:Scikit_learn_contrib_Imbalanced_learn_KMeansSMOTE
- Implementation:Scikit_learn_contrib_Imbalanced_learn_ADASYN
- Principle:Scikit_learn_contrib_Imbalanced_learn_Synthetic_Minority_Oversampling
- Principle:Scikit_learn_contrib_Imbalanced_learn_Cluster_Based_Oversampling