Heuristic:Scikit learn contrib Imbalanced learn KNeighbors Selection Tips
| Knowledge Sources | |
|---|---|
| Domains | Imbalanced_Classification, Over_Sampling |
| Last Updated | 2026-02-09 03:00 GMT |
Overview
SMOTE internally adds 1 to `k_neighbors` to account for self-inclusion in the nearest-neighbor query; KMeansSMOTE silently skips clusters whose anticipated sample count is below the neighbor count.
Description
The `k_neighbors` parameter in SMOTE-based algorithms controls how many nearest neighbors are used to generate synthetic samples. Two non-obvious behaviors affect users: (1) BaseSMOTE adds 1 to the user-specified `k_neighbors` internally because the k-nearest-neighbors search includes the query sample itself, meaning `k_neighbors=5` actually queries 6 neighbors; (2) KMeansSMOTE silently skips clusters where the anticipated number of samples is fewer than `nn_k_.n_neighbors`, producing no synthetic samples for those clusters without any warning.
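The self-inclusion behavior is easy to see with a plain nearest-neighbor search. The sketch below is my own illustration (not imblearn code): when the query set equals the training set, the closest "neighbor" of every sample is the sample itself at distance 0, so querying `k + 1` points yields exactly `k` real neighbors.

```python
# Sketch: why a k-NN query must ask for k+1 points when querying the
# training set itself -- the nearest "neighbor" is the query point.

def knn_indices(points, query_idx, n_neighbors):
    """Return indices of the n_neighbors points closest to points[query_idx]."""
    q = points[query_idx]
    ranked = sorted(range(len(points)),
                    key=lambda i: sum((a - b) ** 2 for a, b in zip(points[i], q)))
    return ranked[:n_neighbors]

minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (2.0, 2.0)]

k = 2
raw = knn_indices(minority, 0, k + 1)   # query k+1, mirroring BaseSMOTE's +1
assert raw[0] == 0                      # self is always the first hit
true_neighbors = raw[1:]                # exactly k other neighbors remain
assert len(true_neighbors) == k
```

This is why `k_neighbors=5` effectively queries 6 points, and why the minority class needs at least 6 samples for the default setting to work.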
Usage
Apply this heuristic when configuring SMOTE, BorderlineSMOTE, SVMSMOTE, or KMeansSMOTE. Pay particular attention when working with small minority classes where the number of samples may be close to `k_neighbors`. For KMeansSMOTE, monitor whether clusters are being silently skipped by checking the output sample counts.
The Insight (Rule of Thumb)
- Action: Ensure the minority class has at least `k_neighbors + 1` samples. For KMeansSMOTE, ensure each relevant cluster has enough samples relative to `k_neighbors`.
- Value: Default `k_neighbors=5` requires at least 6 minority samples. Reduce `k_neighbors` for very small minority classes.
- Trade-off: Lower `k_neighbors` produces less diverse synthetic samples (closer to existing points). Higher `k_neighbors` requires more minority samples and produces more varied synthetic data.
- Silent failure: KMeansSMOTE skips clusters with insufficient samples without warning. If all clusters are skipped, a `RuntimeError` is raised: "No clusters found with sufficient samples."
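The minority-size rule above can be checked before fitting. This helper is a minimal pre-flight sketch of my own (not part of imblearn's API): with `k_neighbors=k`, the smallest class needs at least `k + 1` samples, because each sample must have `k` neighbors besides itself.

```python
# Sketch: largest k_neighbors value that the smallest class in y can support.
from collections import Counter

def max_safe_k_neighbors(y):
    """Return the largest usable k_neighbors given class counts in y."""
    smallest = min(Counter(y).values())
    return smallest - 1  # each sample needs k neighbors besides itself

y = ["maj"] * 50 + ["min"] * 4        # only 4 minority samples
assert max_safe_k_neighbors(y) == 3   # the default k_neighbors=5 would fail here
```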
Reasoning
The +1 neighbor addition is by design: SMOTE's algorithm selects a random neighbor from the k-nearest-neighbors of each minority sample, but the kNN search includes the sample itself in the result set. By querying k+1 neighbors, the algorithm retrieves exactly k other neighbors after excluding self.
For KMeansSMOTE, the cluster-level check prevents generating degenerate synthetic samples in sparse clusters. However, the silent skip can be surprising when users expect all clusters to contribute samples.
Code Evidence
Internal +1 neighbor addition from `imblearn/over_sampling/_smote/base.py:61-66`:
```python
def _validate_estimator(self):
    """Check the NN estimators shared across the different SMOTE
    algorithms.
    """
    self.nn_k_ = check_neighbors_object(
        "k_neighbors", self.k_neighbors, additional_neighbor=1
    )
```
KMeansSMOTE silent cluster skip from `imblearn/over_sampling/_smote/cluster.py:254-257`:
```python
# not enough samples to apply SMOTE
anticipated_samples = cluster_class_mean * X_cluster.shape[0]
if anticipated_samples < self.nn_k_.n_neighbors:
    continue
```
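To make the skip condition concrete, here is a hedged re-implementation of just that decision (names mirror the snippet; this is a sketch, not imblearn's code path): a cluster contributes only when `cluster_class_mean * n_cluster_samples` reaches the internal `k_neighbors + 1` count.

```python
# Sketch of KMeansSMOTE's skip decision for a single cluster.

def cluster_is_used(cluster_class_mean, n_cluster_samples, k_neighbors):
    nn_k = k_neighbors + 1                            # BaseSMOTE's internal +1
    anticipated = cluster_class_mean * n_cluster_samples
    return anticipated >= nn_k                        # below this, the cluster is skipped

# With default k_neighbors=5 (so nn_k = 6), a cluster of 8 samples that is
# 60% minority anticipates 4.8 minority samples and is silently skipped:
assert not cluster_is_used(0.6, 8, 5)
assert cluster_is_used(0.9, 10, 5)   # 9.0 >= 6: this cluster contributes
```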
Cluster balance threshold check from `imblearn/over_sampling/_smote/cluster.py:245-252`:
```python
if self.cluster_balance_threshold_ == "auto":
    balance_threshold = n_samples / total_inp_samples / 2
else:
    balance_threshold = self.cluster_balance_threshold_

# the cluster is already considered balanced
if cluster_class_mean < balance_threshold:
    continue
```
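A quick worked example of the `"auto"` formula above (a sketch only; I am assuming `n_samples` is the count for the class being resampled and `total_inp_samples` the total input size, per the snippet's names): clusters whose minority fraction falls below half the overall class ratio are treated as already balanced and skipped.

```python
# Sketch of the "auto" balance threshold arithmetic shown above.

def auto_balance_threshold(n_samples, total_inp_samples):
    return n_samples / total_inp_samples / 2

# Assumed example: 100 class samples out of 1000 total inputs gives a
# threshold of 0.05, so clusters under 5% minority are left alone.
assert auto_balance_threshold(100, 1000) == 0.05
```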
k_neighbors parameter constraint from `imblearn/over_sampling/_smote/base.py:43-48`:
```python
_parameter_constraints: dict = {
    **BaseOverSampler._parameter_constraints,
    "k_neighbors": [
        Interval(numbers.Integral, 1, None, closed="left"),
        HasMethods(["kneighbors", "kneighbors_graph"]),
    ],
}
```
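The constraint means `k_neighbors` accepts either an integer of at least 1 or any estimator-like object exposing both `kneighbors` and `kneighbors_graph`. The duck-typing check below is my own re-implementation of that rule for illustration, not imblearn's validator, and `FakeNN` is a hypothetical stand-in estimator.

```python
# Sketch: what the _parameter_constraints entry above permits.

def is_valid_k_neighbors(value):
    if isinstance(value, int) and not isinstance(value, bool):
        return value >= 1                          # Interval(..., 1, None, closed="left")
    return all(callable(getattr(value, m, None))   # HasMethods([...])
               for m in ("kneighbors", "kneighbors_graph"))

class FakeNN:                                      # hypothetical duck-typed estimator
    def kneighbors(self): ...
    def kneighbors_graph(self): ...

assert is_valid_k_neighbors(5)
assert not is_valid_k_neighbors(0)
assert is_valid_k_neighbors(FakeNN())
```

Passing a custom estimator this way lets users control the neighbor search (e.g. metric or algorithm choice) instead of the integer shorthand.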
Related Pages
- Implementation:Scikit_learn_contrib_Imbalanced_learn_SMOTE
- Implementation:Scikit_learn_contrib_Imbalanced_learn_BorderlineSMOTE
- Implementation:Scikit_learn_contrib_Imbalanced_learn_SVMSMOTE
- Implementation:Scikit_learn_contrib_Imbalanced_learn_KMeansSMOTE
- Implementation:Scikit_learn_contrib_Imbalanced_learn_ADASYN
- Principle:Scikit_learn_contrib_Imbalanced_learn_Synthetic_Minority_Oversampling
- Principle:Scikit_learn_contrib_Imbalanced_learn_Cluster_Based_Oversampling