Implementation:Scikit learn contrib Imbalanced learn NeighbourhoodCleaningRule
| Knowledge Sources | |
|---|---|
| Domains | Machine_Learning, Data_Preprocessing, Imbalanced_Learning |
| Last Updated | 2026-02-09 03:00 GMT |
Overview
A concrete under-sampling tool from the imbalanced-learn library that implements the Neighbourhood Cleaning Rule (NCR).
Description
The NeighbourhoodCleaningRule class implements a two-phase cleaning approach that combines Edited Nearest Neighbours (ENN) with a K-Nearest Neighbours (K-NN) classifier to remove noisy majority-class samples. It extends BaseCleaningSampler. Phase 1 applies ENN to identify majority samples that disagree with their neighbourhood. Phase 2 uses the K-NN classifier to find minority samples that are misclassified, then flags the majority samples among their nearest neighbours, restricted to the classes that pass the threshold_cleaning check. The union of both flagged sets is removed.
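The two phases can be sketched with scikit-learn alone. The following is a simplified illustration for a binary problem, not the library's implementation: the helper name `ncr_keep_indices` is hypothetical, Phase 1 approximates ENN with a plain majority vote over the k nearest neighbours, and the `threshold_cleaning` check is omitted.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier, NearestNeighbors


def ncr_keep_indices(X, y, n_neighbors=3):
    """Hypothetical helper: indices kept by a simplified binary NCR."""
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    is_majority = y != minority

    # Phase 1 (ENN-like): flag majority samples whose label disagrees with
    # most of their k nearest neighbours. Calling kneighbors() without X
    # excludes each training point from its own neighbour list.
    nn = NearestNeighbors(n_neighbors=n_neighbors).fit(X)
    neigh = nn.kneighbors(return_distance=False)
    agreement = (y[neigh] == y[:, None]).mean(axis=1)
    phase1 = is_majority & (agreement < 0.5)

    # Phase 2: find minority samples that a K-NN classifier misclassifies,
    # then flag the majority samples among their nearest neighbours.
    knn = KNeighborsClassifier(n_neighbors=n_neighbors).fit(X, y)
    minority_idx = np.flatnonzero(~is_majority)
    wrong = minority_idx[knn.predict(X[minority_idx]) != minority]
    phase2 = np.zeros(len(y), dtype=bool)
    if wrong.size:
        candidates = np.unique(knn.kneighbors(X[wrong], return_distance=False))
        phase2[candidates[is_majority[candidates]]] = True

    # Remove the union of both flagged sets; minority samples are never removed.
    return np.flatnonzero(~(phase1 | phase2))


# Illustrative synthetic data (parameters chosen arbitrarily for the sketch)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0,
    weights=[0.2, 0.8], class_sep=0.8, random_state=0,
)
keep = ncr_keep_indices(X_demo, y_demo)
print(f"kept {keep.size} of {y_demo.size} samples")
```

The sketch returns indices rather than resampled arrays, which mirrors how the fitted sampler exposes `sample_indices_`.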
Usage
Import this class when you need a neighbourhood-based cleaning approach that not only removes noisy majority samples (via ENN) but also targets majority samples that are harmful to minority class recognition.
Code Reference
Source Location
- Repository: imbalanced-learn
- File: imblearn/under_sampling/_prototype_selection/_neighbourhood_cleaning_rule.py
- Lines: L1-239
Signature
class NeighbourhoodCleaningRule(BaseCleaningSampler):
    def __init__(
        self,
        *,
        sampling_strategy="auto",
        edited_nearest_neighbours=None,
        n_neighbors=3,
        threshold_cleaning=0.5,
        n_jobs=None,
    ):
        """
        Args:
            sampling_strategy: str, list, or callable - Classes targeted
                by the cleaning. 'auto' targets every class except the
                minority class.
            edited_nearest_neighbours: EditedNearestNeighbours or None -
                Custom ENN object for Phase 1 cleaning. Defaults to ENN
                with kind_sel="mode" and the specified n_neighbors.
            n_neighbors: int or KNeighborsMixin estimator - Number of
                nearest neighbours for the K-NN classifier in Phase 2
                (default: 3).
            threshold_cleaning: float - Threshold for deciding which
                classes to clean in Phase 2. A class is cleaned when
                its size > minority_size * threshold (default: 0.5).
            n_jobs: int or None - Number of parallel jobs.
        """
Import
from imblearn.under_sampling import NeighbourhoodCleaningRule
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| X | {array-like, sparse matrix, dataframe} of shape (n_samples, n_features) | Yes | Feature matrix of training data |
| y | array-like of shape (n_samples,) | Yes | Target labels |
| sampling_strategy | str, list, or callable | No | Classes targeted by the cleaning (default: 'auto') |
| edited_nearest_neighbours | EditedNearestNeighbours or None | No | Custom ENN object for Phase 1 (default: None) |
| n_neighbors | int or KNeighborsMixin estimator | No | Neighbours for K-NN classifier (default: 3) |
| threshold_cleaning | float | No | Class size threshold for Phase 2 cleaning (default: 0.5) |
| n_jobs | int or None | No | Number of parallel jobs (default: None) |
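The `threshold_cleaning` rule quoted in the signature docstring (a class is cleaned when its size exceeds `minority_size * threshold`) reduces to simple arithmetic. The sketch below only restates that rule in plain Python. `classes_to_clean` is a hypothetical helper, not a library function, and it assumes the minority class itself is never a candidate, as with `sampling_strategy='auto'`. The value 5.0 is used purely to show the arithmetic; whether the library accepts thresholds above 1 is not verified here.

```python
from collections import Counter


def classes_to_clean(y, threshold_cleaning=0.5):
    """Hypothetical helper restating the class-size rule described above."""
    counts = Counter(y)
    minority_label, minority_size = min(counts.items(), key=lambda kv: kv[1])
    cutoff = minority_size * threshold_cleaning
    # Assumption: the minority class is excluded from the candidates.
    return sorted(
        label for label, n in counts.items()
        if label != minority_label and n > cutoff
    )


labels = ["a"] * 100 + ["b"] * 300 + ["c"] * 900
print(classes_to_clean(labels, threshold_cleaning=0.5))  # ['b', 'c']
print(classes_to_clean(labels, threshold_cleaning=5.0))  # ['c']
```

Under the rule as stated, any class at least as large as the minority passes the check whenever the threshold is below 1, so the default of 0.5 keeps every non-minority class eligible for Phase 2.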
Outputs
| Name | Type | Description |
|---|---|---|
| X_resampled | {ndarray, sparse matrix, dataframe} of shape (n_samples_new, n_features) | Feature matrix with noisy majority samples removed |
| y_resampled | ndarray of shape (n_samples_new,) | Target array after neighbourhood cleaning |
Key Attributes After Fitting
| Attribute | Type | Description |
|---|---|---|
| sampling_strategy_ | dict | Maps class labels to number of samples to sample |
| edited_nearest_neighbours_ | estimator object | The ENN object used for Phase 1 resampling |
| nn_ | estimator object | Validated K-Nearest Neighbours classifier used in Phase 2 |
| classes_to_clean_ | list | Classes considered for under-sampling during Phase 2 |
| sample_indices_ | ndarray of shape (n_new_samples,) | Indices of samples selected from the original dataset |
| n_features_in_ | int | Number of features in the input dataset |
| feature_names_in_ | ndarray of shape (n_features_in_,) | Names of features seen during fit (when X has string feature names) |
Usage Examples
Basic Usage
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import NeighbourhoodCleaningRule
# Create imbalanced dataset
X, y = make_classification(
    n_classes=2, class_sep=2, weights=[0.1, 0.9],
    n_informative=3, n_redundant=1, flip_y=0,
    n_features=20, n_clusters_per_class=1,
    n_samples=1000, random_state=10,
)
print(f"Original: {Counter(y)}")
# Original: Counter({1: 900, 0: 100})
# Apply NeighbourhoodCleaningRule
ncr = NeighbourhoodCleaningRule()
X_res, y_res = ncr.fit_resample(X, y)
print(f"Resampled: {Counter(y_res)}")
# Resampled: Counter({1: 888, 0: 100})
Custom ENN and Threshold
from imblearn.under_sampling import (
    EditedNearestNeighbours,
    NeighbourhoodCleaningRule,
)
# Configure custom ENN for Phase 1
custom_enn = EditedNearestNeighbours(n_neighbors=5, kind_sel="all")
ncr = NeighbourhoodCleaningRule(
    edited_nearest_neighbours=custom_enn,
    n_neighbors=5,
    threshold_cleaning=0.3,
)
X_res, y_res = ncr.fit_resample(X, y)
In a Pipeline
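from imblearn.pipeline import make_pipeline
from imblearn.under_sampling import NeighbourhoodCleaningRule
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
# Hold out a stratified test set (X, y as created in Basic Usage)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)
pipeline = make_pipeline(
    NeighbourhoodCleaningRule(),
    GradientBoostingClassifier(random_state=42),
)
# The sampler runs only during fit; predict sees the test data unchanged
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)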