Implementation:Scikit learn contrib Imbalanced learn NeighbourhoodCleaningRule

From Leeroopedia
Revision as of 16:38, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/Scikit_learn_contrib_Imbalanced_learn_NeighbourhoodCleaningRule.md)


Knowledge Sources
Domains Machine_Learning, Data_Preprocessing, Imbalanced_Learning
Last Updated 2026-02-09 03:00 GMT

Overview

A concrete under-sampling tool, provided by the imbalanced-learn library, that implements the Neighbourhood Cleaning Rule.

Description

The NeighbourhoodCleaningRule class extends BaseCleaningSampler and removes noisy majority-class samples in two phases. Phase 1 applies Edited Nearest Neighbours (ENN) to flag majority samples whose class disagrees with their neighbourhood. Phase 2 fits a K-Nearest Neighbours classifier, finds the minority samples it misclassifies, and flags the majority-class neighbours of those samples, restricted to classes that pass the threshold_cleaning size test. The union of the two flagged sets is removed.
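The two phases can be illustrated with a minimal NumPy sketch. This is not the library's implementation: it assumes brute-force Euclidean distances, a plain majority vote, and neighbourhoods that exclude the sample itself, and the helper names (knn_indices, majority_vote, ncr_sketch) are hypothetical.

```python
import numpy as np


def knn_indices(X, k):
    """Indices of the k nearest rows for every row, excluding the row itself."""
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)  # a sample is never its own neighbour
    return np.argsort(dist, axis=1)[:, :k]


def majority_vote(labels):
    """Most frequent label among the given neighbour labels."""
    values, counts = np.unique(labels, return_counts=True)
    return values[np.argmax(counts)]


def ncr_sketch(X, y, k=3, threshold=0.5):
    """Return the indices kept by a simplified two-phase cleaning (assumption-laden sketch)."""
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    neigh = knn_indices(X, k)
    votes = np.array([majority_vote(y[row]) for row in neigh])
    misclassified = votes != y

    # Phase 1 (ENN-like): non-minority samples that disagree with their neighbourhood
    a1 = np.flatnonzero(misclassified & (y != minority))

    # Phase 2: neighbours of misclassified minority samples,
    # restricted to classes larger than minority_size * threshold
    to_clean = [c for c, n in zip(classes, counts)
                if c != minority and n > counts.min() * threshold]
    bad_minority = np.flatnonzero(misclassified & (y == minority))
    candidates = np.unique(neigh[bad_minority].ravel())
    a2 = np.array([i for i in candidates if y[i] in to_clean], dtype=int)

    # Remove the union of both flagged sets
    removed = np.union1d(a1, a2)
    return np.setdiff1d(np.arange(len(y)), removed)


# Toy 1-D data: class 1 is the majority, class 0 the minority
X_demo = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0],
                   [10.0], [11.0], [12.0], [10.5], [2.4]])
y_demo = np.array([1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0])
kept = ncr_sketch(X_demo, y_demo)
print(kept.tolist())  # [0, 4, 5, 6, 7, 8, 10]
```

In this toy data, sample 9 (a majority point inside the minority cluster) is flagged in Phase 1, while samples 1, 2, and 3 are flagged in Phase 2 because they are the neighbours of the misclassified minority sample 10.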

Usage

Import this class when you need a neighbourhood-based cleaning approach that not only removes noisy majority samples (via ENN) but also targets majority samples that are harmful to minority class recognition.

Code Reference

Source Location

  • Repository: imbalanced-learn
  • File: imblearn/under_sampling/_prototype_selection/_neighbourhood_cleaning_rule.py
  • Lines: L1-239

Signature

class NeighbourhoodCleaningRule(BaseCleaningSampler):
    def __init__(
        self,
        *,
        sampling_strategy="auto",
        edited_nearest_neighbours=None,
        n_neighbors=3,
        threshold_cleaning=0.5,
        n_jobs=None,
    ):
        """
        Args:
            sampling_strategy: str, list, or callable - Classes targeted
                by the cleaning. 'auto' is equivalent to 'not minority',
                i.e. every class except the minority class.
            edited_nearest_neighbours: EditedNearestNeighbours or None -
                Custom ENN object for Phase 1 cleaning. Defaults to ENN
                with kind_sel="mode" and the specified n_neighbors.
            n_neighbors: int or KNeighborsMixin estimator - Number of
                nearest neighbours for the K-NN classifier in Phase 2
                (default: 3).
            threshold_cleaning: float - Threshold for deciding which
                classes to clean in Phase 2. A class is cleaned when
                its size > minority_size * threshold (default: 0.5).
            n_jobs: int or None - Number of parallel jobs.
        """

Import

from imblearn.under_sampling import NeighbourhoodCleaningRule

I/O Contract

Inputs

Name | Type | Required | Description
X | {array-like, sparse matrix, dataframe} of shape (n_samples, n_features) | Yes | Feature matrix of training data
y | array-like of shape (n_samples,) | Yes | Target labels
sampling_strategy | str, list, or callable | No | Classes targeted by the cleaning (default: 'auto')
edited_nearest_neighbours | EditedNearestNeighbours or None | No | Custom ENN object for Phase 1 (default: None)
n_neighbors | int or KNeighborsMixin estimator | No | Neighbours for K-NN classifier (default: 3)
threshold_cleaning | float | No | Class size threshold for Phase 2 cleaning (default: 0.5)
n_jobs | int or None | No | Number of parallel jobs (default: None)

Outputs

Name | Type | Description
X_resampled | {ndarray, sparse matrix, dataframe} of shape (n_samples_new, n_features) | Feature matrix with noisy majority samples removed
y_resampled | ndarray of shape (n_samples_new,) | Target array after neighbourhood cleaning

Key Attributes After Fitting

Attribute | Type | Description
sampling_strategy_ | dict | Maps class labels to number of samples to sample
edited_nearest_neighbours_ | estimator object | The ENN object used for Phase 1 resampling
nn_ | estimator object | Validated K-Nearest Neighbours classifier used in Phase 2
classes_to_clean_ | list | Classes considered for under-sampling during Phase 2
sample_indices_ | ndarray of shape (n_new_samples,) | Indices of samples selected from the original dataset
n_features_in_ | int | Number of features in the input dataset
feature_names_in_ | ndarray of shape (n_features_in_,) | Names of features seen during fit (when X has string feature names)

Usage Examples

Basic Usage

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import NeighbourhoodCleaningRule

# Create imbalanced dataset
X, y = make_classification(
    n_classes=2, class_sep=2, weights=[0.1, 0.9],
    n_informative=3, n_redundant=1, flip_y=0,
    n_features=20, n_clusters_per_class=1,
    n_samples=1000, random_state=10,
)
print(f"Original: {Counter(y)}")
# Original: Counter({1: 900, 0: 100})

# Apply NeighbourhoodCleaningRule
ncr = NeighbourhoodCleaningRule()
X_res, y_res = ncr.fit_resample(X, y)
print(f"Resampled: {Counter(y_res)}")
# Resampled: Counter({1: 888, 0: 100})

Custom ENN and Threshold

from imblearn.under_sampling import (
    EditedNearestNeighbours,
    NeighbourhoodCleaningRule,
)

# Configure custom ENN for Phase 1
custom_enn = EditedNearestNeighbours(n_neighbors=5, kind_sel="all")

ncr = NeighbourhoodCleaningRule(
    edited_nearest_neighbours=custom_enn,
    n_neighbors=5,
    threshold_cleaning=0.3,
)
X_res, y_res = ncr.fit_resample(X, y)

In a Pipeline

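from imblearn.pipeline import make_pipeline
from imblearn.under_sampling import NeighbourhoodCleaningRule
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# X, y as created in Basic Usage; hold out a stratified test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)

# The sampler runs only during fit; predict leaves the test data untouched
pipeline = make_pipeline(
    NeighbourhoodCleaningRule(),
    GradientBoostingClassifier(random_state=42),
)
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)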

Related Pages

Implements Principle

Requires Environment
