Implementation:Scikit learn contrib Imbalanced learn CondensedNearestNeighbour

From Leeroopedia


Knowledge Sources
Domains: Machine_Learning, Data_Preprocessing, Imbalanced_Learning
Last Updated: 2026-02-09 03:00 GMT

Overview

Under-sampling technique that builds a consistent subset of the training data: it keeps every minority sample and adds only those majority samples that a 1-nearest-neighbor classifier fitted on the current subset misclassifies.

Description

The CondensedNearestNeighbour class implements the Condensed Nearest Neighbour (CNN) rule for under-sampling majority class instances. It extends BaseCleaningSampler. For each targeted majority class, it initializes a subset with all minority samples plus n_seeds_S randomly chosen samples from that class, then iteratively adds misclassified samples of that class until the subset is consistent, i.e., until a 1-NN classifier using only the subset correctly classifies every sample in the original training set. The class integrates with scikit-learn's estimator API, supporting pipeline composition and parameter validation. Multi-class resampling follows a one (minority) vs. each other class scheme: every targeted class is condensed against the minority class in turn.
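The procedure described above can be illustrated with a minimal, standard-library-only sketch. This is not the imbalanced-learn implementation: the helper names `nearest_label` and `condense` are hypothetical, the sketch handles only two classes, uses squared Euclidean distance, and makes a single pass over the majority samples.

```python
import random


def nearest_label(point, subset):
    """Return the label of the subset sample closest to `point` (1-NN)."""
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return min(subset, key=lambda s: sq_dist(point, s[0]))[1]


def condense(samples, minority_label, n_seeds=1, seed=0):
    """Hypothetical single-pass condensation for a two-class problem.

    `samples` is a list of (features, label) pairs; returns the kept indices.
    """
    rng = random.Random(seed)
    minority = [i for i, (_, lab) in enumerate(samples) if lab == minority_label]
    majority = [i for i, (_, lab) in enumerate(samples) if lab != minority_label]

    # Start from all minority samples plus a few random majority seeds.
    kept = minority + rng.sample(majority, n_seeds)
    kept_set = set(kept)

    for i in majority:
        if i in kept_set:
            continue
        features, label = samples[i]
        subset = [samples[j] for j in kept]
        # Keep a majority sample only if the current subset misclassifies it.
        if nearest_label(features, subset) != label:
            kept.append(i)
            kept_set.add(i)
    return sorted(kept)


# Toy 1-D data: a small minority cluster near 0, a large majority cluster near 10.
data = [((float(x),), 0) for x in range(3)] + [((10.0 + x,), 1) for x in range(20)]
kept = condense(data, minority_label=0)
print(len(data), "->", len(kept))
```

On this toy data most majority points lie far from the minority cluster, so a seed and at most a handful of boundary points are enough to classify the rest correctly, which is why the condensed set is much smaller than the input.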

Usage

Import this class when you need to reduce the size of a majority class while preserving samples near the decision boundary. Use it as a standalone resampler via fit_resample() or as a step in an imblearn.pipeline.Pipeline. CNN is most effective when the majority class contains large regions of redundant samples far from the decision boundary.

Code Reference

Source Location

  • Repository: imbalanced-learn
  • File: imblearn/under_sampling/_prototype_selection/_condensed_nearest_neighbour.py
  • Lines: L1-247

Signature

class CondensedNearestNeighbour(BaseCleaningSampler):
    def __init__(
        self,
        *,
        sampling_strategy="auto",
        random_state=None,
        n_neighbors=None,
        n_seeds_S=1,
        n_jobs=None,
    ):
        """
        Args:
            sampling_strategy: str, list, or callable - Classes targeted by
                the resampling. 'auto' (equivalent to 'not minority') targets
                all classes except the minority class.
            random_state: int, RandomState, or None - Seed for reproducibility
                when selecting the initial majority seed samples.
            n_neighbors: int, KNeighborsClassifier, or None - Number of
                nearest neighbors for classification. None defaults to 1-NN.
            n_seeds_S: int - Number of initial majority samples to seed the
                condensed set (default: 1).
            n_jobs: int or None - Number of parallel jobs for the nearest
                neighbor classifier.
        """

Import

from imblearn.under_sampling import CondensedNearestNeighbour

I/O Contract

Inputs

Name | Type | Required | Description
--- | --- | --- | ---
X | {array-like, sparse matrix, dataframe} of shape (n_samples, n_features) | Yes | Feature matrix of training data
y | array-like of shape (n_samples,) | Yes | Target labels indicating class membership
sampling_strategy | str, list, or callable | No | Classes targeted by the resampling; 'auto' targets all classes except the minority
n_neighbors | int, KNeighborsClassifier, or None | No | Neighbor count or estimator used for classification (default: None, i.e. 1-NN)
n_seeds_S | int | No | Number of random majority seeds to initialize the condensed set (default: 1)
random_state | int, RandomState, or None | No | Random seed for reproducibility
n_jobs | int or None | No | Number of parallel jobs for the nearest neighbor search

Outputs

Name | Type | Description
--- | --- | ---
X_resampled | {ndarray, sparse matrix, dataframe} of shape (n_samples_new, n_features) | Feature matrix with redundant majority samples removed
y_resampled | ndarray of shape (n_samples_new,) | Target array with corresponding labels for the condensed subset

Attributes

Name | Type | Description
--- | --- | ---
sampling_strategy_ | dict | Sampling information; the keys are the class labels targeted for under-sampling
estimators_ | list of KNeighborsClassifier | One fitted nearest-neighbor estimator per resampled class
sample_indices_ | ndarray of shape (n_new_samples,) | Indices of selected samples from the original dataset
n_features_in_ | int | Number of features seen during fit
feature_names_in_ | ndarray of shape (n_features_in_,) | Feature names seen during fit (when X has string feature names)

Usage Examples

Basic Under-sampling

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import CondensedNearestNeighbour

# 1. Create an imbalanced dataset
X, y = make_classification(
    n_classes=2, class_sep=2, weights=[0.1, 0.9],
    n_informative=3, n_redundant=1, flip_y=0,
    n_features=20, n_clusters_per_class=1,
    n_samples=1000, random_state=10,
)
print(f"Original: {Counter(y)}")

# 2. Apply Condensed Nearest Neighbour
cnn = CondensedNearestNeighbour(random_state=42)
X_resampled, y_resampled = cnn.fit_resample(X, y)
print(f"Resampled: {Counter(y_resampled)}")

Inside a Pipeline

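from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.svm import LinearSVC
from imblearn.pipeline import make_pipeline
from imblearn.under_sampling import CondensedNearestNeighbour

# Recreate the imbalanced dataset from the basic example
X, y = make_classification(
    n_classes=2, class_sep=2, weights=[0.1, 0.9],
    n_informative=3, n_redundant=1, flip_y=0,
    n_features=20, n_clusters_per_class=1,
    n_samples=1000, random_state=10,
)

# Build a pipeline: CNN under-sampling followed by a linear classifier
pipeline = make_pipeline(
    CondensedNearestNeighbour(random_state=42),
    LinearSVC(),
)

# Cross-validate (CNN is applied only when fitting on the training folds)
scores = cross_validate(pipeline, X, y, scoring="balanced_accuracy", cv=5)
print(f"Mean balanced accuracy: {scores['test_score'].mean():.3f}")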

Custom Neighbor Count

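from sklearn.datasets import make_classification
from imblearn.under_sampling import CondensedNearestNeighbour

# Recreate the imbalanced dataset from the basic example
X, y = make_classification(
    n_classes=2, class_sep=2, weights=[0.1, 0.9],
    n_informative=3, n_redundant=1, flip_y=0,
    n_features=20, n_clusters_per_class=1,
    n_samples=1000, random_state=10,
)

# Use 3-NN instead of the default 1-NN
cnn = CondensedNearestNeighbour(
    n_neighbors=3,
    n_seeds_S=1,
    random_state=42,
)
X_res, y_res = cnn.fit_resample(X, y)

# Inspect which samples were retained
print(f"Retained indices: {cnn.sample_indices_}")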
