Implementation:Scikit learn contrib Imbalanced learn CondensedNearestNeighbour
| Knowledge Sources | |
|---|---|
| Domains | Machine_Learning, Data_Preprocessing, Imbalanced_Learning |
| Last Updated | 2026-02-09 03:00 GMT |
Overview
Under-sampling technique that builds a condensed subset of the training data: it keeps every minority sample and adds only those majority samples that a 1-nearest-neighbor classifier trained on the current subset misclassifies.
Description
The CondensedNearestNeighbour class implements the Condensed Nearest Neighbour (CNN) rule for under-sampling majority class instances. It extends BaseCleaningSampler. For each targeted class, it initializes a subset with all minority samples plus n_seeds_S randomly chosen samples of that class, then adds the samples of that class that a nearest-neighbor classifier fitted on the current subset misclassifies. The aim is a consistent subset, i.e., one from which a 1-NN classifier correctly classifies every sample of the original training set. The class integrates with scikit-learn's estimator API, supporting pipeline composition and parameter validation. Multi-class resampling is handled by pairing the minority class with each of the other classes in turn.
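The condensation rule described above can be sketched with NumPy alone. This is a minimal illustration for the binary case, not the library's implementation: the helper names `one_nn_predict` and `condense`, the brute-force distance computation, and the toy Gaussian data are all assumptions made for the example.

```python
import numpy as np


def one_nn_predict(X_ref, y_ref, X_query):
    """Label each query point with the label of its nearest reference point."""
    # Squared Euclidean distances, shape (n_query, n_ref)
    d2 = ((X_query[:, None, :] - X_ref[None, :, :]) ** 2).sum(axis=2)
    return y_ref[d2.argmin(axis=1)]


def condense(X, y, minority_label, n_seeds=1, seed=0):
    """Hypothetical helper: indices of a 1-NN-consistent subset (binary case)."""
    rng = np.random.default_rng(seed)
    minority_idx = np.flatnonzero(y == minority_label)
    majority_idx = np.flatnonzero(y != minority_label)
    # Start from every minority sample plus a few random majority seeds.
    seeds = rng.choice(majority_idx, size=n_seeds, replace=False)
    keep = list(minority_idx) + list(seeds)
    kept = set(keep)
    changed = True
    while changed:  # repeat full passes until nothing new is added
        changed = False
        for i in majority_idx:
            if i in kept:
                continue
            pred = one_nn_predict(X[keep], y[keep], X[i:i + 1])[0]
            if pred != y[i]:  # misclassified, so the subset needs this sample
                keep.append(i)
                kept.add(i)
                changed = True
    return np.sort(np.array(keep))


# Toy data (an assumption for illustration): 20 minority vs 200 majority points
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (20, 2)), rng.normal(4.0, 1.0, (200, 2))])
y = np.array([0] * 20 + [1] * 200)

idx = condense(X, y, minority_label=0)
print(f"kept {len(idx)} of {len(y)} samples")
```

The sketch repeats full passes over the majority class until a pass adds nothing, which is what makes the final subset consistent on data without duplicated points. The library class may organize its loop differently, so treat this as a description of the rule rather than of `CondensedNearestNeighbour` internals.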
Usage
Import this class when you need to reduce the size of a majority class while preserving samples near the decision boundary. Use it as a standalone resampler via fit_resample() or as a step in an imblearn.pipeline.Pipeline. CNN is most effective when the majority class contains large regions of redundant samples far from the decision boundary.
Code Reference
Source Location
- Repository: imbalanced-learn
- File: imblearn/under_sampling/_prototype_selection/_condensed_nearest_neighbour.py
- Lines: L1-247
Signature
class CondensedNearestNeighbour(BaseCleaningSampler):
    def __init__(
        self,
        *,
        sampling_strategy="auto",
        random_state=None,
        n_neighbors=None,
        n_seeds_S=1,
        n_jobs=None,
    ):
        """
        Args:
            sampling_strategy: str, list, or callable - Classes targeted
                by the resampling. 'auto' targets all classes except
                the minority class. A dict is not accepted by cleaning
                samplers.
            random_state: int, RandomState, or None - Seed for reproducibility
                when selecting the initial majority seed samples.
            n_neighbors: int, KNeighborsClassifier, or None - Number of
                nearest neighbors for classification. None defaults to 1-NN.
            n_seeds_S: int - Number of initial majority samples to seed the
                condensed set (default: 1).
            n_jobs: int or None - Number of parallel jobs for the nearest
                neighbor classifier.
        """
Import
from imblearn.under_sampling import CondensedNearestNeighbour
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| X | {array-like, sparse matrix, dataframe} of shape (n_samples, n_features) | Yes | Feature matrix of training data |
| y | array-like of shape (n_samples,) | Yes | Target labels indicating class membership |
| sampling_strategy | str, list, or callable | No | Classes to under-sample; 'auto' targets all classes except the minority |
| n_neighbors | int, KNeighborsClassifier, or None | No | Neighbor count, or a nearest-neighbor estimator, used to classify samples (default: None, i.e. 1-NN) |
| n_seeds_S | int | No | Number of random majority seeds to initialize the condensed set (default: 1) |
| random_state | int, RandomState, or None | No | Random seed for reproducibility |
| n_jobs | int or None | No | Number of parallel jobs for the nearest neighbor search |
Outputs
| Name | Type | Description |
|---|---|---|
| X_resampled | {ndarray, sparse matrix, dataframe} of shape (n_samples_new, n_features) | Feature matrix with redundant majority samples removed |
| y_resampled | ndarray of shape (n_samples_new,) | Target array with corresponding labels for the condensed subset |
Attributes
| Name | Type | Description |
|---|---|---|
| sampling_strategy_ | dict | Maps class labels to the number of samples to remove |
| estimators_ | list of KNeighborsClassifier | One fitted nearest-neighbor estimator per resampled class (1-NN unless n_neighbors is set) |
| sample_indices_ | ndarray of shape (n_samples_new,) | Indices of selected samples from the original dataset |
| n_features_in_ | int | Number of features seen during fit |
| feature_names_in_ | ndarray of shape (n_features_in_,) | Feature names seen during fit (when X has string feature names) |
Usage Examples
Basic Under-sampling
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.under_sampling import CondensedNearestNeighbour

# 1. Create an imbalanced dataset
X, y = make_classification(
    n_classes=2, class_sep=2, weights=[0.1, 0.9],
    n_informative=3, n_redundant=1, flip_y=0,
    n_features=20, n_clusters_per_class=1,
    n_samples=1000, random_state=10,
)
print(f"Original: {Counter(y)}")

# 2. Apply Condensed Nearest Neighbour
cnn = CondensedNearestNeighbour(random_state=42)
X_resampled, y_resampled = cnn.fit_resample(X, y)
print(f"Resampled: {Counter(y_resampled)}")
Inside a Pipeline
from imblearn.pipeline import make_pipeline
from imblearn.under_sampling import CondensedNearestNeighbour
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_validate

# X, y: the imbalanced dataset created in the basic example

# Build pipeline with CNN + classifier
pipeline = make_pipeline(
    CondensedNearestNeighbour(random_state=42),
    LinearSVC(),
)

# Cross-validate (CNN applied only to training folds)
scores = cross_validate(pipeline, X, y, scoring="balanced_accuracy", cv=5)
print(f"Mean balanced accuracy: {scores['test_score'].mean():.3f}")
Custom Neighbor Count
from imblearn.under_sampling import CondensedNearestNeighbour

# X, y: the imbalanced dataset created in the basic example

# Use 3-NN instead of the default 1-NN
cnn = CondensedNearestNeighbour(
    n_neighbors=3,
    n_seeds_S=1,
    random_state=42,
)
X_res, y_res = cnn.fit_resample(X, y)

# Inspect which samples were retained
print(f"Retained indices: {cnn.sample_indices_}")