Implementation:Scikit learn contrib Imbalanced learn NearMiss

Knowledge Sources	Scikit_learn_contrib_Imbalanced_learn NearMiss
Domains	Machine_Learning, Under_Sampling, Imbalanced_Learning
Last Updated	2026-02-09 03:00 GMT

Overview

A concrete tool from the imbalanced-learn library that under-samples majority classes using nearest-neighbor distance heuristics.

Description

The NearMiss class implements three distance-based under-sampling strategies. Each one reduces the majority class by ranking its samples on their distance to minority samples and keeping only the best-ranked ones. The class extends BaseUnderSampler and integrates with scikit-learn's estimator API, supporting pipeline composition, parameter validation, and metadata routing.

Three algorithm variants are available:

NearMiss-1 selects majority class samples whose average distance to their k nearest minority neighbors is smallest. This retains majority samples that are close to the minority class boundary.
NearMiss-2 selects majority class samples whose average distance to their k farthest minority neighbors is smallest. This retains majority samples that are close to all minority samples, not just the nearest ones.
NearMiss-3 operates in two phases. First, for each minority sample, the algorithm identifies its nearest majority neighbors (their number is set by n_neighbors_ver3); the union of these neighbors forms a candidate subset. Then, from this candidate subset, it selects the majority samples whose average distance to their k nearest minority neighbors (k is set by n_neighbors) is largest. This keeps majority samples that provide the widest margin around minority instances.

The algorithm fits a k-nearest neighbors model on the minority class and computes distances between majority class samples and minority class samples. These distances drive the selection of which majority samples to retain.
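The three selection rules can be illustrated with a small NumPy sketch. This is not the library's implementation: the helper names (`pairwise_distances`, `mean_distance_to_minority`, `nearmiss_1`, `nearmiss_2`, `nearmiss_3`) and the toy points are hypothetical, distances are assumed to be Euclidean, and brute-force pairwise distances stand in for the fitted k-nearest neighbors model.

```python
import numpy as np


def pairwise_distances(A, B):
    """Euclidean distances between rows of A and rows of B, shape (len(A), len(B))."""
    return np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)


def mean_distance_to_minority(X_maj, X_min, k, farthest=False):
    """Average distance from each majority sample to k minority samples."""
    dist = np.sort(pairwise_distances(X_maj, X_min), axis=1)
    picked = dist[:, -k:] if farthest else dist[:, :k]
    return picked.mean(axis=1)


def nearmiss_1(X_maj, X_min, n_keep, k):
    # Keep the majority samples with the smallest mean distance
    # to their k nearest minority samples.
    scores = mean_distance_to_minority(X_maj, X_min, k)
    return np.sort(np.argsort(scores)[:n_keep])


def nearmiss_2(X_maj, X_min, n_keep, k):
    # Keep the majority samples with the smallest mean distance
    # to their k farthest minority samples.
    scores = mean_distance_to_minority(X_maj, X_min, k, farthest=True)
    return np.sort(np.argsort(scores)[:n_keep])


def nearmiss_3(X_maj, X_min, n_keep, k, k_ver3):
    # Phase 1: each minority sample nominates its k_ver3 nearest majority samples.
    dist = pairwise_distances(X_min, X_maj)
    candidates = np.unique(np.argsort(dist, axis=1)[:, :k_ver3])
    # Phase 2: among the candidates, keep those with the largest mean
    # distance to their k nearest minority samples.
    scores = mean_distance_to_minority(X_maj[candidates], X_min, k)
    return np.sort(candidates[np.argsort(scores)[::-1][:n_keep]])


# Toy data (hypothetical): two minority points and four majority points.
X_min = np.array([[0.0, 0.0], [4.0, 0.0]])
X_maj = np.array([[0.2, 0.0], [2.0, 0.5], [2.0, -1.0], [5.0, 0.0]])

idx_v1 = nearmiss_1(X_maj, X_min, n_keep=2, k=1)
idx_v2 = nearmiss_2(X_maj, X_min, n_keep=2, k=1)
idx_v3 = nearmiss_3(X_maj, X_min, n_keep=2, k=1, k_ver3=2)

print(idx_v1.tolist())  # [0, 3]
print(idx_v2.tolist())  # [1, 2]
print(idx_v3.tolist())  # [1, 3]
```

On this toy data the variants disagree: NearMiss-1 keeps the two points hugging a single minority sample, NearMiss-2 keeps the two points lying between both minority samples, and NearMiss-3 discards the closest candidate in its second phase.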

Usage

Import this class when you need to reduce the number of majority class samples to balance a dataset before training a classifier. Use it as a standalone resampler via fit_resample() or as a step in an imblearn.pipeline.Pipeline. NearMiss is especially useful when you want an informed under-sampling strategy that considers the geometric relationship between majority and minority classes, rather than removing majority samples at random.

Code Reference

Source Location

Repository: Scikit_learn_contrib_Imbalanced_learn
File: imblearn/under_sampling/_prototype_selection/_nearmiss.py
Lines: 24-322

Signature

class NearMiss(BaseUnderSampler):
    def __init__(
        self,
        *,
        sampling_strategy="auto",
        version=1,
        n_neighbors=3,
        n_neighbors_ver3=3,
        n_jobs=None,
    ):
        """
        Args:
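            sampling_strategy: str, float, dict, or callable - Sampling target.
                A float is the ratio of minority to majority samples after
                resampling (binary problems only); a dict maps class labels
                to sample counts. 'auto' reduces every class except the
                minority to the minority class size.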
            version: int (1, 2, or 3) - Version of the NearMiss algorithm to use.
            n_neighbors: int or KNeighborsMixin estimator - Number of nearest
                neighbors used to compute average distances (default: 3).
            n_neighbors_ver3: int or KNeighborsMixin estimator - Number of
                nearest neighbors used in the initial candidate selection phase
                of NearMiss-3 (default: 3).
            n_jobs: int or None - Number of parallel jobs for nearest neighbor
                search. None means 1 unless inside a joblib.parallel_backend.
        """

Import

from imblearn.under_sampling import NearMiss

I/O Contract

Inputs

Name	Type	Required	Description
X	{array-like, sparse matrix, dataframe} of shape (n_samples, n_features)	Yes	Feature matrix of training data
y	array-like of shape (n_samples,)	Yes	Target labels indicating class membership
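sampling_strategy	str, float, dict, or callable	No	Sampling target: a float is the minority-to-majority ratio after resampling (binary problems only), a dict maps class labels to sample counts; 'auto' reduces every class except the minority to the minority class size (default: 'auto')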
version	int (1, 2, or 3)	No	NearMiss algorithm version to use (default: 1)
n_neighbors	int or KNeighborsMixin estimator	No	Number of nearest neighbors for distance computation (default: 3)
n_neighbors_ver3	int or KNeighborsMixin estimator	No	Number of nearest neighbors for the candidate selection phase in NearMiss-3 (default: 3)
n_jobs	int or None	No	Number of parallel jobs for nearest neighbor search (default: None)
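
How sampling_strategy translates into per-class target counts can be sketched in plain Python. The helper below is hypothetical (`under_sampling_targets` is not part of imbalanced-learn) and covers only the 'auto', float, and dict forms. It assumes the float form is the ratio of minority samples to majority samples after resampling, following the description given in the signature.

```python
from collections import Counter


def under_sampling_targets(y, sampling_strategy="auto"):
    """Return {class_label: n_samples_to_keep} for the classes to under-sample."""
    counts = Counter(y)
    minority_label, n_minority = min(counts.items(), key=lambda item: item[1])
    others = [label for label in counts if label != minority_label]

    if sampling_strategy == "auto":
        # Every class except the minority is reduced to the minority size.
        return {label: n_minority for label in others}
    if isinstance(sampling_strategy, float):
        # ratio = n_minority / n_majority_after, so solve for n_majority_after.
        return {label: int(n_minority / sampling_strategy) for label in others}
    if isinstance(sampling_strategy, dict):
        # Counts are given explicitly; unlisted classes are left untouched.
        return dict(sampling_strategy)
    raise ValueError("form not covered by this sketch")


y_demo = [0] * 100 + [1] * 900
print(under_sampling_targets(y_demo))            # {1: 100}
print(under_sampling_targets(y_demo, 0.5))       # {1: 200}
print(under_sampling_targets(y_demo, {1: 300}))  # {1: 300}
```

The minority class never appears in the result for the 'auto' and float forms, which reflects that under-sampling only removes samples from the larger classes.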

Outputs

Name	Type	Description
X_resampled	{ndarray, sparse matrix, dataframe} of shape (n_samples_new, n_features)	Feature matrix with majority class samples reduced
y_resampled	ndarray of shape (n_samples_new,)	Target array with corresponding labels after under-sampling

Attributes

Name	Type	Description
sampling_strategy_	dict	Dictionary mapping class labels to the number of samples to select for each class
nn_	estimator object	Validated K-nearest Neighbours estimator created from the n_neighbors parameter
nn_ver3_	estimator object	Validated K-nearest Neighbours estimator created from the n_neighbors_ver3 parameter (only set when version=3)
sample_indices_	ndarray of shape (n_new_samples,)	Indices of the samples selected from the original dataset
n_features_in_	int	Number of features in the input dataset
feature_names_in_	ndarray of shape (n_features_in_,)	Names of features seen during fit (only when X has string feature names)

Usage Examples

Basic Under-Sampling with NearMiss-1

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import NearMiss

# 1. Create an imbalanced dataset
X, y = make_classification(
    n_classes=2, class_sep=2, weights=[0.1, 0.9],
    n_informative=3, n_redundant=1, flip_y=0,
    n_features=20, n_clusters_per_class=1,
    n_samples=1000, random_state=10,
)
print(f"Original: {Counter(y)}")
# Original: Counter({1: 900, 0: 100})

# 2. Apply NearMiss-1 (default)
nm = NearMiss()
X_resampled, y_resampled = nm.fit_resample(X, y)
print(f"Resampled: {Counter(y_resampled)}")
# Resampled: Counter({0: 100, 1: 100})

Selecting a NearMiss Version

from imblearn.under_sampling import NearMiss

# Reuses X and y from the basic example.
# NearMiss-2: keeps majority samples with the smallest average distance
# to their farthest minority neighbors
nm2 = NearMiss(version=2, n_neighbors=5)
X_res2, y_res2 = nm2.fit_resample(X, y)

# NearMiss-3: two-phase selection with candidate pre-filtering
nm3 = NearMiss(version=3, n_neighbors=3, n_neighbors_ver3=3)
X_res3, y_res3 = nm3.fit_resample(X, y)

Inside a Pipeline

from imblearn.pipeline import make_pipeline
from imblearn.under_sampling import NearMiss
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_validate

# Build pipeline with NearMiss + classifier (X and y come from the basic example)
pipeline = make_pipeline(NearMiss(version=1), LinearSVC())

# Cross-validate (NearMiss applied only to training folds)
scores = cross_validate(pipeline, X, y, scoring="balanced_accuracy", cv=5)
print(f"Mean balanced accuracy: {scores['test_score'].mean():.3f}")

Custom Sampling Strategy

from imblearn.under_sampling import NearMiss

# Specify the exact number of samples to retain per majority class
# (X and y come from the basic example)
nm = NearMiss(
    sampling_strategy={1: 200},  # Retain 200 majority class samples
    version=1,
    n_neighbors=5,
)
X_res, y_res = nm.fit_resample(X, y)

Accessing Selected Sample Indices

from imblearn.under_sampling import NearMiss

# Reuses X and y from the basic example
nm = NearMiss(version=1)
X_res, y_res = nm.fit_resample(X, y)

# Retrieve indices of samples retained from the original dataset
selected_indices = nm.sample_indices_
print(f"Number of selected samples: {len(selected_indices)}")

Related Pages

Implements Principle

Principle:Scikit_learn_contrib_Imbalanced_learn_NearMiss_Under_Sampling

Requires Environment

Environment:Scikit_learn_contrib_Imbalanced_learn_Python_Scikit_learn
