
Implementation:Scikit learn contrib Imbalanced learn NearMiss

From Leeroopedia


Knowledge Sources
Domains: Machine_Learning, Under_Sampling, Imbalanced_Learning
Last Updated: 2026-02-09 03:00 GMT

Overview

A concrete tool, provided by the imbalanced-learn library, for under-sampling majority classes using nearest-neighbor distance heuristics.

Description

The NearMiss class implements three distance-based under-sampling strategies. Each one shrinks the majority class by keeping only the samples chosen by a rule based on their distances to minority class samples. The class extends BaseUnderSampler and integrates with scikit-learn's estimator API, supporting pipeline composition, parameter validation, and metadata routing.

Three algorithm variants are available:

  • NearMiss-1 selects majority class samples whose average distance to their k nearest minority neighbors is smallest. This retains majority samples that are close to the minority class boundary.
  • NearMiss-2 selects majority class samples whose average distance to their k farthest minority neighbors is smallest. This retains majority samples that are close to all minority samples, not just the nearest ones.
  • NearMiss-3 operates in two phases. First, for each minority sample, the algorithm identifies the k nearest majority neighbors, forming a candidate subset. Then from this candidate subset, it selects the majority samples whose average distance to their k nearest minority neighbors is largest. This keeps majority samples that provide the widest margin around minority instances.
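
The three selection rules above can be sketched in plain NumPy. This is a simplified illustration under stated assumptions, not the library's implementation: the helper names `pairwise_dist` and `nearmiss_sketch` are hypothetical, distances are Euclidean, ties are broken by original order, and edge cases (such as too few candidates in NearMiss-3) are ignored.

```python
import numpy as np

def pairwise_dist(a, b):
    """Euclidean distance from every row of a to every row of b."""
    return np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)

def nearmiss_sketch(X_maj, X_min, n_keep, k=3, k_ver3=3, version=1):
    """Return indices into X_maj of the majority samples to keep (hypothetical helper)."""
    if version in (1, 2):
        # Distances from each majority sample to all minority samples, ascending
        d = np.sort(pairwise_dist(X_maj, X_min), axis=1)
        # v1: mean distance to the k nearest; v2: mean distance to the k farthest
        score = d[:, :k].mean(axis=1) if version == 1 else d[:, -k:].mean(axis=1)
        # Keep the majority samples with the smallest score
        return np.argsort(score, kind="stable")[:n_keep]
    if version == 3:
        # Phase 1: each minority sample nominates its k_ver3 nearest majority samples
        nearest_maj = np.argsort(pairwise_dist(X_min, X_maj), axis=1, kind="stable")
        cand = np.unique(nearest_maj[:, :k_ver3])
        # Phase 2: among candidates, keep those with the LARGEST mean distance
        # to their k nearest minority samples
        d = np.sort(pairwise_dist(X_maj[cand], X_min), axis=1)
        score = d[:, :k].mean(axis=1)
        return cand[np.argsort(-score, kind="stable")[:n_keep]]
    raise ValueError("version must be 1, 2, or 3")

# Tiny one-dimensional toy data (illustrative values only)
X_min = np.array([[0.0], [1.0], [6.0]])
X_maj = np.array([[-1.5], [3.0], [7.0], [20.0]])

keep_v1 = nearmiss_sketch(X_maj, X_min, n_keep=2, k=1, version=1)
keep_v2 = nearmiss_sketch(X_maj, X_min, n_keep=2, k=1, version=2)
keep_v3 = nearmiss_sketch(X_maj, X_min, n_keep=2, k=1, k_ver3=1, version=3)
print(keep_v1.tolist())  # [2, 0]
print(keep_v2.tolist())  # [1, 2]
print(keep_v3.tolist())  # [1, 0]
```

On this toy data the outlying majority point at 20.0 is never kept: it scores poorly under versions 1 and 2, and under version 3 no minority sample nominates it as a candidate.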

The algorithm fits a k-nearest neighbors model on the minority class and computes distances between majority class samples and minority class samples. These distances drive the selection of which majority samples to retain.
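
As a rough illustration of that step, the sketch below fits scikit-learn's NearestNeighbors on a toy minority set and queries it with majority samples. The data values and variable names are made up for the example, and the ranking at the end follows the NearMiss-1 rule only; the library's internals may differ in detail.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Toy data: three minority points and four majority points on a line
X_min = np.array([[0.0], [1.0], [6.0]])
X_maj = np.array([[-1.5], [3.0], [7.0], [20.0]])

# Fit the neighbor model on the minority class only
nn = NearestNeighbors(n_neighbors=2).fit(X_min)

# For every majority sample, distances to its 2 nearest minority samples
dist, idx = nn.kneighbors(X_maj)

# NearMiss-1 style score: average distance to those nearest minority samples
mean_dist = dist.mean(axis=1)

# Keep the two majority samples with the smallest average distance
keep = np.argsort(mean_dist, kind="stable")[:2]
print(mean_dist.tolist())  # [2.0, 2.5, 3.5, 16.5]
print(keep.tolist())       # [0, 1]
```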

Usage

Import this class when you need to reduce the number of majority class samples to balance a dataset before training a classifier. Use it as a standalone resampler via fit_resample() or as a step in an imblearn.pipeline.Pipeline. NearMiss is especially useful when you want an informed under-sampling strategy that considers the geometric relationship between majority and minority classes, rather than removing majority samples at random.

Code Reference

Source Location

Signature

class NearMiss(BaseUnderSampler):
    def __init__(
        self,
        *,
        sampling_strategy="auto",
        version=1,
        n_neighbors=3,
        n_neighbors_ver3=3,
        n_jobs=None,
    ):
        """
        Args:
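            sampling_strategy: str, float, dict, or callable - How many samples
                to keep. A float is the desired minority-to-majority ratio after
                resampling (binary problems only); a dict maps class labels to
                target counts. 'auto' under-samples every class except the
                minority so that all classes end up the same size.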
            version: int (1, 2, or 3) - Version of the NearMiss algorithm to use.
            n_neighbors: int or KNeighborsMixin estimator - Number of nearest
                neighbors used to compute average distances (default: 3).
            n_neighbors_ver3: int or KNeighborsMixin estimator - Number of
                nearest neighbors used in the initial candidate selection phase
                of NearMiss-3 (default: 3).
            n_jobs: int or None - Number of parallel jobs for nearest neighbor
                search. None means 1 unless inside a joblib.parallel_backend
                context.
        """

Import

from imblearn.under_sampling import NearMiss

I/O Contract

Inputs

Name | Type | Required | Description
X | {array-like, sparse matrix, dataframe} of shape (n_samples, n_features) | Yes | Feature matrix of training data
y | array-like of shape (n_samples,) | Yes | Target labels indicating class membership
sampling_strategy | str, float, dict, or callable | No | Resampling ratio; 'auto' equalizes all classes (default: 'auto')
version | int (1, 2, or 3) | No | NearMiss algorithm version to use (default: 1)
n_neighbors | int or KNeighborsMixin estimator | No | Number of nearest neighbors for distance computation (default: 3)
n_neighbors_ver3 | int or KNeighborsMixin estimator | No | Number of nearest neighbors for the candidate selection phase in NearMiss-3 (default: 3)
n_jobs | int or None | No | Number of parallel jobs for nearest neighbor search (default: None)
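
To make the sampling_strategy forms concrete, here is a small pure-Python sketch of how a string, float, or dict value could translate into per-class target counts for a binary problem. The helper `planned_majority_counts` is hypothetical and written only for illustration; it assumes a float is interpreted as (minority count) / (majority count after resampling) and does not cover callables or the other string options.

```python
from collections import Counter

def planned_majority_counts(y, sampling_strategy="auto"):
    """Hypothetical helper: target sample counts for the classes to under-sample."""
    counts = Counter(y)
    minority_label, n_min = min(counts.items(), key=lambda kv: kv[1])
    others = [label for label in counts if label != minority_label]
    if sampling_strategy == "auto":
        # Shrink every non-minority class to the minority class size
        return {label: n_min for label in others}
    if isinstance(sampling_strategy, float):
        # Assumed meaning: ratio = n_minority / n_majority_after_resampling
        return {label: int(n_min / sampling_strategy) for label in others}
    if isinstance(sampling_strategy, dict):
        # Explicit class -> target count mapping
        return dict(sampling_strategy)
    raise TypeError("unsupported sampling_strategy in this sketch")

y = [0] * 100 + [1] * 900
print(planned_majority_counts(y))            # {1: 100}
print(planned_majority_counts(y, 0.5))       # {1: 200}
print(planned_majority_counts(y, {1: 300}))  # {1: 300}
```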

Outputs

Name | Type | Description
X_resampled | {ndarray, sparse matrix, dataframe} of shape (n_samples_new, n_features) | Feature matrix with majority class samples reduced
y_resampled | ndarray of shape (n_samples_new,) | Target array with corresponding labels after under-sampling

Attributes

Name | Type | Description
sampling_strategy_ | dict | Dictionary mapping class labels to the number of samples to select for each class
nn_ | estimator object | Validated K-nearest Neighbours estimator created from the n_neighbors parameter
nn_ver3_ | estimator object | Validated K-nearest Neighbours estimator created from the n_neighbors_ver3 parameter (only set when version=3)
sample_indices_ | ndarray of shape (n_new_samples,) | Indices of the samples selected from the original dataset
n_features_in_ | int | Number of features in the input dataset
feature_names_in_ | ndarray of shape (n_features_in_,) | Names of features seen during fit (only when X has string feature names)

Usage Examples

Basic Under-Sampling with NearMiss-1

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import NearMiss

# 1. Create an imbalanced dataset
X, y = make_classification(
    n_classes=2, class_sep=2, weights=[0.1, 0.9],
    n_informative=3, n_redundant=1, flip_y=0,
    n_features=20, n_clusters_per_class=1,
    n_samples=1000, random_state=10,
)
print(f"Original: {Counter(y)}")
# Original: Counter({1: 900, 0: 100})

# 2. Apply NearMiss-1 (default)
nm = NearMiss()
X_resampled, y_resampled = nm.fit_resample(X, y)
print(f"Resampled: {Counter(y_resampled)}")
# Resampled: Counter({0: 100, 1: 100})

Selecting a NearMiss Version

from imblearn.under_sampling import NearMiss

# Reuses X, y from the basic example above.
# NearMiss-2: keeps majority samples with the smallest average distance
# to their n_neighbors farthest minority samples
nm2 = NearMiss(version=2, n_neighbors=5)
X_res2, y_res2 = nm2.fit_resample(X, y)

# NearMiss-3: two-phase selection with candidate pre-filtering
nm3 = NearMiss(version=3, n_neighbors=3, n_neighbors_ver3=3)
X_res3, y_res3 = nm3.fit_resample(X, y)

Inside a Pipeline

from imblearn.pipeline import make_pipeline
from imblearn.under_sampling import NearMiss
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_validate

# Build pipeline with NearMiss + classifier (reuses X, y from the basic example above)
pipeline = make_pipeline(NearMiss(version=1), LinearSVC())

# Cross-validate (NearMiss applied only to training folds)
scores = cross_validate(pipeline, X, y, scoring="balanced_accuracy", cv=5)
print(f"Mean balanced accuracy: {scores['test_score'].mean():.3f}")

Custom Sampling Strategy

from imblearn.under_sampling import NearMiss

# Specify the exact number of samples to retain per majority class
# (reuses X, y from the basic example above)
nm = NearMiss(
    sampling_strategy={1: 200},  # Retain 200 majority class samples
    version=1,
    n_neighbors=5,
)
X_res, y_res = nm.fit_resample(X, y)

Accessing Selected Sample Indices

from imblearn.under_sampling import NearMiss

# Reuses X, y from the basic example above.
nm = NearMiss(version=1)
X_res, y_res = nm.fit_resample(X, y)

# Retrieve indices of samples retained from the original dataset
selected_indices = nm.sample_indices_
print(f"Number of selected samples: {len(selected_indices)}")

Related Pages

Implements Principle

Requires Environment

Uses Heuristic
