Principle:Scikit learn contrib Imbalanced learn Under Sampling Base Abstraction

Revision as of 17:38, 16 February 2026 by Admin (talk | contribs) (Auto-imported from principles/Scikit_learn_contrib_Imbalanced_learn_Under_Sampling_Base_Abstraction.md)

Principle: Under-Sampling Base Abstraction

The Under-Sampling Base Abstraction defines a design pattern for organizing under-sampling algorithms into two distinct categories based on their resampling semantics. This abstraction ensures a consistent API across all under-sampling implementations in the imbalanced-learn library while accurately reflecting the fundamental difference between controlled resampling and cleaning-based resampling.

Motivation

Under-sampling algorithms aim to address class imbalance by reducing the number of samples in the majority class (or classes). However, not all under-sampling algorithms work the same way:

  • Some algorithms reduce the majority class to a specific target count (e.g., "make the majority class the same size as the minority class").
  • Other algorithms remove noisy or borderline samples without guaranteeing any particular final count (e.g., "remove all samples that are misclassified by their nearest neighbors").

These two behaviors require different parameter interfaces. An algorithm that targets a specific count needs to accept a target ratio or count, while a cleaning algorithm only needs to know which classes to clean.

The Two Categories

Controlled Under-Samplers (BaseUnderSampler)

These algorithms reduce the majority class to a user-specified target. The sampling_strategy parameter accepts:

Type | Semantics
float | Desired ratio of minority to majority after resampling (alpha_us = N_minority / N_resampled_majority). Must lie in the interval (0, 1]. Only valid for binary classification.
str | Predefined strategy: 'majority', 'not minority', 'not majority', 'all', or 'auto'. For under-sampling, 'auto' is equivalent to 'not minority'.
dict | Explicit mapping from class labels to desired sample counts.
callable | A function that takes y and returns a dict of class-to-count mappings.

Examples: RandomUnderSampler, NearMiss, InstanceHardnessThreshold.

The internal _sampling_type is set to "under-sampling".
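
The four accepted forms of sampling_strategy all resolve to the same thing: a mapping from targeted class to desired count. The sketch below illustrates that resolution in plain Python. The helper name resolve_under_sampling_targets is hypothetical and the logic is a simplified reading of the semantics in the table above, not imbalanced-learn's actual implementation.

```python
from collections import Counter


def resolve_under_sampling_targets(y, sampling_strategy="auto"):
    """Return {class_label: desired_count} for the classes to shrink.

    Hypothetical helper for illustration; not imbalanced-learn's code.
    """
    counts = Counter(y)
    minority = min(counts, key=counts.get)
    majority = max(counts, key=counts.get)
    n_minority = counts[minority]

    if callable(sampling_strategy):
        # callable: receives y and returns the class-to-count dict itself
        return dict(sampling_strategy(y))

    if isinstance(sampling_strategy, dict):
        # dict: explicit counts; under-sampling can only shrink a class
        for label, n in sampling_strategy.items():
            if n > counts[label]:
                raise ValueError(f"cannot under-sample {label!r} up to {n}")
        return dict(sampling_strategy)

    if isinstance(sampling_strategy, float):
        # float: alpha_us = N_minority / N_resampled_majority, binary only
        if len(counts) != 2:
            raise ValueError("a float ratio needs exactly two classes")
        if not 0 < sampling_strategy <= 1:
            raise ValueError("ratio must lie in (0, 1]")
        return {majority: int(n_minority / sampling_strategy)}

    # str: choose which classes are cut down to the minority count
    targeted = {
        "majority": [majority],
        "not minority": [c for c in counts if c != minority],
        "not majority": [c for c in counts if c != majority],
        "all": list(counts),
        "auto": [c for c in counts if c != minority],
    }[sampling_strategy]
    return {c: n_minority for c in targeted}


y = ["neg"] * 1000 + ["pos"] * 100
targets_half = resolve_under_sampling_targets(y, 0.5)     # {"neg": 200}
targets_auto = resolve_under_sampling_targets(y, "auto")  # {"neg": 100}
```

With 1000 majority and 100 minority samples, a ratio of 0.5 asks for 100 / 0.5 = 200 majority samples, while 'auto' asks for parity with the minority class.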

Cleaning Samplers (BaseCleaningSampler)

These algorithms remove samples based on quality criteria (e.g., proximity to the decision boundary, inconsistency with neighbors) without guaranteeing a specific final class distribution. The sampling_strategy parameter accepts:

Type | Semantics
str | Predefined strategy: 'majority', 'not minority', 'not majority', 'all', or 'auto'. Unlike BaseUnderSampler, the resulting class sizes will not be equalized.
list | An explicit list of classes to target for cleaning.
callable | A function that takes y and returns a dict of class-to-count mappings.

Examples: TomekLinks, EditedNearestNeighbours, NeighbourhoodCleaningRule.

The internal _sampling_type is set to "clean-sampling".
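
To make the "no guaranteed count" behavior concrete, here is a minimal one-dimensional sketch of a Tomek-link style cleaning rule: two samples form a link when they are each other's nearest neighbour and carry different labels, and only the member belonging to a targeted class is dropped. The function names and toy data are assumptions made for illustration. This is not how imbalanced-learn implements TomekLinks.

```python
from collections import Counter


def nearest_neighbour(i, points):
    """Index of the point closest to points[i], excluding i itself."""
    return min(
        (j for j in range(len(points)) if j != i),
        key=lambda j: abs(points[j] - points[i]),
    )


def tomek_link_clean(points, labels, classes_to_clean):
    """Drop samples of `classes_to_clean` that take part in a Tomek link."""
    nn = [nearest_neighbour(i, points) for i in range(len(points))]
    to_drop = set()
    for i, j in enumerate(nn):
        # Tomek link: mutual nearest neighbours carrying different labels
        is_link = nn[j] == i and labels[i] != labels[j]
        if is_link and labels[i] in classes_to_clean:
            to_drop.add(i)
    keep = [i for i in range(len(points)) if i not in to_drop]
    return [points[i] for i in keep], [labels[i] for i in keep]


points = [0.0, 0.4, 1.0, 1.1, 2.0, 3.0, 3.4]
labels = ["maj", "maj", "maj", "min", "min", "maj", "maj"]

clean_points, clean_labels = tomek_link_clean(points, labels, {"maj"})
clean_counts = Counter(clean_labels)
```

Only the majority sample at 1.0 is removed, because it is the sole majority member of a link (with the minority sample at 1.1). The classes end up at 4 and 2 samples: the caller chose which class to clean, but the data decided how many samples went away.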

Design Benefits

  1. Consistent API: Both base classes inherit from BaseSampler, ensuring all under-sampling algorithms share the same fit_resample(X, y) interface.
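  2. Appropriate parameter validation: Each base class defines _parameter_constraints for its sampling_strategy, accepting a Mapping for controlled samplers and a list for cleaning samplers. A float ratio is semantically meaningless for a cleaning sampler; note, however, that the BaseCleaningSampler constraints listed under Abstraction Structure still include the float Interval, so that rejection presumably happens in later sampling-strategy validation rather than in the constraint list itself.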
  3. Self-documenting: Each base class carries a _sampling_strategy_docstring that is injected into subclass docstrings, ensuring accurate and consistent documentation.
  4. Clear categorization: The _sampling_type attribute makes it easy to programmatically distinguish between the two algorithm families (e.g., for pipeline validation or meta-learning).
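
Benefits 3 and 4 can be sketched with stand-in classes. Everything below is hypothetical: the classes only mimic the two attributes named above, and inject_docstring is an illustrative decorator, not the mechanism imbalanced-learn uses to fill in its docstrings.

```python
class BaseSampler:
    """Stand-in for the shared base class (hypothetical, not imblearn's)."""

    _sampling_type = None
    _sampling_strategy_docstring = ""


class BaseUnderSampler(BaseSampler):
    _sampling_type = "under-sampling"
    _sampling_strategy_docstring = "sampling_strategy : float, str, dict or callable"


class BaseCleaningSampler(BaseSampler):
    _sampling_type = "clean-sampling"
    _sampling_strategy_docstring = "sampling_strategy : str, list or callable"


def inject_docstring(cls):
    """Fill the {sampling_strategy} placeholder from the base-class attribute."""
    cls.__doc__ = cls.__doc__.format(
        sampling_strategy=cls._sampling_strategy_docstring
    )
    return cls


@inject_docstring
class ToyRandomUnderSampler(BaseUnderSampler):
    """Toy controlled sampler.

    Parameters
    ----------
    {sampling_strategy}
    """


@inject_docstring
class ToyTomekLinks(BaseCleaningSampler):
    """Toy cleaning sampler.

    Parameters
    ----------
    {sampling_strategy}
    """


def guarantees_target_counts(sampler):
    """True for controlled samplers, False for cleaning samplers."""
    return type(sampler)._sampling_type == "under-sampling"
```

Each subclass picks up the parameter description that matches its family, and downstream code can branch on _sampling_type without knowing the concrete class.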

Abstraction Structure

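# imblearn/under_sampling/base.py (excerpt)

import numbers
from collections.abc import Mapping

# Constraint helpers from scikit-learn; the exact import path may vary by version.
from sklearn.utils._param_validation import Interval, StrOptions

from ..base import BaseSampler
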
class BaseUnderSampler(BaseSampler):
    _sampling_type = "under-sampling"
    # Accepts: float, str, dict, callable
    _parameter_constraints: dict = {
        "sampling_strategy": [
            Interval(numbers.Real, 0, 1, closed="right"),
            StrOptions({"auto", "majority", "not minority", "not majority", "all"}),
            Mapping,
            callable,
        ],
    }

class BaseCleaningSampler(BaseSampler):
    _sampling_type = "clean-sampling"
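    # Documented types: str, list, callable. The float Interval is still listed below,
    # so a float is not rejected by this constraint list alone.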
    _parameter_constraints: dict = {
        "sampling_strategy": [
            Interval(numbers.Real, 0, 1, closed="right"),
            StrOptions({"auto", "majority", "not minority", "not majority", "all"}),
            list,
            callable,
        ],
    }
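
As a reading aid for the BaseUnderSampler constraint list, the following stdlib-only sketch spells out what each entry admits: a real number in the half-open interval (0, 1], one of five strings, any mapping, or any callable. The function is_valid_under_sampling_strategy is a hypothetical stand-in that mirrors the meaning of the constraints. It does not call scikit-learn's validation machinery.

```python
import numbers
from collections.abc import Mapping

STR_OPTIONS = {"auto", "majority", "not minority", "not majority", "all"}


def is_valid_under_sampling_strategy(value):
    """Check a value against the four constraint kinds listed above (sketch)."""
    if isinstance(value, str):
        # StrOptions: only the predefined strategy names are allowed
        return value in STR_OPTIONS
    if isinstance(value, numbers.Real):
        # Interval(Real, 0, 1, closed="right"): 0 excluded, 1 included
        return 0 < value <= 1
    # Mapping (e.g. dict) or callable
    return isinstance(value, Mapping) or callable(value)


checks = {
    "half": is_valid_under_sampling_strategy(0.5),
    "one": is_valid_under_sampling_strategy(1.0),
    "zero": is_valid_under_sampling_strategy(0.0),
    "auto": is_valid_under_sampling_strategy("auto"),
    "dict": is_valid_under_sampling_strategy({"neg": 100}),
    "list": is_valid_under_sampling_strategy(["neg"]),
}
```

The right-closed interval is what lets a ratio of exactly 1.0 (full balance) through while rejecting 0.0, and a plain list fails here because only the cleaning base class lists it.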

Key Distinction

The fundamental distinction is one of guarantees:

  • BaseUnderSampler guarantees the output class distribution will match the specified target (within the constraints of the algorithm).
  • BaseCleaningSampler makes no guarantee about the final class distribution. The number of removed samples depends on the data characteristics and the cleaning criterion, not on a user-specified target.
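
The guarantee on the controlled side can be illustrated with a plain-Python random under-sampler: whatever the data looks like, each targeted class comes out at exactly the requested count. The function random_under_sample and its seed parameter are assumptions made for this sketch. It is not RandomUnderSampler's implementation.

```python
import random
from collections import Counter


def random_under_sample(labels, targets, seed=0):
    """Return sorted indices to keep so each targeted class hits its count."""
    rng = random.Random(seed)
    keep = []
    for cls in sorted(set(labels)):
        indices = [i for i, label in enumerate(labels) if label == cls]
        # Classes without a target are kept whole
        n_keep = targets.get(cls, len(indices))
        keep.extend(rng.sample(indices, n_keep))
    return sorted(keep)


labels = ["neg"] * 1000 + ["pos"] * 100
kept = random_under_sample(labels, {"neg": 100})
resampled_counts = Counter(labels[i] for i in kept)
```

Here resampled_counts holds 100 samples for each class regardless of the seed. A cleaning rule applied to the same labels would instead remove however many samples its criterion flags.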
