Principle: Under-Sampling Base Abstraction
The Under-Sampling Base Abstraction defines a design pattern for organizing under-sampling algorithms into two distinct categories based on their resampling semantics. This abstraction ensures a consistent API across all under-sampling implementations in the imbalanced-learn library while accurately reflecting the fundamental difference between controlled resampling and cleaning-based resampling.
Motivation
Under-sampling algorithms aim to address class imbalance by reducing the number of samples in the majority class (or classes). However, not all under-sampling algorithms work the same way:
- Some algorithms reduce the majority class to a specific target count (e.g., "make the majority class the same size as the minority class").
- Other algorithms remove noisy or borderline samples without guaranteeing any particular final count (e.g., "remove all samples that are misclassified by their nearest neighbors").
These two behaviors require different parameter interfaces. An algorithm that targets a specific count needs to accept a target ratio or count, while a cleaning algorithm only needs to know which classes to clean.
The Two Categories
Controlled Under-Samplers (BaseUnderSampler)
These algorithms reduce the majority class to a user-specified target. The sampling_strategy parameter accepts:
| Type | Semantics |
|---|---|
| float | Desired ratio of minority to majority after resampling (alpha_us = N_minority / N_resampled_majority). Only valid for binary classification. |
| str | Predefined strategy: 'majority', 'not minority', 'not majority', 'all', or 'auto'. |
| dict | Explicit mapping from class labels to desired sample counts. |
| callable | A function that takes y and returns a dict of class-to-count mappings. |
Examples: RandomUnderSampler, NearMiss, InstanceHardnessThreshold.
The internal _sampling_type is set to "under-sampling".
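To make the table above concrete, here is a minimal sketch of how each sampling_strategy type could be resolved into per-class target counts. The function name resolve_under_sampling_targets is hypothetical, and the logic is a simplified illustration rather than imbalanced-learn's internal code. Two assumptions are made: 'auto' is treated as 'not minority' (which is how the library documents it for under-sampling), and the float case rounds the majority target down.

```python
from collections import Counter
from collections.abc import Mapping
from numbers import Real


def resolve_under_sampling_targets(y, sampling_strategy="auto"):
    """Return {class_label: target_count} for the classes to be reduced.

    Hypothetical helper that mirrors the table above; not library code.
    """
    counts = Counter(y)
    minority = min(counts, key=counts.get)
    majority = max(counts, key=counts.get)
    n_minority = counts[minority]

    if callable(sampling_strategy):
        # callable: takes y and returns a class-to-count dict
        return dict(sampling_strategy(y))
    if isinstance(sampling_strategy, Mapping):
        # dict: explicit target counts, used as given
        return dict(sampling_strategy)
    if isinstance(sampling_strategy, str):
        # str: every targeted class is reduced to the minority class size
        targeted = {
            "majority": [majority],
            "not minority": [c for c in counts if c != minority],
            "auto": [c for c in counts if c != minority],
            "not majority": [c for c in counts if c != majority],
            "all": list(counts),
        }[sampling_strategy]
        return {c: n_minority for c in targeted}
    if isinstance(sampling_strategy, Real):
        # float: alpha_us = N_minority / N_resampled_majority, binary only
        if len(counts) != 2:
            raise ValueError("a float ratio is only valid for binary problems")
        # Assumption of this sketch: round the majority target down.
        return {majority: int(n_minority / sampling_strategy)}
    raise TypeError(f"unsupported sampling_strategy: {sampling_strategy!r}")


y = ["a"] * 90 + ["b"] * 10
print(resolve_under_sampling_targets(y, 0.5))        # float ratio
print(resolve_under_sampling_targets(y, "auto"))     # predefined strategy
print(resolve_under_sampling_targets(y, {"a": 30}))  # explicit count
```

In every branch the result is a set of counts, which is what distinguishes a controlled under-sampler: the caller can tell in advance how many samples each targeted class will keep.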
Cleaning Samplers (BaseCleaningSampler)
These algorithms remove samples based on quality criteria (e.g., proximity to the decision boundary, inconsistency with neighbors) without guaranteeing a specific final class distribution. The sampling_strategy parameter accepts:
| Type | Semantics |
|---|---|
| str | Predefined strategy: 'majority', 'not minority', 'not majority', 'all', or 'auto'. Unlike BaseUnderSampler, the resulting class sizes will not be equalized. |
| list | An explicit list of classes to target for cleaning. |
| callable | A function that takes y and returns a dict of class-to-count mappings. |
Examples: TomekLinks, EditedNearestNeighbours, NeighbourhoodCleaningRule.
The internal _sampling_type is set to "clean-sampling".
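The following sketch illustrates the cleaning semantics with a deliberately tiny example: the strategy resolves to a list of classes rather than counts, and the number of removed samples falls out of the data. Both helpers (classes_to_clean and toy_clean) are hypothetical, and the 1-D nearest-neighbour rule is a stand-in for illustration only, not the criterion used by any sampler in the library.

```python
from collections import Counter


def classes_to_clean(y, sampling_strategy="auto"):
    """Resolve sampling_strategy to a list of class labels (no counts).

    Hypothetical helper for illustration; not library code.
    """
    counts = Counter(y)
    minority = min(counts, key=counts.get)
    majority = max(counts, key=counts.get)
    if isinstance(sampling_strategy, list):
        return list(sampling_strategy)
    return {
        "majority": [majority],
        "not minority": [c for c in counts if c != minority],
        "auto": [c for c in counts if c != minority],
        "not majority": [c for c in counts if c != majority],
        "all": list(counts),
    }[sampling_strategy]


def toy_clean(x, y, sampling_strategy="auto"):
    """Drop targeted samples whose nearest neighbour has a different label."""
    targeted = set(classes_to_clean(y, sampling_strategy))
    keep = []
    for i, (xi, yi) in enumerate(zip(x, y)):
        others = [j for j in range(len(x)) if j != i]
        nearest = min(others, key=lambda j: abs(x[j] - xi))
        if yi in targeted and y[nearest] != yi:
            continue  # removed by the cleaning criterion
        keep.append(i)
    return [x[i] for i in keep], [y[i] for i in keep]


# Five "a" samples (one of them sitting among the "b" samples) and three "b".
x = [0, 1, 2, 3, 11, 10, 12, 14]
y = ["a", "a", "a", "a", "a", "b", "b", "b"]
x_res, y_res = toy_clean(x, y)
print(Counter(y_res))  # only the stray "a" is dropped; classes stay unequal
```

Note that nothing in the call specifies how many samples to keep. With different data the same call could remove many samples or none at all.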
Design Benefits
- Consistent API: Both base classes inherit from BaseSampler, ensuring all under-sampling algorithms share the same fit_resample(X, y) interface.
- Appropriate parameter validation: Each base class defines _parameter_constraints for its sampling_strategy, so that unsupported values are rejected with a clear error. A float ratio, for instance, is semantically meaningless for a cleaning sampler and is not among its documented types.
- Self-documenting: Each base class carries a _sampling_strategy_docstring that is injected into subclass docstrings, ensuring accurate and consistent documentation.
- Clear categorization: The _sampling_type attribute makes it easy to programmatically distinguish between the two algorithm families (e.g., for pipeline validation or meta-learning).
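The pattern behind these benefits can be sketched with a few toy classes. Everything prefixed with Toy, as well as the inject_sampling_strategy_doc decorator and the docstring texts, is hypothetical and exists only for illustration; the library's own classes and docstring machinery are more elaborate.

```python
class ToyBaseSampler:
    """Stand-in for BaseSampler: owns the public fit_resample entry point."""

    _sampling_type = None
    _sampling_strategy_docstring = ""

    def __init__(self, sampling_strategy="auto"):
        self.sampling_strategy = sampling_strategy

    def fit_resample(self, X, y):
        # Shared interface; subclasses only implement _fit_resample.
        return self._fit_resample(X, y)


class ToyBaseUnderSampler(ToyBaseSampler):
    _sampling_type = "under-sampling"
    _sampling_strategy_docstring = "sampling_strategy : float, str, dict or callable"


class ToyBaseCleaningSampler(ToyBaseSampler):
    _sampling_type = "clean-sampling"
    _sampling_strategy_docstring = "sampling_strategy : str, list or callable"


def inject_sampling_strategy_doc(cls):
    """Hypothetical decorator: fill the {sampling_strategy} placeholder."""
    cls.__doc__ = cls.__doc__.replace(
        "{sampling_strategy}", cls._sampling_strategy_docstring
    )
    return cls


@inject_sampling_strategy_doc
class ToyRandomDropper(ToyBaseUnderSampler):
    """Toy controlled sampler.

    Parameters
    ----------
    {sampling_strategy}
    """

    def _fit_resample(self, X, y):
        return X, y  # placeholder: a real sampler would drop samples here


@inject_sampling_strategy_doc
class ToyNoiseFilter(ToyBaseCleaningSampler):
    """Toy cleaning sampler.

    Parameters
    ----------
    {sampling_strategy}
    """

    def _fit_resample(self, X, y):
        return X, y  # placeholder: a real sampler would drop samples here


# The class attribute is enough to tell the two families apart.
for sampler in (ToyRandomDropper(), ToyNoiseFilter()):
    print(type(sampler).__name__, "->", sampler._sampling_type)
```

Because the family-specific pieces live on the two intermediate base classes, a concrete sampler only has to pick the right parent and implement its resampling step.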
Abstraction Structure
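```python
# imblearn/under_sampling/base.py (excerpt)
import numbers
from collections.abc import Mapping

from ..base import BaseSampler

# Interval and StrOptions are parameter-validation helpers; their import
# path depends on the library version, so it is not shown here.


class BaseUnderSampler(BaseSampler):
    _sampling_type = "under-sampling"

    # Accepts: float, str, dict, callable
    _parameter_constraints: dict = {
        "sampling_strategy": [
            Interval(numbers.Real, 0, 1, closed="right"),
            StrOptions({"auto", "majority", "not minority", "not majority", "all"}),
            Mapping,
            callable,
        ],
    }


class BaseCleaningSampler(BaseSampler):
    _sampling_type = "clean-sampling"

    # Documented types: str, list, callable. A float ratio has no meaning
    # for cleaning, even though a Real interval appears in the list below.
    _parameter_constraints: dict = {
        "sampling_strategy": [
            Interval(numbers.Real, 0, 1, closed="right"),
            StrOptions({"auto", "majority", "not minority", "not majority", "all"}),
            list,
            callable,
        ],
    }
```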
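A constraint list of this shape can be read as "the value is valid if at least one entry accepts it". The sketch below re-creates that idea with the standard library only. ToyInterval, ToyStrOptions and satisfies_any are hypothetical stand-ins written for this illustration; they are not the validation helpers the library actually uses, and they cover only the cases that appear in the structure above.

```python
from collections.abc import Mapping
from dataclasses import dataclass
from numbers import Real


@dataclass(frozen=True)
class ToyInterval:
    """Real numbers in (low, high]; mimics closed="right" only."""

    low: float
    high: float

    def is_satisfied_by(self, value):
        return isinstance(value, Real) and self.low < value <= self.high


@dataclass(frozen=True)
class ToyStrOptions:
    options: frozenset

    def is_satisfied_by(self, value):
        return isinstance(value, str) and value in self.options


def satisfies_any(value, constraints):
    """A value is valid if at least one constraint in the list accepts it."""
    for constraint in constraints:
        if constraint is callable:
            if callable(value):
                return True
        elif isinstance(constraint, type):
            if isinstance(value, constraint):
                return True
        elif constraint.is_satisfied_by(value):
            return True
    return False


under_sampling_constraints = [
    ToyInterval(0, 1),
    ToyStrOptions(frozenset({"auto", "majority", "not minority", "not majority", "all"})),
    Mapping,
    callable,
]

for candidate in (0.5, 0, 1.5, "auto", "bogus", {"a": 30}, [0, 1]):
    print(repr(candidate), satisfies_any(candidate, under_sampling_constraints))
```

The half-open interval is why a ratio of 1 is accepted while 0 is not: a ratio of 0 would imply an infinitely large majority class.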
Key Distinction
The fundamental distinction is one of guarantees:
- BaseUnderSampler guarantees the output class distribution will match the specified target (within the constraints of the algorithm).
- BaseCleaningSampler makes no guarantee about the final class distribution. The number of removed samples depends on the data characteristics and the cleaning criterion, not on a user-specified target.