Implementation:Scikit learn contrib Imbalanced learn InstanceHardnessThreshold


Knowledge Sources
Domains: Machine_Learning, Data_Preprocessing, Imbalanced_Learning
Last Updated: 2026-02-09 03:00 GMT

Overview

Concrete tool for under-sampling based on instance hardness thresholding provided by the imbalanced-learn library.

Description

The InstanceHardnessThreshold class implements an under-sampling strategy that removes samples which are hard to classify. It extends BaseUnderSampler and uses cross-validated predictions from a classifier to estimate, for each sample, the probability assigned to its true class; a low probability marks a hard instance. For each class being under-sampled (by default, every class except the minority), samples whose probability falls below a percentile threshold are removed. The percentile is derived from the number of samples to retain, so the sampler keeps the samples that the classifier classifies most reliably.
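The thresholding idea described above can be sketched with scikit-learn alone. This is a simplified illustration, not the library's source: the helper name hardness_undersample, the 50-tree forest, and the toy dataset are all assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_predict


def hardness_undersample(X, y, target_label, n_keep, cv=5, seed=0):
    """Hypothetical helper: keep the n_keep (or more, on ties) most reliably
    classified samples of target_label, plus every sample of the other classes."""
    clf = RandomForestClassifier(n_estimators=50, random_state=seed)
    folds = StratifiedKFold(n_splits=cv, shuffle=True, random_state=seed)
    # Out-of-fold class probabilities: each sample is scored by a model
    # that did not see it during training.
    proba = cross_val_predict(clf, X, y, cv=folds, method="predict_proba")
    # Probability assigned to each sample's true class
    # (probability columns follow the sorted class labels).
    classes = np.unique(y)
    p_true = proba[np.arange(len(y)), np.searchsorted(classes, y)]

    target_idx = np.flatnonzero(y == target_label)
    # Percentile that leaves roughly n_keep samples at or above the threshold.
    drop_pct = (1.0 - n_keep / target_idx.size) * 100.0
    threshold = np.percentile(p_true[target_idx], drop_pct)
    kept_target = target_idx[p_true[target_idx] >= threshold]

    other_idx = np.flatnonzero(y != target_label)
    return np.sort(np.concatenate([other_idx, kept_target])), threshold


# Toy imbalanced data: class 0 is the minority, class 1 the majority.
X, y = make_classification(
    n_samples=600, weights=[0.2, 0.8], n_informative=3, n_redundant=1,
    n_clusters_per_class=1, random_state=0,
)
n_minority = int(np.sum(y == 0))
keep_idx, threshold = hardness_undersample(X, y, target_label=1, n_keep=n_minority)
X_res, y_res = X[keep_idx], y[keep_idx]
print(np.bincount(y), "->", np.bincount(y_res), f"threshold={threshold:.2f}")
```

Because many samples can share the same predicted probability, the number of retained majority samples may exceed n_keep; the percentile only guarantees that at least n_keep remain.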

Usage

Import this class when you want to under-sample majority classes by removing noisy or ambiguous samples that lie near decision boundaries or in overlapping class regions, rather than randomly discarding majority instances.

Code Reference

Source Location

  • Repository: imbalanced-learn
  • File: imblearn/under_sampling/_prototype_selection/_instance_hardness_threshold.py
  • Lines: L1-209

Signature

class InstanceHardnessThreshold(BaseUnderSampler):
    def __init__(
        self,
        *,
        estimator=None,
        sampling_strategy="auto",
        random_state=None,
        cv=5,
        n_jobs=None,
    ):
        """
        Args:
            estimator: estimator object or None - Classifier used to estimate
                instance hardness. Must implement predict_proba. Defaults to
                RandomForestClassifier(n_estimators=100) when None.
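            sampling_strategy: float, str, dict, or callable - Desired class
                distribution after resampling. A float is the ratio of
                minority to majority samples (binary problems only). 'auto'
                under-samples every class except the minority.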
            random_state: int, RandomState, or None - Seed for reproducibility.
            cv: int - Number of cross-validation folds used to estimate
                instance hardness (default: 5).
            n_jobs: int or None - Number of parallel jobs for cross-validation
                and the default estimator.
        """

Import

from imblearn.under_sampling import InstanceHardnessThreshold

I/O Contract

Inputs

Name | Type | Required | Description
X | {array-like, sparse matrix, dataframe} of shape (n_samples, n_features) | Yes | Feature matrix of the training data
y | array-like of shape (n_samples,) | Yes | Target labels
estimator | estimator object or None | No | Classifier with predict_proba (default: RandomForestClassifier)
sampling_strategy | float, str, dict, or callable | No | Resampling strategy (default: 'auto')
random_state | int, RandomState, or None | No | Random seed
cv | int | No | Number of cross-validation folds (default: 5)
n_jobs | int or None | No | Number of parallel jobs (default: None)
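To make the sampling_strategy options more concrete, the sketch below shows how the 'auto', float, and dict forms could translate into per-class target counts for an under-sampler. The helper undersampling_targets is hypothetical and not part of imbalanced-learn; it assumes a float means the minority count divided by the majority count after resampling, and it ignores callables and input validation.

```python
from collections import Counter


def undersampling_targets(y, sampling_strategy="auto"):
    """Hypothetical helper: map each class to under-sample to a target count."""
    counts = Counter(y)
    minority_label = min(counts, key=counts.get)
    n_minority = counts[minority_label]
    others = [label for label in counts if label != minority_label]

    if sampling_strategy == "auto":
        # Reduce every non-minority class towards the minority count.
        return {label: n_minority for label in others}
    if isinstance(sampling_strategy, float):
        # ratio = n_minority / n_majority after resampling (binary case).
        return {label: int(n_minority / sampling_strategy) for label in others}
    if isinstance(sampling_strategy, dict):
        # Explicit per-class targets are used as given.
        return dict(sampling_strategy)
    raise ValueError("unsupported sampling_strategy in this sketch")


y = [0] * 100 + [1] * 900
targets_auto = undersampling_targets(y)        # {1: 100}
targets_half = undersampling_targets(y, 0.5)   # {1: 200}
targets_dict = undersampling_targets(y, {1: 300})
print(targets_auto, targets_half, targets_dict)
```

These counts are targets rather than guarantees: with InstanceHardnessThreshold the number of samples actually retained depends on the predicted probabilities.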

Outputs

Name | Type | Description
X_resampled | {ndarray, sparse matrix, dataframe} of shape (n_samples_new, n_features) | Feature matrix with hard-to-classify majority samples removed
y_resampled | ndarray of shape (n_samples_new,) | Target array after under-sampling

Key Attributes After Fitting

Attribute | Type | Description
sampling_strategy_ | dict | Maps class labels to number of samples to retain
estimator_ | estimator object | The validated classifier used for hardness estimation
sample_indices_ | ndarray of shape (n_new_samples,) | Indices of samples selected from the original dataset
n_features_in_ | int | Number of features in the input dataset
feature_names_in_ | ndarray of shape (n_features_in_,) | Names of features seen during fit (when X has string feature names)

Usage Examples

Basic Usage

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import InstanceHardnessThreshold

# Create imbalanced dataset
X, y = make_classification(
    n_classes=2, class_sep=2, weights=[0.1, 0.9],
    n_informative=3, n_redundant=1, flip_y=0,
    n_features=20, n_clusters_per_class=1,
    n_samples=1000, random_state=10,
)
print(f"Original: {Counter(y)}")
# Original: Counter({1: 900, 0: 100})

# Apply InstanceHardnessThreshold
iht = InstanceHardnessThreshold(random_state=42)
X_res, y_res = iht.fit_resample(X, y)
print(f"Resampled: {Counter(y_res)}")
# Resampled: Counter({1: 5xx, 0: 100})
# (the exact majority count varies; the classes are not guaranteed to end up perfectly balanced)

Custom Estimator

from sklearn.linear_model import LogisticRegression
from imblearn.under_sampling import InstanceHardnessThreshold

# Use logistic regression instead of the default random forest
# (X and y are the arrays created in Basic Usage)
iht = InstanceHardnessThreshold(
    estimator=LogisticRegression(max_iter=1000),
    cv=10,
    random_state=42,
)
X_res, y_res = iht.fit_resample(X, y)

In a Pipeline

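from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from imblearn.pipeline import make_pipeline
from imblearn.under_sampling import InstanceHardnessThreshold

# Imbalanced toy data, split with stratification to preserve the class ratio
X, y = make_classification(n_samples=1000, weights=[0.1, 0.9], random_state=10)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)

# The sampler is applied only during fit; predict sees the test data unchanged
pipeline = make_pipeline(
    InstanceHardnessThreshold(random_state=42),
    DecisionTreeClassifier(random_state=42),
)
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)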
