Implementation:Scikit learn contrib Imbalanced learn InstanceHardnessThreshold
| Knowledge Sources | |
|---|---|
| Domains | Machine_Learning, Data_Preprocessing, Imbalanced_Learning |
| Last Updated | 2026-02-09 03:00 GMT |
Overview
Concrete tool for under-sampling based on instance hardness thresholding provided by the imbalanced-learn library.
Description
The InstanceHardnessThreshold class implements an under-sampling strategy that removes samples that are hard to classify. It extends BaseUnderSampler and uses cross-validated predictions from a classifier to estimate instance hardness, measured here as the predicted probability of each sample's true class. For every class selected by sampling_strategy (the majority classes by default), a percentile threshold is computed from these probabilities, and samples whose probability falls below the threshold are removed. What remains are the samples the classifier can classify reliably.
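The thresholding idea described above can be sketched with scikit-learn and NumPy alone. This is an illustrative approximation under stated assumptions, not the library's implementation: it assumes binary labels encoded as 0 and 1, uses LogisticRegression as a stand-in classifier, and the names `p_true`, `threshold`, and `keep` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict

X, y = make_classification(n_samples=400, weights=[0.2, 0.8], random_state=0)

# Out-of-fold class probabilities: each sample is scored by a model
# that did not see it during training.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
proba = cross_val_predict(
    LogisticRegression(max_iter=1000), X, y, cv=cv, method="predict_proba"
)

# Probability assigned to the true class of each sample.
# Assumption: labels are 0/1, so the label doubles as the column index.
p_true = proba[np.arange(len(y)), y]

counts = np.bincount(y)
minority, majority = np.argmin(counts), np.argmax(counts)
n_target = counts[minority]  # aim to shrink the majority to this size

# Percentile of the majority-class probabilities below which samples are dropped.
maj_mask = y == majority
pct = (1.0 - n_target / counts[majority]) * 100.0
threshold = np.percentile(p_true[maj_mask], pct)

# Keep every minority sample, plus majority samples at or above the threshold.
keep = ~maj_mask | (p_true >= threshold)
X_res, y_res = X[keep], y[keep]
print(np.bincount(y), "->", np.bincount(y_res))
```

Because samples exactly at the threshold are kept in this sketch, tied probabilities can leave more majority samples than the target count, so the result is not guaranteed to be perfectly balanced.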
Usage
Import this class when you want to under-sample majority classes by removing noisy or ambiguous samples that lie near decision boundaries or in overlapping class regions, rather than randomly discarding majority instances.
Code Reference
Source Location
- Repository: imbalanced-learn
- File: imblearn/under_sampling/_prototype_selection/_instance_hardness_threshold.py
- Lines: L1-209
Signature
class InstanceHardnessThreshold(BaseUnderSampler):
    def __init__(
        self,
        *,
        estimator=None,
        sampling_strategy="auto",
        random_state=None,
        cv=5,
        n_jobs=None,
    ):
        """
        Args:
            estimator: estimator object or None - Classifier used to estimate
                instance hardness. Must implement predict_proba. Defaults to
                RandomForestClassifier(n_estimators=100) when None.
            sampling_strategy: str, dict, or callable - Which classes to
                under-sample and how many samples to keep. 'auto' targets
                every class except the minority.
            random_state: int, RandomState, or None - Seed for reproducibility.
            cv: int - Number of cross-validation folds used to estimate
                instance hardness (default: 5).
            n_jobs: int or None - Number of parallel jobs for cross-validation
                and the default estimator.
        """
Import
from imblearn.under_sampling import InstanceHardnessThreshold
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| X | {array-like, sparse matrix, dataframe} of shape (n_samples, n_features) | Yes | Feature matrix of training data |
| y | array-like of shape (n_samples,) | Yes | Target labels |
| estimator | estimator object or None | No | Classifier with predict_proba (default: RandomForestClassifier) |
| sampling_strategy | str, dict, or callable | No | Which classes to under-sample and how many samples to keep (default: 'auto') |
| random_state | int, RandomState, or None | No | Random seed |
| cv | int | No | Number of cross-validation folds (default: 5) |
| n_jobs | int or None | No | Number of parallel jobs (default: None) |
Outputs
| Name | Type | Description |
|---|---|---|
| X_resampled | {ndarray, sparse matrix, dataframe} of shape (n_samples_new, n_features) | Feature matrix with hard-to-classify majority samples removed |
| y_resampled | ndarray of shape (n_samples_new,) | Target array after under-sampling |
Key Attributes After Fitting
| Attribute | Type | Description |
|---|---|---|
| sampling_strategy_ | dict | Maps class labels to number of samples to retain |
| estimator_ | estimator object | The validated classifier used for hardness estimation |
| sample_indices_ | ndarray of shape (n_new_samples,) | Indices of samples selected from the original dataset |
| n_features_in_ | int | Number of features in the input dataset |
| feature_names_in_ | ndarray of shape (n_features_in_,) | Names of features seen during fit (when X has string feature names) |
Usage Examples
Basic Usage
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import InstanceHardnessThreshold
# Create imbalanced dataset
X, y = make_classification(
    n_classes=2, class_sep=2, weights=[0.1, 0.9],
    n_informative=3, n_redundant=1, flip_y=0,
    n_features=20, n_clusters_per_class=1,
    n_samples=1000, random_state=10,
)
print(f"Original: {Counter(y)}")
# Original: Counter({1: 900, 0: 100})
# Apply InstanceHardnessThreshold
iht = InstanceHardnessThreshold(random_state=42)
X_res, y_res = iht.fit_resample(X, y)
print(f"Resampled: {Counter(y_res)}")
# Resampled: Counter({1: 5xx, 0: 100})
Custom Estimator
from sklearn.linear_model import LogisticRegression
from imblearn.under_sampling import InstanceHardnessThreshold
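# Use logistic regression instead of the default random forest,
# with 10-fold cross-validation for the hardness estimates.
# X and y are the arrays created in the Basic Usage example.
iht = InstanceHardnessThreshold(
    estimator=LogisticRegression(max_iter=1000),
    cv=10,
    random_state=42,
)
X_res, y_res = iht.fit_resample(X, y)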
In a Pipeline
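from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from imblearn.pipeline import make_pipeline
from imblearn.under_sampling import InstanceHardnessThreshold
# Split X, y (from the Basic Usage example) so that only training data is resampled
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)
pipeline = make_pipeline(
    InstanceHardnessThreshold(random_state=42),
    DecisionTreeClassifier(),
)
# The sampler is applied during fit only; predict sends X_test straight to the classifier
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)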