Implementation:Scikit learn contrib Imbalanced learn InstanceHardnessCV

Implementation: InstanceHardnessCV

InstanceHardnessCV is a cross-validation splitter in the imbalanced-learn library that distributes hard-to-classify samples evenly across folds. It estimates instance hardness from cross-validated predicted probabilities, sorts the samples lexicographically by class and then by hardness, and assigns them to folds round-robin in that order.

Overview

Property	Value
Class	`InstanceHardnessCV(BaseCrossValidator)`
Source	`imblearn/model_selection/_split.py` (lines 1-121)
Import	`from imblearn.model_selection import InstanceHardnessCV`

Purpose

Standard cross-validation splitters (e.g., StratifiedKFold) distribute samples across folds while preserving class proportions, but they do not account for the difficulty of individual samples. As a result, one fold may contain many hard-to-classify samples while another contains mostly easy ones, which leads to high variance in per-fold scores.

InstanceHardnessCV solves this by estimating how hard each sample is to classify (via cross-validated predict_proba), then assigning samples to folds so that each fold contains a similar proportion of easy and hard samples.
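
To make the motivation concrete, the following sketch measures how unevenly hard samples can land in StratifiedKFold folds. It uses only scikit-learn and NumPy. The hardness proxy (one minus the cross-validated probability of the true class) and the names `hardness` and `fold_hardness` are assumptions made for illustration, not the library's definition.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict

X, y = make_classification(
    weights=[0.9, 0.1], class_sep=2,
    n_informative=3, n_redundant=1,
    flip_y=0.05, n_samples=1000, random_state=10,
)

# Out-of-fold class probabilities for every sample.
proba = cross_val_predict(LogisticRegression(), X, y, cv=5, method="predict_proba")

# Illustrative hardness proxy (an assumption): 1 - P(true class).
# Labels are 0/1 here, so the label doubles as the column index.
hardness = 1.0 - proba[np.arange(len(y)), y]

# Mean hardness of the test samples in each stratified fold.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_hardness = np.array([hardness[test].mean() for _, test in skf.split(X, y)])

print("Mean hardness per fold:", np.round(fold_hardness, 3))
print("Spread across folds:", fold_hardness.std())
```

A large spread relative to the mean suggests that some folds are noticeably harder than others, which is the situation this splitter is meant to reduce.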

Parameters

Parameter	Type	Default	Description
`estimator`	estimator object	(required)	Classifier used to estimate instance hardness. Must implement `predict_proba`.
`n_splits`	int	`5`	Number of folds. Must be at least 2.
`pos_label`	int, float, bool, or str	`None`	The class considered the positive class when selecting the probability representing instance hardness. If `None`, defaults to `estimator.classes_[1]`.

Methods

split(X, y, groups=None)

Generates indices to split data into training and test sets. The groups parameter is ignored (a warning is emitted if provided).

from imblearn.model_selection import InstanceHardnessCV
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.linear_model import LogisticRegression

X, y = make_classification(
    weights=[0.9, 0.1], class_sep=2,
    n_informative=3, n_redundant=1,
    flip_y=0.05, n_samples=1000, random_state=10
)

estimator = LogisticRegression()
ih_cv = InstanceHardnessCV(estimator, n_splits=5)
cv_result = cross_validate(estimator, X, y, cv=ih_cv)
print(f"Std of test scores: {cv_result['test_score'].std():.3f}")

get_n_splits(X=None, y=None, groups=None)

Returns the number of splitting iterations. All parameters are ignored; the method simply returns self.n_splits.

Implementation Details

The split method follows this algorithm:

1. Validate target: Checks that y is binary classification. Raises ValueError if not.
2. Determine positive label: If pos_label is None, defaults to index 1 among sorted unique classes. Otherwise, looks up the index of the specified class.
3. Estimate instance hardness: Uses sklearn.model_selection.cross_val_predict with a clone of the estimator and stratified CV to obtain predict_proba outputs for all samples.
4. Sort by (class, hardness): Applies np.lexsort on (y_proba[:, pos_label], y), sorting first by class label, then by predicted probability of the positive class.
5. Assign fold indices round-robin: Creates a groups array where groups[sorted_indices] = np.arange(n_samples) % n_splits. This distributes sorted samples evenly across folds.
6. Yield splits: Delegates to LeaveOneGroupOut().split(X, y, groups) to generate the actual train/test index arrays.
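
The final step relies on scikit-learn's LeaveOneGroupOut, which turns each distinct group id into one test set. A minimal sketch of that behaviour on its own, with a hand-written `groups` array standing in for the fold assignment (the data and group values are made up for illustration):

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

X = np.arange(12).reshape(6, 2)
y = np.array([0, 0, 0, 1, 1, 1])
# Hypothetical fold assignment: samples 0 and 3 go to fold 0, and so on.
groups = np.array([0, 1, 2, 0, 1, 2])

# One split per distinct group id; the held-out group is the test set.
splits = list(LeaveOneGroupOut().split(X, y, groups))
for train_idx, test_idx in splits:
    print("train:", train_idx, "test:", test_idx)
```

Because every sample belongs to exactly one group, each sample appears in exactly one test set, which is the property a k-fold style splitter needs.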

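# Core logic inside split():
# Out-of-fold probabilities act as the per-sample hardness estimate.
y_proba = cross_val_predict(
    clone(self.estimator), X, y, cv=self.n_splits, method="predict_proba"
)
# pos_label is the column index resolved in the previous step.
# np.lexsort uses the LAST key as the primary key: class first, then probability.
sorted_indices = np.lexsort((y_proba[:, pos_label], y))
# Assign fold ids 0, 1, ..., n_splits - 1 round-robin along the sorted order.
groups = np.empty(_num_samples(X), dtype=int)
groups[sorted_indices] = np.arange(_num_samples(X)) % self.n_splits
# Each distinct fold id becomes one test set.
cv = LeaveOneGroupOut()
yield from cv.split(X, y, groups)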
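
The sort-and-deal step is easier to follow on a toy input. The sketch below reproduces only that step with NumPy; the labels and the probabilities in `proba_pos` are invented for illustration and do not come from a fitted model.

```python
import numpy as np

n_splits = 3

# Toy data: 9 negatives and 3 positives with hypothetical
# positive-class probabilities.
y = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1])
proba_pos = np.array([0.05, 0.40, 0.10, 0.30, 0.02, 0.45,
                      0.20, 0.15, 0.35, 0.90, 0.55, 0.70])

# np.lexsort treats the last key as primary: sort by class, then probability.
sorted_indices = np.lexsort((proba_pos, y))

# Deal the sorted samples to folds like cards: 0, 1, 2, 0, 1, 2, ...
groups = np.empty(len(y), dtype=int)
groups[sorted_indices] = np.arange(len(y)) % n_splits

for fold in range(n_splits):
    members = np.flatnonzero(groups == fold)
    print(fold, "labels:", y[members], "proba:", proba_pos[members])
```

Neighbouring samples in the sorted order have similar probabilities, so dealing them out one at a time gives every fold a comparable mix of low and high values within each class.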

Important Notes

Binary classification only: The splitter raises a ValueError if y is not binary.
Estimator requirement: The provided estimator must support predict_proba.
Internal cross-validation: The method internally runs a full stratified cross-validation to estimate instance hardness, which adds computational overhead.
Groups parameter ignored: If a groups argument is passed to split, a UserWarning is emitted and the argument is discarded.
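
The binary-only check and the positive-label lookup described above can be sketched with `sklearn.utils.multiclass.type_of_target` and NumPy. The helper name `resolve_pos_label_index` and its error messages are hypothetical and are not part of imbalanced-learn.

```python
import numpy as np
from sklearn.utils.multiclass import type_of_target


def resolve_pos_label_index(y, pos_label=None):
    """Hypothetical helper: return the predict_proba column of the positive class."""
    if type_of_target(y) != "binary":
        raise ValueError("Only binary targets are supported.")
    # np.unique returns sorted classes, the same order as estimator.classes_.
    classes = np.unique(y)
    if pos_label is None:
        return 1  # default: second class in sorted order
    matches = np.flatnonzero(classes == pos_label)
    if matches.size == 0:
        raise ValueError(f"pos_label={pos_label!r} is not present in y.")
    return int(matches[0])


print(resolve_pos_label_index(np.array(["ham", "spam", "ham"]), pos_label="ham"))
print(resolve_pos_label_index(np.array([0, 1, 1, 0])))
```

Passing a target with three or more classes to this helper raises ValueError, mirroring the restriction noted above.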
