Implementation:Scikit learn contrib Imbalanced learn InstanceHardnessCV
Implementation: InstanceHardnessCV
InstanceHardnessCV is a cross-validation splitter in the imbalanced-learn library that distributes hard-to-classify samples equally across folds. It estimates instance hardness via cross-validated predicted probabilities, then assigns samples to folds using a lexicographic sort on hardness and class.
Overview
| Property | Value |
|---|---|
| Class | `InstanceHardnessCV(BaseCrossValidator)` |
| Source | `imblearn/model_selection/_split.py` (lines 1-121) |
| Import | `from imblearn.model_selection import InstanceHardnessCV` |
Purpose
Standard cross-validation splitters (e.g., StratifiedKFold) preserve class proportions in each fold, but they do not account for the difficulty of individual samples. As a result, one fold may contain many hard-to-classify samples while another contains mostly easy ones, which leads to high variance in per-fold scores.
InstanceHardnessCV solves this by estimating how hard each sample is to classify (via cross-validated predict_proba), then assigning samples to folds so that each fold contains a similar proportion of easy and hard samples.
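To make the motivation concrete, the following sketch estimates a per-sample hardness score from out-of-fold probabilities and reports the mean hardness of each test fold under a plain `StratifiedKFold`. It is an illustration only and does not use `InstanceHardnessCV` itself. Defining hardness as one minus the predicted probability of the true class is an assumption made here for readability, and the dataset settings, `max_iter`, and seeds are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Imbalanced toy problem with some label noise (arbitrary settings).
X, y = make_classification(
    weights=[0.9, 0.1], flip_y=0.05, n_samples=1000, random_state=10
)

# Out-of-fold probabilities: each sample is scored by a model that never saw it.
y_proba = cross_val_predict(
    LogisticRegression(max_iter=1000), X, y, cv=5, method="predict_proba"
)

# Assumed hardness definition: 1 - P(true class). Labels are 0/1 here,
# so they can index the probability columns directly.
hardness = 1.0 - y_proba[np.arange(len(y)), y]

# Mean hardness of each test fold under ordinary stratified splitting.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_hardness = [hardness[test].mean() for _, test in skf.split(X, y)]
print(np.round(fold_hardness, 3))
```

The spread of the printed values gives a rough sense of how unevenly hard samples can land across stratified folds; a splitter that balances hardness aims to shrink that spread.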
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `estimator` | estimator object | (required) | Classifier used to estimate instance hardness. Must implement `predict_proba`. |
| `n_splits` | int | 5 | Number of folds. Must be at least 2. |
| `pos_label` | int, float, bool, or str | None | The class considered the positive class when selecting the probability representing instance hardness. If None, defaults to `estimator.classes_[1]`. |
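As a rough illustration of how `pos_label` can be mapped to a column of the `predict_proba` output, the sketch below uses NumPy only. The helper `positive_column` is hypothetical and not part of imbalanced-learn; it assumes the class order is the sorted unique labels, as in scikit-learn's `classes_` attribute.

```python
import numpy as np

def positive_column(y, pos_label=None):
    """Hypothetical helper: column index of the positive class."""
    classes = np.unique(y)  # sorted unique labels, same order as classes_
    if classes.size != 2:
        raise ValueError("expected a binary target")
    if pos_label is None:
        return 1  # default: second class in sorted order
    matches = np.flatnonzero(classes == pos_label)
    if matches.size == 0:
        raise ValueError(f"pos_label={pos_label!r} is not present in y")
    return int(matches[0])

y = np.array(["ham", "spam", "ham", "ham", "spam"])
print(positive_column(y))                   # 1 ("spam" sorts second)
print(positive_column(y, pos_label="ham"))  # 0
```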
Methods
split(X, y, groups=None)
Generates indices to split data into training and test sets. The groups parameter is ignored (a warning is emitted if provided).
```python
from imblearn.model_selection import InstanceHardnessCV
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.linear_model import LogisticRegression

X, y = make_classification(
    weights=[0.9, 0.1], class_sep=2,
    n_informative=3, n_redundant=1,
    flip_y=0.05, n_samples=1000, random_state=10
)
estimator = LogisticRegression()
ih_cv = InstanceHardnessCV(estimator, n_splits=5)
cv_result = cross_validate(estimator, X, y, cv=ih_cv)
print(f"Std of test scores: {cv_result['test_score'].std():.3f}")
```
get_n_splits(X=None, y=None, groups=None)
Returns the number of splitting iterations. All parameters are ignored; the method simply returns self.n_splits.
Implementation Details
The split method follows this algorithm:
- Validate target: Checks that `y` is a binary classification target. Raises `ValueError` if not.
- Determine positive label: If `pos_label` is `None`, defaults to index `1` among sorted unique classes. Otherwise, looks up the index of the specified class.
- Estimate instance hardness: Uses `sklearn.model_selection.cross_val_predict` with a clone of the estimator and stratified CV to obtain `predict_proba` outputs for all samples.
- Sort by (class, hardness): Applies `np.lexsort` on `(y_proba[:, pos_label], y)`, sorting first by class label, then by predicted probability of the positive class.
- Assign fold indices round-robin: Creates a groups array where `groups[sorted_indices] = np.arange(n_samples) % n_splits`. This distributes sorted samples evenly across folds.
- Yield splits: Delegates to `LeaveOneGroupOut().split(X, y, groups)` to generate the actual train/test index arrays.
```python
# Core logic inside split():
y_proba = cross_val_predict(
    clone(self.estimator), X, y, cv=self.n_splits, method="predict_proba"
)
sorted_indices = np.lexsort((y_proba[:, pos_label], y))
groups = np.empty(_num_samples(X), dtype=int)
groups[sorted_indices] = np.arange(_num_samples(X)) % self.n_splits
cv = LeaveOneGroupOut()
yield from cv.split(X, y, groups)
```
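The sort and round-robin steps can be traced by hand on a tiny example. The sketch below uses NumPy only; the labels and probabilities are made-up values chosen for illustration, not output from any estimator.

```python
import numpy as np

n_splits = 3
# Toy binary target and made-up positive-class probabilities.
y = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1])
proba_pos = np.array([0.10, 0.40, 0.05, 0.30, 0.20, 0.45, 0.90, 0.55, 0.70])

# np.lexsort treats the LAST key as primary: sort by class, then by probability.
sorted_indices = np.lexsort((proba_pos, y))
print(sorted_indices)  # [2 0 4 3 1 5 7 8 6]

# Deal the sorted samples out to folds in turn: 0, 1, 2, 0, 1, 2, ...
groups = np.empty(len(y), dtype=int)
groups[sorted_indices] = np.arange(len(y)) % n_splits
print(groups)  # [1 1 0 0 2 2 2 0 1]

# Each fold receives two class-0 samples and one class-1 sample,
# drawn from across the probability range.
for fold in range(n_splits):
    members = np.flatnonzero(groups == fold)
    print(fold, y[members], proba_pos[members])
```

Because neighbouring samples in the sorted order go to different folds, each fold ends up with a mix of low and high probabilities within each class.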
Important Notes
- Binary classification only: The splitter raises a `ValueError` if `y` is not binary.
- Estimator requirement: The provided estimator must support `predict_proba`.
- Internal cross-validation: The method internally runs a full stratified cross-validation to estimate instance hardness, which adds computational overhead.
- Groups parameter ignored: If a `groups` argument is passed to `split`, a `UserWarning` is emitted and the argument is discarded.