Implementation:Scikit learn contrib Imbalanced learn ValueDifferenceMetric
Implementation: ValueDifferenceMetric
ValueDifferenceMetric is a class in the imbalanced-learn library that computes pairwise distances between samples containing only categorical features using the Value Difference Metric (VDM). The distance is based on the conditional probabilities of class labels given feature values, enabling meaningful distance computations over nominal data.
Overview
| Property | Value |
|---|---|
| Class | ValueDifferenceMetric(BaseEstimator) |
| Source | imblearn/metrics/pairwise.py (lines 1-242) |
| Import | from imblearn.metrics.pairwise import ValueDifferenceMetric |
| Added in | version 0.8 |
Purpose
The ValueDifferenceMetric computes the distance between samples whose features are entirely categorical. Unlike standard distance metrics (e.g., Euclidean) that operate on continuous features, VDM leverages the conditional probability distributions of class labels given each feature value to determine how "far apart" two feature values are. Feature values that lead to similar class distributions are considered close, even if they are nominally different.
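As a compact way to read the idea above, the distance can be written out from the steps listed under Implementation Details. This is a sketch of that description, not a formula quoted from the library; the symbols C (number of classes), F (number of features), and P(c | a) (probability of class c given category a) are notation introduced here.

```latex
\delta_f(a, b) = \Bigl( \sum_{c=1}^{C} \bigl| P(c \mid a) - P(c \mid b) \bigr|^{k} \Bigr)^{1/k},
\qquad
d(\mathbf{x}, \mathbf{y}) = \sum_{f=1}^{F} \delta_f(x_f, y_f)^{r}
```

With the default k = 1, the per-feature term is the sum of absolute differences between the two class distributions, so two categories with identical class distributions are at distance zero.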
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| n_categories | "auto" or array-like of shape (n_features,) | "auto" | The number of unique categories per feature. If "auto", computed from X at fit time. Can also be derived from the categories_ attribute of OrdinalEncoder. |
| k | int | 1 | Exponent used to compute the distance between individual feature values. |
| r | int | 2 | Exponent used to compute the distance between full feature vectors. |
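The difference between "auto" and an explicit n_categories can be illustrated with plain NumPy. The sketch below assumes the "auto" rule described on this page (largest observed code plus one); the `categories` list is a hypothetical stand-in for the categories_ attribute of a fitted OrdinalEncoder, and the data are made up for illustration.

```python
import numpy as np

# Hypothetical stand-in for a fitted OrdinalEncoder's categories_ attribute:
# one array of known labels per feature.
categories = [
    np.array(["blue", "green", "red"]),
    np.array(["L", "M", "S", "XL"]),
]

# Encoded training data in which code 3 ("XL") of the second feature never occurs.
X_train = np.array([[0, 0], [1, 2], [2, 1], [1, 0]], dtype=np.int32)

# What "auto" is described as computing: largest observed code + 1.
n_categories_auto = X_train.max(axis=0) + 1

# Explicit counts taken from the encoder's known categories.
n_categories_explicit = np.array([len(c) for c in categories])

print(n_categories_auto)      # [3 3]
print(n_categories_explicit)  # [3 4]
```

Under these assumptions, passing the explicit counts as n_categories would reserve a row for every known category, including ones that happen to be absent from the data seen at fit time.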
Fitted Attributes
| Attribute | Type | Description |
|---|---|---|
| n_categories_ | ndarray of shape (n_features,) | The number of categories per feature. |
| proba_per_class_ | list of ndarray of shape (n_categories, n_classes) | Conditional probability of each class given each category, one array per feature (rows index categories, columns index classes). |
| n_features_in_ | int | Number of features in the input dataset. |
| feature_names_in_ | ndarray of shape (n_features_in_,) | Names of features seen during fit. Only defined when X has string feature names. |
Methods
fit(X, y)
Computes the conditional probability statistics required for the VDM from the training data. For each feature, it counts how often each category co-occurs with each class, then normalizes to obtain conditional probabilities.
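```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder
from imblearn.metrics.pairwise import ValueDifferenceMetric

# One categorical feature with three values, ten samples each.
X = np.array(["green"] * 10 + ["red"] * 10 + ["blue"] * 10).reshape(-1, 1)
# Binary labels: green is mostly class 1, red mostly class 1, blue mostly class 0.
y = [1] * 8 + [0] * 5 + [1] * 7 + [0] * 9 + [1]

# Encode the strings as integer codes, as expected by the metric.
encoder = OrdinalEncoder(dtype=np.int32)
X_encoded = encoder.fit_transform(X)

vdm = ValueDifferenceMetric(k=1, r=2)
vdm.fit(X_encoded, y)
```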
pairwise(X, Y=None)
Computes the VDM pairwise distance matrix. If Y is None, computes pairwise distances within X. Otherwise, computes distances between rows of X and rows of Y.
```python
pairwise_distance = vdm.pairwise(X_encoded)
print(pairwise_distance.shape)
# (30, 30)

X_test = np.array(["green", "red", "blue"]).reshape(-1, 1)
X_test_encoded = encoder.transform(X_test)
print(vdm.pairwise(X_test_encoded))
# array([[0.  , 0.04, 1.96],
#        [0.04, 0.  , 1.44],
#        [1.96, 1.44, 0.  ]])
```
Implementation Details
The fit method performs the following steps:
- Validates parameters and input data (expects np.int32 dtype from OrdinalEncoder).
- Determines the number of categories per feature (either from the n_categories parameter or by computing X.max(axis=0) + 1).
- For each feature and each class, counts occurrences via np.bincount.
- Normalizes counts to conditional probabilities by dividing by row sums (handling division-by-zero with np.nan_to_num).
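The fit steps above can be sketched in plain NumPy. This is an illustrative re-implementation of the described steps, not the library's source; the function name `fit_vdm_probabilities` is hypothetical, input validation is omitted, and the data reuse the integer codes and labels from the fit example (blue=0, green=1, red=2).

```python
import numpy as np

def fit_vdm_probabilities(X, y, n_categories=None):
    """Hypothetical helper: per-feature tables of P(class | category)."""
    X = np.asarray(X, dtype=np.int32)
    y = np.asarray(y)
    if n_categories is None:
        # The "auto" rule: categories are assumed to be coded 0..max.
        n_categories = X.max(axis=0) + 1
    classes = np.unique(y)

    proba_per_class = []
    for f, n_cat in enumerate(n_categories):
        counts = np.zeros((n_cat, len(classes)), dtype=np.float64)
        for c_idx, c in enumerate(classes):
            # Count how often each category co-occurs with class c.
            counts[:, c_idx] = np.bincount(X[y == c, f], minlength=n_cat)
        # Normalize each row; categories never observed give 0/0 -> NaN -> 0.
        with np.errstate(invalid="ignore", divide="ignore"):
            proba = counts / counts.sum(axis=1, keepdims=True)
        proba_per_class.append(np.nan_to_num(proba))
    return proba_per_class

# Same data as the fit example, already encoded as integer codes.
X_codes = np.array([1] * 10 + [2] * 10 + [0] * 10).reshape(-1, 1)
y = [1] * 8 + [0] * 5 + [1] * 7 + [0] * 9 + [1]
proba = fit_vdm_probabilities(X_codes, y)
print(proba[0])  # rows: blue, green, red; columns: class 0, class 1
```

For this data the rows work out to roughly (0.9, 0.1) for blue, (0.2, 0.8) for green, and (0.3, 0.7) for red.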
The pairwise method computes distances by:
- For each feature, looking up the precomputed conditional probability vectors for each sample's feature value.
- Computing the L_k distance matrix between those probability vectors using scipy.spatial.distance_matrix.
- Raising each per-feature distance matrix to the power r and summing across all features.
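The pairwise steps above can be sketched with NumPy and SciPy as follows. The function name `vdm_pairwise` is hypothetical and validation is omitted; the probability table is hard-coded from the labels in the fit example (rows are blue, green, red; columns are class 0 and class 1), so treat it as an illustration of the described steps, not the library's code.

```python
import numpy as np
from scipy.spatial import distance_matrix

def vdm_pairwise(proba_per_class, X, Y=None, k=1, r=2):
    """Hypothetical helper: sum over features of (L_k distance) ** r."""
    X = np.asarray(X, dtype=np.int32)
    Y = X if Y is None else np.asarray(Y, dtype=np.int32)
    dist = np.zeros((X.shape[0], Y.shape[0]), dtype=np.float64)
    for f, proba in enumerate(proba_per_class):
        # Look up the class-probability vector for each sample's category.
        proba_X = proba[X[:, f]]
        proba_Y = proba[Y[:, f]]
        # L_k distance between probability vectors, raised to the power r.
        dist += distance_matrix(proba_X, proba_Y, p=k) ** r
    return dist

# P(class | category) for the single feature of the fit example.
proba_per_class = [np.array([[0.9, 0.1], [0.2, 0.8], [0.3, 0.7]])]

# green, red, blue encoded as 1, 2, 0.
X_test_codes = np.array([[1], [2], [0]])
D = vdm_pairwise(proba_per_class, X_test_codes)
print(np.round(D, 2))
```

For green versus red with k=1 and r=2, the arithmetic is (|0.2 - 0.3| + |0.8 - 0.7|)^2 = 0.2^2 = 0.04, which agrees with the matrix shown in the pairwise example.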
Important Notes
- Input data must be encoded using sklearn.preprocessing.OrdinalEncoder with dtype=np.int32. If other dtypes are provided, the data will be cast to np.int32.
- The metric requires non-negative integer values (enforced via ensure_non_negative=True).
- The class sets the sklearn tag positive_only = True to indicate this constraint.
Reference
Stanfill, Craig, and David Waltz. "Toward memory-based reasoning." Communications of the ACM 29.12 (1986): 1213-1228.