Implementation:Scikit learn contrib Imbalanced learn ValueDifferenceMetric

Implementation: ValueDifferenceMetric

ValueDifferenceMetric is a class in the imbalanced-learn library that computes pairwise distances between samples containing only categorical features using the Value Difference Metric (VDM). The distance is based on the conditional probabilities of class labels given feature values, enabling meaningful distance computations over nominal data.

Overview

| Property | Value |
| --- | --- |
| Class | `ValueDifferenceMetric(BaseEstimator)` |
| Source | `imblearn/metrics/pairwise.py` (lines 1-242) |
| Import | `from imblearn.metrics.pairwise import ValueDifferenceMetric` |
| Added in | version 0.8 |

Purpose

The ValueDifferenceMetric computes the distance between samples whose features are entirely categorical. Unlike standard distance metrics (e.g., Euclidean) that operate on continuous features, VDM leverages the conditional probability distributions of class labels given each feature value to determine how "far apart" two feature values are. Feature values that lead to similar class distributions are considered close, even if they are nominally different.

Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `n_categories` | `"auto"` or array-like of shape `(n_features,)` | `"auto"` | The number of unique categories per feature. If `"auto"`, computed from `X` at fit time. Can also be derived from the `categories_` attribute of `OrdinalEncoder`. |
| `k` | int | `1` | Exponent used to compute the distance between individual feature values. |
| `r` | int | `2` | Exponent used to compute the distance between full feature vectors. |
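
Read together with the steps listed under Implementation Details, the roles of `k` and `r` can be written as a formula. The notation is introduced here for illustration only and is not taken from the library: $P_f(c \mid a)$ is the fitted probability of class $c$ given value $a$ of feature $f$, and $x_f$ is the value of feature $f$ in sample $x$.

$$
\delta_f(a, b) = \Big( \sum_{c} \big| P_f(c \mid a) - P_f(c \mid b) \big|^{k} \Big)^{1/k},
\qquad
d(x, y) = \sum_{f} \delta_f(x_f, y_f)^{\,r}
$$

With the default `k=1`, the inner term is the sum of absolute differences between the two class distributions, and with `r=2` each per-feature distance is squared before the sum over features.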

Fitted Attributes

| Attribute | Type | Description |
| --- | --- | --- |
| `n_categories_` | ndarray of shape `(n_features,)` | The number of categories per feature. |
| `proba_per_class_` | list of ndarray of shape `(n_categories, n_classes)` | One array per feature. Each row holds the conditional probabilities of the classes given that category, so every row of an observed category sums to 1. |
| `n_features_in_` | int | Number of features in the input dataset. |
| `feature_names_in_` | ndarray of shape `(n_features_in_,)` | Names of features seen during `fit`. Only defined when `X` has string feature names. |

Methods

fit(X, y)

Computes the conditional probability statistics required for the VDM from the training data. For each feature, it counts how often each category co-occurs with each class, then normalizes to obtain conditional probabilities.

from sklearn.preprocessing import OrdinalEncoder
from imblearn.metrics.pairwise import ValueDifferenceMetric
import numpy as np

X = np.array(["green"] * 10 + ["red"] * 10 + ["blue"] * 10).reshape(-1, 1)
y = [1] * 8 + [0] * 5 + [1] * 7 + [0] * 9 + [1]

encoder = OrdinalEncoder(dtype=np.int32)
X_encoded = encoder.fit_transform(X)

vdm = ValueDifferenceMetric(k=1, r=2)
vdm.fit(X_encoded, y)

pairwise(X, Y=None)

Computes the VDM pairwise distance matrix. If Y is None, computes pairwise distances within X. Otherwise, computes distances between rows of X and rows of Y.

pairwise_distance = vdm.pairwise(X_encoded)
print(pairwise_distance.shape)
# (30, 30)

X_test = np.array(["green", "red", "blue"]).reshape(-1, 1)
X_test_encoded = encoder.transform(X_test)
print(vdm.pairwise(X_test_encoded))
# [[0.   0.04 1.96]
#  [0.04 0.   1.44]
#  [1.96 1.44 0.  ]]

Implementation Details

The fit method performs the following steps:

1. Validates parameters and input data (expects `np.int32` dtype from `OrdinalEncoder`).
2. Determines the number of categories per feature, either from the `n_categories` parameter or by computing `X.max(axis=0) + 1`.
3. For each feature and each class, counts occurrences of every category via `np.bincount`.
4. Normalizes the counts to conditional probabilities by dividing each row by its sum, with `np.nan_to_num` replacing the NaN values that result from a division by zero.
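
The steps above can be sketched with NumPy alone. This is a minimal illustration, not the library's code: the variable names are chosen here, the data are the colour example from this page with the codes written out by hand (0 = blue, 1 = green, 2 = red, the alphabetical order `OrdinalEncoder` would assign), and input validation is omitted.

```python
import numpy as np

# One ordinal-encoded feature: 10 green (1), 10 red (2), 10 blue (0).
X = np.array([1] * 10 + [2] * 10 + [0] * 10, dtype=np.int32).reshape(-1, 1)
y = np.array([1] * 8 + [0] * 5 + [1] * 7 + [0] * 9 + [1])

classes = np.unique(y)
# "auto" behaviour: codes are assumed to run from 0 to max, so max + 1 categories.
n_categories = X.max(axis=0) + 1

proba_per_class = []
for feature_idx, n_cat in enumerate(n_categories):
    n_cat = int(n_cat)
    counts = np.empty((n_cat, len(classes)), dtype=np.float64)
    for class_idx, klass in enumerate(classes):
        # Count how often each category occurs among samples of this class.
        counts[:, class_idx] = np.bincount(
            X[y == klass, feature_idx], minlength=n_cat
        )
    # Row-normalize; a category that never occurs yields 0/0, mapped to 0 below.
    with np.errstate(invalid="ignore", divide="ignore"):
        proba = counts / counts.sum(axis=1, keepdims=True)
    proba_per_class.append(np.nan_to_num(proba))

print(proba_per_class[0])  # rows: blue, green, red; columns: class 0, class 1
```

Each row is the class distribution for one category, which is why green (80% class 1) and red (70% class 1) end up close to each other and far from blue (10% class 1).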

The pairwise method computes distances by:

1. For each feature, looking up the precomputed conditional probability vectors for each sample's feature value.
2. Computing the L_k distance matrix between those probability vectors using `scipy.spatial.distance_matrix`.
3. Raising each per-feature distance matrix to the power `r` and summing across all features.
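
A rough equivalent of these three steps is sketched below. It is an illustration under stated assumptions, not the library's implementation: `vdm_pairwise` is a hypothetical helper, the Minkowski distance is written out with NumPy broadcasting in place of `scipy.spatial.distance_matrix`, and the probability table is the one the colour example on this page works out to.

```python
import numpy as np

# Class distributions per category for a single feature.
# Rows: blue (0), green (1), red (2); columns: class 0, class 1.
proba_per_class = [np.array([[0.9, 0.1], [0.2, 0.8], [0.3, 0.7]])]


def vdm_pairwise(X, Y, proba_per_class, k=1, r=2):
    """Hypothetical helper mirroring the lookup / L_k / power-r steps."""
    distance = np.zeros((X.shape[0], Y.shape[0]), dtype=np.float64)
    for feature_idx, table in enumerate(proba_per_class):
        # Step 1: look up the class distribution of each sample's category.
        p_x = table[X[:, feature_idx]]  # shape (n_x, n_classes)
        p_y = table[Y[:, feature_idx]]  # shape (n_y, n_classes)
        # Step 2: Minkowski (L_k) distance between every pair of distributions.
        diff = np.abs(p_x[:, None, :] - p_y[None, :, :])
        minkowski = (diff ** k).sum(axis=-1) ** (1.0 / k)
        # Step 3: raise to the power r and accumulate over features.
        distance += minkowski ** r
    return distance


X_test = np.array([[1], [2], [0]])  # green, red, blue
D = vdm_pairwise(X_test, X_test, proba_per_class)
print(np.round(D, 2))
```

For green versus red this gives (|0.2 - 0.3| + |0.8 - 0.7|)^2 = 0.04, matching the distance matrix printed in the `pairwise` example.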

Important Notes

- Input data should be ordinal-encoded integers, for example the output of `sklearn.preprocessing.OrdinalEncoder` with `dtype=np.int32`. Input of any other dtype is cast to `np.int32`.
- The metric requires non-negative integer values (enforced via `ensure_non_negative=True`).
- The class sets the sklearn tag `positive_only = True` to indicate this constraint.
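
The cast and the non-negativity requirement can be illustrated with NumPy alone. The sketch below makes two assumptions that do not come from the library's source: the float codes stand in for an encoder that was left at a floating-point dtype, and `-1` stands in for a hypothetical "unknown category" sentinel. It shows that integer-valued floats survive the cast unchanged, and that `np.bincount`, which the fit step relies on, cannot count negative codes.

```python
import numpy as np

# Integer-valued float codes cast cleanly to int32.
codes = np.array([0.0, 2.0, 1.0, 2.0])
codes_int = codes.astype(np.int32)
counts = np.bincount(codes_int, minlength=3)  # occurrences of codes 0, 1, 2

# A negative code (hypothetical sentinel) is rejected by np.bincount.
try:
    np.bincount(np.array([0, -1, 1], dtype=np.int32))
    negative_rejected = False
except ValueError:
    negative_rejected = True

print(counts, negative_rejected)
```

In practice this means categories that the encoder maps to a negative placeholder need to be handled before the data reach the metric.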

Reference

Stanfill, Craig, and David Waltz. "Toward memory-based reasoning." Communications of the ACM 29.12 (1986): 1213-1228.
