Implementation:Scikit learn contrib Imbalanced learn ValueDifferenceMetric

From Leeroopedia


Implementation: ValueDifferenceMetric

ValueDifferenceMetric is a class in the imbalanced-learn library that computes pairwise distances between samples containing only categorical features using the Value Difference Metric (VDM). The distance is based on the conditional probabilities of class labels given feature values, enabling meaningful distance computations over nominal data.

Overview

| Property | Value |
|---|---|
| Class | ValueDifferenceMetric(BaseEstimator) |
| Source | imblearn/metrics/pairwise.py (lines 1-242) |
| Import | from imblearn.metrics.pairwise import ValueDifferenceMetric |
| Added in | version 0.8 |

Purpose

The ValueDifferenceMetric computes the distance between samples whose features are entirely categorical. Unlike standard distance metrics (e.g., Euclidean) that operate on continuous features, VDM leverages the conditional probability distributions of class labels given each feature value to determine how "far apart" two feature values are. Feature values that lead to similar class distributions are considered close, even if they are nominally different.
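
As a compact restatement, the computation described in the Implementation Details section of this page can be written as follows. The symbols are assumptions introduced here for illustration: C is the number of classes, F the number of features, x_f the category of sample x on feature f, and P(c | x_f) the fitted conditional probability of class c given that category.

```latex
% Per-feature distance: L_k distance between the two class-probability vectors
\delta_f(x, y) = \Big( \sum_{c=1}^{C} \big| P(c \mid x_f) - P(c \mid y_f) \big|^{k} \Big)^{1/k}

% Distance between full feature vectors: per-feature distances raised to r, then summed
\Delta(x, y) = \sum_{f=1}^{F} \delta_f(x, y)^{r}
```

With the defaults k = 1 and r = 2, the outer root has no effect. For example, if P(1 | green) = 0.8 and P(1 | red) = 0.7 in a two-class problem, then the per-feature distance is |0.8 - 0.7| + |0.2 - 0.3| = 0.2 and the resulting distance is 0.2^2 = 0.04.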

Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| n_categories | "auto" or array-like of shape (n_features,) | "auto" | The number of unique categories per feature. If "auto", computed from X at fit time. Can also be derived from the categories_ attribute of OrdinalEncoder. |
| k | int | 1 | Exponent used to compute the distance between individual feature values. |
| r | int | 2 | Exponent used to compute the distance between full feature vectors. |
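
A minimal sketch of deriving the per-feature category counts from a fitted OrdinalEncoder, as mentioned for n_categories. The two-column toy data (color and size) is an assumption made up for this illustration.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy data: one color column and one size column.
X = np.array(
    [
        ["green", "small"],
        ["red", "large"],
        ["blue", "small"],
        ["green", "large"],
    ]
)

encoder = OrdinalEncoder(dtype=np.int32)
X_encoded = encoder.fit_transform(X)

# categories_ holds one array of known categories per feature,
# so its lengths give the number of categories per feature.
n_categories = [len(categories) for categories in encoder.categories_]
print(n_categories)  # [3, 2]
```

The resulting list could then be passed as ValueDifferenceMetric(n_categories=n_categories). This may be preferable to "auto" when the data passed to fit does not contain every category the encoder knows about.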

Fitted Attributes

| Attribute | Type | Description |
|---|---|---|
| n_categories_ | ndarray of shape (n_features,) | The number of categories per feature. |
| proba_per_class_ | list of ndarray of shape (n_categories, n_classes) | Conditional probabilities of each class given a category, one array per feature (each row corresponds to a category and sums to 1 when that category was seen during fit). |
| n_features_in_ | int | Number of features in the input dataset. |
| feature_names_in_ | ndarray of shape (n_features_in_,) | Names of features seen during fit. Only defined when X has string feature names. |

Methods

fit(X, y)

Computes the conditional probability statistics required for the VDM from the training data. For each feature, it counts how often each category co-occurs with each class, then normalizes to obtain conditional probabilities.

from sklearn.preprocessing import OrdinalEncoder
from imblearn.metrics.pairwise import ValueDifferenceMetric
import numpy as np

X = np.array(["green"] * 10 + ["red"] * 10 + ["blue"] * 10).reshape(-1, 1)
y = [1] * 8 + [0] * 5 + [1] * 7 + [0] * 9 + [1]

encoder = OrdinalEncoder(dtype=np.int32)
X_encoded = encoder.fit_transform(X)

vdm = ValueDifferenceMetric(k=1, r=2)
vdm.fit(X_encoded, y)

pairwise(X, Y=None)

Computes the VDM pairwise distance matrix. If Y is None, computes pairwise distances within X. Otherwise, computes distances between rows of X and rows of Y.

pairwise_distance = vdm.pairwise(X_encoded)
print(pairwise_distance.shape)
# (30, 30)

X_test = np.array(["green", "red", "blue"]).reshape(-1, 1)
X_test_encoded = encoder.transform(X_test)
print(vdm.pairwise(X_test_encoded))
# Approximately (rows and columns ordered green, red, blue):
# [[0.   0.04 1.96]
#  [0.04 0.   1.44]
#  [1.96 1.44 0.  ]]

Implementation Details

The fit method performs the following steps:

  1. Validates parameters and input data (expects np.int32 dtype from OrdinalEncoder).
  2. Determines the number of categories per feature (either from n_categories parameter or by computing X.max(axis=0) + 1).
  3. For each feature and each class, counts occurrences via np.bincount.
  4. Normalizes counts to conditional probabilities by dividing by row sums (handling division-by-zero with np.nan_to_num).
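
The fit steps above can be sketched with NumPy alone. This is an illustrative approximation under stated assumptions, not the library's code: the function name fit_vdm_probabilities is hypothetical, parameter validation is omitted, and the data is the color example from this page encoded by hand (blue=0, green=1, red=2, the alphabetical order OrdinalEncoder would assign).

```python
import numpy as np


def fit_vdm_probabilities(X, y, n_categories=None):
    """Hypothetical helper: one (n_categories, n_classes) array per feature."""
    X = np.asarray(X, dtype=np.int32)
    y = np.asarray(y)
    classes = np.unique(y)
    if n_categories is None:
        # "auto" behaviour: largest code in each column, plus one.
        n_categories = X.max(axis=0) + 1

    proba_per_class = []
    for feature_idx, n_cat in enumerate(n_categories):
        n_cat = int(n_cat)
        counts = np.empty((n_cat, len(classes)), dtype=np.float64)
        for class_idx, label in enumerate(classes):
            # Count how often each category co-occurs with this class.
            counts[:, class_idx] = np.bincount(
                X[y == label, feature_idx], minlength=n_cat
            )
        # Normalize each row; categories never seen yield 0/0 = nan,
        # which nan_to_num turns into 0.
        with np.errstate(invalid="ignore", divide="ignore"):
            proba = counts / counts.sum(axis=1, keepdims=True)
        proba_per_class.append(np.nan_to_num(proba))
    return proba_per_class


X_encoded = np.array([1] * 10 + [2] * 10 + [0] * 10).reshape(-1, 1)
y = [1] * 8 + [0] * 5 + [1] * 7 + [0] * 9 + [1]

proba = fit_vdm_probabilities(X_encoded, y)
print(proba[0])  # rows: blue, green, red; columns: class 0, class 1
```

For this data the rows come out close to (0.9, 0.1) for blue, (0.2, 0.8) for green, and (0.3, 0.7) for red. Passing n_categories=[4] would add a fourth, all-zero row for a category that never appears in X.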

The pairwise method computes distances by:

  1. For each feature, looking up the precomputed conditional probability vectors for each sample's feature value.
  2. Computing the L_k distance matrix between those probability vectors using scipy.spatial.distance_matrix.
  3. Raising each per-feature distance matrix to the power r and summing across all features.
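
The pairwise steps above can be sketched as follows. Two assumptions apply: the function name vdm_pairwise is hypothetical, and NumPy broadcasting is swapped in for scipy.spatial.distance_matrix so that the block depends on NumPy only. The probability table is the one implied by the color example on this page (rows: blue, green, red).

```python
import numpy as np


def vdm_pairwise(X, Y, proba_per_class, k=1, r=2):
    """Hypothetical helper: VDM distances between rows of X and rows of Y."""
    X = np.asarray(X, dtype=np.int32)
    Y = np.asarray(Y, dtype=np.int32)
    distance = np.zeros((X.shape[0], Y.shape[0]), dtype=np.float64)
    for feature_idx, proba in enumerate(proba_per_class):
        # Step 1: look up the class-probability vector of each sample's category.
        p_x = proba[X[:, feature_idx]]  # shape (n_samples_X, n_classes)
        p_y = proba[Y[:, feature_idx]]  # shape (n_samples_Y, n_classes)
        # Step 2: L_k distance between every pair of probability vectors.
        diff = np.abs(p_x[:, None, :] - p_y[None, :, :])
        l_k = (diff**k).sum(axis=-1) ** (1.0 / k)
        # Step 3: raise to the power r and accumulate over features.
        distance += l_k**r
    return distance


# P(class | category) for the single color feature: blue, green, red.
proba_per_class = [np.array([[0.9, 0.1], [0.2, 0.8], [0.3, 0.7]])]

X_test_encoded = np.array([[1], [2], [0]])  # green, red, blue
D = vdm_pairwise(X_test_encoded, X_test_encoded, proba_per_class)
print(np.round(D, 2))
```

Up to floating-point rounding, D matches the 3 x 3 matrix shown in the pairwise example: 0.04 between green and red, 1.96 between green and blue, and 1.44 between red and blue.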

Important Notes

  • Input data should be ordinal-encoded, for example with sklearn.preprocessing.OrdinalEncoder using dtype=np.int32. Data provided with another dtype is cast to np.int32.
  • The metric requires non-negative integer values (enforced via ensure_non_negative=True).
  • The class sets the sklearn tag positive_only = True to indicate this constraint.

Reference

Stanfill, Craig, and David Waltz. "Toward memory-based reasoning." Communications of the ACM 29.12 (1986): 1213-1228.
