Principle:Scikit learn contrib Imbalanced learn Value Difference Metric
Principle: Value Difference Metric
The Value Difference Metric (VDM) is a distance metric designed specifically for categorical features. Unlike standard distance metrics that treat nominal values as either identical or completely different, VDM computes distances based on the statistical relationship between feature values and class labels. Feature values that lead to similar class distributions are considered close, even if they are nominally distinct.
Mathematical Formulation
Per-Feature Distance
For a single feature f, the distance between two values x and y is defined as:
delta(x, y) = sum_{c in C} |p(c|x_f) - p(c|y_f)|^k
where:
- C is the set of all classes.
- p(c|x_f) is the conditional probability that the output class is c given that feature f has the value x.
- k is an exponent, typically set to 1 or 2.
This captures the idea that two feature values are "close" if observing either value leads to a similar distribution over class labels.
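The per-feature formula can be sketched in a few lines of plain Python. This is a minimal illustration, not a library implementation: the helper names `conditional_class_probs` and `per_feature_delta` and the toy data are assumptions made for this example, and p(c|x) is estimated as a simple relative frequency.

```python
from collections import Counter, defaultdict

def conditional_class_probs(values, labels):
    """Estimate p(c | value) for one categorical feature from co-occurrence counts."""
    counts = defaultdict(Counter)
    for v, c in zip(values, labels):
        counts[v][c] += 1
    classes = sorted(set(labels))
    # Normalize each value's class counts by how often that value occurs.
    return {
        v: {c: cnt[c] / sum(cnt.values()) for c in classes}
        for v, cnt in counts.items()
    }

def per_feature_delta(probs, x, y, k=1):
    """delta(x, y) = sum over classes of |p(c|x) - p(c|y)|^k."""
    return sum(abs(probs[x][c] - probs[y][c]) ** k for c in probs[x])

# Toy data (hypothetical): value "a" is mostly class 0, value "b" mostly class 1.
values = ["a", "a", "a", "a", "b", "b", "b", "b"]
labels = [0, 0, 0, 1, 0, 1, 1, 1]

probs = conditional_class_probs(values, labels)
d1 = per_feature_delta(probs, "a", "b", k=1)  # |0.75-0.25| + |0.25-0.75|
d2 = per_feature_delta(probs, "a", "b", k=2)  # 0.5**2 + 0.5**2
print(d1, d2)
```

With k = 1 the result is the L1 distance between the two class distributions; larger k down-weights small differences relative to large ones.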
Full Vector Distance
For two complete feature vectors X and Y, the distance is:
Delta(X, Y) = sum_{f=1}^{F} delta(X_f, Y_f)^r
where:
- F is the number of features.
- r is an exponent, typically set to 1 or 2.
This aggregates the per-feature distances into a single scalar, analogous to the Minkowski distance family but operating on class-conditional probability distributions rather than raw numerical values. As written, the sum is not followed by an r-th root.
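The aggregation step can be sketched as follows. The function name `vdm_distance`, the feature values, and the hand-written probability tables are hypothetical and exist only for this example; in a real setting the tables would be estimated from training data.

```python
def vdm_distance(X, Y, probs_per_feature, k=1, r=2):
    """Delta(X, Y) = sum over features of delta(X_f, Y_f)^r (no final root)."""
    total = 0.0
    for f, (x, y) in enumerate(zip(X, Y)):
        table = probs_per_feature[f]
        # Per-feature distance between the two class distributions.
        delta = sum(abs(px - py) ** k for px, py in zip(table[x], table[y]))
        total += delta ** r
    return total

# Hypothetical tables: probs_per_feature[f][value] = [p(class 0 | value), p(class 1 | value)]
probs_per_feature = [
    {"red": [0.75, 0.25], "blue": [0.25, 0.75]},
    {"small": [0.5, 0.5], "large": [0.25, 0.75]},
]

X = ("red", "small")
Y = ("blue", "large")
d_r2 = vdm_distance(X, Y, probs_per_feature, k=1, r=2)  # 1.0**2 + 0.5**2
d_r1 = vdm_distance(X, Y, probs_per_feature, k=1, r=1)  # 1.0 + 0.5
print(d_r2, d_r1)
```

With r = 2 the feature whose values disagree strongly about the class dominates the total, while r = 1 weights all features linearly.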
Intuition
Consider a medical dataset with a "Symptom" feature having values {cough, sneeze, chest_pain}. If both cough and sneeze are associated with similar class distributions (e.g., both predominantly linked to "cold"), then VDM will assign a small distance between them. In contrast, chest_pain may be associated with a very different class distribution (e.g., predominantly linked to "heart_disease"), yielding a larger distance from the other two values.
This behavior is fundamentally different from Hamming distance, which would treat all three values as equally distant from each other.
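The contrast with Hamming distance can be made concrete with a small sketch. The record counts below are invented purely for illustration and do not come from any real dataset; the helper names `p`, `vdm`, and `hamming` are likewise assumptions for this example.

```python
from collections import Counter

# Hypothetical (symptom, diagnosis) records.
records = (
    [("cough", "cold")] * 3 + [("cough", "heart_disease")] * 1
    + [("sneeze", "cold")] * 4
    + [("chest_pain", "cold")] * 1 + [("chest_pain", "heart_disease")] * 3
)

classes = sorted({label for _, label in records})
value_counts = Counter(value for value, _ in records)
joint_counts = Counter(records)

def p(c, v):
    """Relative-frequency estimate of p(class c | symptom v)."""
    return joint_counts[(v, c)] / value_counts[v]

def vdm(x, y, k=1):
    """Per-feature VDM between two symptom values."""
    return sum(abs(p(c, x) - p(c, y)) ** k for c in classes)

def hamming(x, y):
    """Hamming distance on a single nominal value: 0 if equal, else 1."""
    return 0 if x == y else 1

for a, b in [("cough", "sneeze"), ("cough", "chest_pain"), ("sneeze", "chest_pain")]:
    print(a, b, "VDM:", vdm(a, b), "Hamming:", hamming(a, b))
```

On this toy data VDM ranks cough and sneeze as closest and sneeze and chest_pain as farthest, while Hamming distance reports 1 for every pair.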
Key Properties
- Class-conditional: The distance is derived entirely from the relationship between feature values and class labels, making it inherently supervised.
- Probability-based: Each value's class distribution is normalized by that value's own count, so frequent and rare categories are compared on the same scale. Estimates for very rare values rest on few samples and can be noisy.
- Composable: The per-feature distances are aggregated via a Minkowski-like sum, allowing the metric to scale to multi-feature datasets.
- Encoding requirement: In practice, categorical features must be ordinally encoded (e.g., via OrdinalEncoder) before computing VDM distances.
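One plausible reason for the encoding requirement is that integer codes can index directly into a per-feature probability table. The sketch below illustrates that idea in plain Python under that assumption. The helpers `ordinal_encode` and `probability_table` are hypothetical stand-ins written for this example; they are not scikit-learn's OrdinalEncoder or any library's internals.

```python
def ordinal_encode(column):
    """Map each distinct category to an integer code 0..n-1 (sorted order)."""
    categories = sorted(set(column))
    code_of = {cat: i for i, cat in enumerate(categories)}
    return [code_of[v] for v in column], categories

def probability_table(codes, labels, n_categories, n_classes):
    """table[code][c] = estimated p(class c | feature value with this code)."""
    counts = [[0] * n_classes for _ in range(n_categories)]
    for code, c in zip(codes, labels):
        counts[code][c] += 1
    # Rows for codes that never occur are left as all zeros.
    return [[n / sum(row) if sum(row) else 0.0 for n in row] for row in counts]

# Toy column and integer class labels (hypothetical).
column = ["red", "blue", "red", "green", "blue", "red", "green", "red"]
labels = [0, 1, 0, 1, 1, 1, 1, 0]

codes, categories = ordinal_encode(column)
table = probability_table(codes, labels, len(categories), n_classes=2)

# Per-feature distance (k=1) looked up purely by integer code.
blue, red = categories.index("blue"), categories.index("red")
delta_blue_red = sum(abs(a - b) for a, b in zip(table[blue], table[red]))
print(categories, codes, table, delta_blue_red)
```

The integer codes carry no ordering information here; they serve only as row indices, so any consistent mapping from categories to codes yields the same distances.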
Reference
Stanfill, Craig, and David Waltz. "Toward memory-based reasoning." Communications of the ACM 29.12 (1986): 1213-1228.