Heuristic:Cleanlab Cleanlab KNN Distance Metric Selection

From Leeroopedia




Domains: Outlier_Detection, Nearest_Neighbors
Last Updated: 2026-02-09 19:30 GMT

Overview

Automatic distance metric selection for KNN-based operations (outlier detection, duplicate detection, non-IID testing) based on feature dimensionality and dataset size.

Description

Cleanlab uses k-nearest neighbors extensively for outlier detection, near-duplicate detection, non-IID testing, and data valuation. The choice of distance metric significantly affects the quality of neighbor searches, and the best metric depends on the dimensionality of the feature space. This heuristic automates the selection: cosine distance when the features have more than 3 columns, and euclidean distance otherwise. For the euclidean case, a further choice between the scipy implementation (more precise) and the sklearn implementation (faster) is made based on the number of rows.

Usage

This heuristic is applied automatically whenever Datalab.find_issues computes KNN graphs from features, or when using the OutOfDistribution scorer. Understanding it helps when:

  • Providing pre-computed features to Datalab
  • Debugging unexpected outlier or duplicate detection results
  • Deciding whether to override the default metric

The Insight (Rule of Thumb)

  • Action: Select distance metric based on feature dimensionality:
    • If number of feature columns > 3: use cosine distance
    • If number of feature columns <= 3: use euclidean distance
  • Sub-action: For euclidean distance, select implementation based on dataset size:
    • If number of rows > 100: use sklearn's `"euclidean"` string metric (faster)
    • If number of rows <= 100: use scipy's `euclidean` callable (more numerically precise)
  • Trade-off: Cosine distance ignores magnitude and focuses on direction, which is more meaningful in high-dimensional spaces where euclidean distances become less discriminative (the "curse of dimensionality"). Euclidean distance preserves magnitude information, which is important in low-dimensional spaces.
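The trade-off can be made concrete with a small pure-Python sketch. The helper names `euclidean_distance` and `cosine_distance` and the example vectors are illustrative assumptions, not cleanlab code. The sketch compares a reference vector against a scaled copy (same direction, larger magnitude) and against a reordered vector (same magnitude, different direction).

```python
import math

def euclidean_distance(u, v):
    # Straight-line distance: sensitive to both direction and magnitude.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine_distance(u, v):
    # 1 - cos(angle between u and v): sensitive to direction only.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

reference = [1.0, 2.0, 3.0, 4.0]
scaled = [10.0 * x for x in reference]   # same direction, 10x the magnitude
reordered = [4.0, 3.0, 2.0, 1.0]         # same magnitude, different direction

for name, other in (("scaled", scaled), ("reordered", reordered)):
    print(name,
          "cosine:", round(cosine_distance(reference, other), 4),
          "euclidean:", round(euclidean_distance(reference, other), 4))
```

Under cosine distance the scaled copy is the closer neighbor, while under euclidean distance the reordered vector is closer, so the two metrics can rank the same candidates in opposite order.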

Reasoning

In high-dimensional spaces (e.g., embedding vectors from neural networks, which typically have 128 to 4096 dimensions), euclidean distance becomes less discriminative because pairwise distances tend to concentrate around a similar value, so the nearest and farthest neighbors of a point become hard to tell apart. Cosine distance compares the angle between vectors and ignores their magnitude, which is often a more meaningful notion of similarity for such embeddings.
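The loss of discrimination can be observed on synthetic data. The following is a minimal sketch, assuming uniformly random points in the unit hypercube; the helper `relative_contrast` is a hypothetical name, and real embeddings may behave differently from uniform noise. It measures how much farther the farthest point is than the nearest one, relative to the nearest.

```python
import math
import random

def relative_contrast(n_dims, n_points=200, seed=0):
    """(farthest - nearest) / nearest euclidean distance from one random query point."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(n_dims)]
    distances = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(n_dims)]
        distances.append(math.sqrt(sum((q - p) ** 2 for q, p in zip(query, point))))
    nearest, farthest = min(distances), max(distances)
    return (farthest - nearest) / nearest

# The contrast shrinks as dimensionality grows: neighbors become harder to separate.
for n_dims in (2, 16, 512):
    print(n_dims, round(relative_contrast(n_dims), 3))
```

A contrast close to zero means every point is roughly equally far from the query, which is the situation in which a "nearest" neighbor carries little information.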

The cutoff of 3 dimensions is chosen conservatively: for 1D, 2D, and 3D feature spaces, euclidean distance has a clear geometric meaning. The cutoff is a default rather than a sharp boundary; the curse of dimensionality affects euclidean distances gradually as the number of dimensions grows beyond it.

The implementation split between scipy and sklearn for euclidean distance reflects a practical trade-off. sklearn's implementation is optimized for batch operations on larger datasets using BLAS routines. scipy's `euclidean` function computes the distance for one pair of vectors at a time directly from their difference, which is slower but more numerically precise; on small datasets the extra cost is negligible, so precision is favored over speed.
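One way a precision gap of this kind can arise is illustrated below in pure Python. This is a sketch under assumptions, not code from cleanlab, scipy, or sklearn: `euclidean_direct` subtracts first and then squares, while `euclidean_expanded` uses the dot-product expansion ||u||² + ||v||² - 2u·v, which suits batched matrix products but can lose digits to cancellation when the vectors are large and close together. sklearn's `euclidean_distances` documentation describes this expansion and notes that it is not the most precise formulation; whether a particular neighbor search takes that code path is not established here.

```python
import math

def euclidean_direct(u, v):
    # Subtract first, then square: small differences are preserved.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def euclidean_expanded(u, v):
    # Dot-product expansion: the three large terms nearly cancel,
    # so rounding error in them can swamp the true squared distance.
    dot_uu = sum(a * a for a in u)
    dot_vv = sum(b * b for b in v)
    dot_uv = sum(a * b for a, b in zip(u, v))
    squared = dot_uu + dot_vv - 2.0 * dot_uv
    return math.sqrt(max(squared, 0.0))  # clamp tiny negatives caused by rounding

# Two points far from the origin that differ by exactly 1 in each coordinate.
u = [1e8, 1e8]
v = [1e8 + 1.0, 1e8 + 1.0]

print("exact   :", math.sqrt(2.0))
print("direct  :", euclidean_direct(u, v))
print("expanded:", euclidean_expanded(u, v))
```

The direct form recovers the true distance, while the expanded form is visibly off for this input. Centering or scaling features reduces the effect, which is one reason the gap rarely matters on well-conditioned data.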

Code Evidence:

Dimension-based metric selection from `cleanlab/internal/neighbor/metric.py:5-11`:

HIGH_DIMENSION_CUTOFF: int = 3
"""
If the number of columns (M) in the `features` array is greater than
this cutoff value, then by default, K-nearest-neighbors will use the
"cosine" metric. The cosine metric is more suitable for high-dimensional
data. Otherwise the "euclidean" distance will be used.
"""

Dataset size-based euclidean implementation from `cleanlab/internal/neighbor/metric.py:13-21`:

ROW_COUNT_CUTOFF: int = 100
"""
Only affects settings where Euclidean metrics would be used by default.
If the number of rows (N) in the `features` array is greater than this
cutoff value, then by default, Euclidean distances are computed via the
"euclidean" metric (implemented in sklearn for efficiency reasons).
Otherwise, Euclidean distances are by default computed via the
``euclidean`` metric from scipy (slower but numerically more precise).
"""

Main decision function from `cleanlab/internal/neighbor/metric.py:74-107` (abridged):

def decide_default_metric(features: FeatureArray) -> Metric:
    if features.shape[1] > HIGH_DIMENSION_CUTOFF:
        return _cosine_metric()
    return decide_euclidean_metric(features)
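The excerpt delegates the euclidean case to `decide_euclidean_metric`, whose body is not shown here. The following is a minimal sketch of how the two documented cutoffs could combine, assuming only the threshold values quoted above. The helper `describe_default_metric` is hypothetical and returns a descriptive label rather than a cleanlab metric object.

```python
HIGH_DIMENSION_CUTOFF = 3   # value quoted from the source excerpt
ROW_COUNT_CUTOFF = 100      # value quoted from the source excerpt

def describe_default_metric(n_rows, n_cols):
    """Hypothetical helper: label the default metric for an (n_rows, n_cols) feature array."""
    if n_cols > HIGH_DIMENSION_CUTOFF:
        return "cosine"
    if n_rows > ROW_COUNT_CUTOFF:
        return "euclidean (sklearn string metric)"
    return "euclidean (scipy callable)"

# A few representative shapes: (rows, columns)
for shape in [(5000, 768), (5000, 3), (100, 3), (50, 4)]:
    print(shape, "->", describe_default_metric(*shape))
```

Both comparisons are strict, so a feature array with exactly 3 columns still uses euclidean distance, and one with exactly 100 rows still uses the scipy callable.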
