Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Principle:Haifengl Smile Point Data Preparation

From Leeroopedia


Overview

Point Data Preparation is the foundational step in any nearest neighbor search workflow using the Smile library. All spatial index structures in the smile.neighbor package operate on points embedded in metric spaces. Before constructing an index, data must be formatted as uniform-dimensional numeric vectors -- specifically as double[][] arrays for Euclidean-space structures, or as typed arrays with an associated Metric for general metric-space structures.

Each row in the double[][] array represents a single point, and each column represents a dimension (feature). Every point must have the same number of dimensions. Violating this invariant will produce undefined behavior or runtime exceptions during index construction.

Theoretical Basis

Metric Spaces and Vector Representation

Spatial index structures are designed to accelerate proximity queries in metric spaces -- mathematical spaces equipped with a distance function d(x, y) satisfying:

  1. Non-negativity: d(x, y) >= 0
  2. Identity of indiscernibles: d(x, y) = 0 if and only if x = y
  3. Symmetry: d(x, y) = d(y, x)
  4. Triangle inequality: d(x, z) <= d(x, y) + d(y, z)

The most common embedding is Euclidean space (R^k), where points are represented as k-dimensional real-valued vectors. Structures such as KDTree, LSH, MPLSH, and RandomProjectionForest all assume Euclidean space and require double[][] input.

General metric-space structures (CoverTree, BKTree, LinearSearch) accept arbitrary key types with user-supplied distance functions, but the data must still be organized as arrays or lists of uniform type.

The Curse of Dimensionality

The effectiveness of spatial index structures degrades as dimensionality increases. In high-dimensional spaces:

  • The volume of space grows exponentially, making data increasingly sparse.
  • Distance concentration occurs: the ratio of the distance to the nearest neighbor versus the farthest neighbor approaches 1.
  • Tree-based exact methods (KD-tree) degenerate to linear scan when dimensionality exceeds roughly 20.

This means the dimensionality of your prepared data directly determines which index structures are viable. As noted in the Smile KDTree Javadoc: "if the dimensionality is D, then number of points in the dataset, N, should be N >> 2^D".

Feature Normalization

When features have different scales (e.g., age in [0, 100] vs. income in [0, 1000000]), Euclidean distance will be dominated by the high-magnitude features. Normalization strategies include:

  • Min-max scaling: Transform each feature to [0, 1].
  • Z-score standardization: Transform each feature to zero mean and unit variance.
  • Unit vector normalization: Scale each point to unit length (useful for angular distance).

Normalization should be applied before constructing the spatial index, as the index captures the geometric structure of the data at construction time.

Data Format Requirements

Euclidean-Space Structures

Structure Key Type Data Format Notes
KDTree double[] double[][] All rows must have identical length
LSH double[] double[][] Dimensionality d >= 2 enforced
MPLSH double[] double[][] Extends LSH; same constraints
RandomProjectionForest double[] double[][] Supports Euclidean and angular distance

General Metric-Space Structures

Structure Key Type Distance Requirement Notes
CoverTree Generic K Metric<K> Works with any metric space
BKTree Generic K Metric<K> (discrete) Requires integer-valued distances (edit distance, Hamming, etc.)
LinearSearch Generic K Distance<K> Most general; any distance function

Key Considerations

  • Consistent dimensionality: Every point in a double[][] array must have the same number of elements. The library infers dimensionality from keys[0].length during construction.
  • No null entries: Null elements within the data array will cause NullPointerException during index construction.
  • Matching key and data arrays: When using constructors that accept separate key and data arrays, their lengths must match exactly. The library validates this with an IllegalArgumentException.
  • Reference equality for exclusion: Smile uses reference equality (==) to exclude the query object from results. When searching with an object from the indexed dataset, the query itself will be automatically excluded.

Workflow Position

Point Data Preparation is Step 1 of the Nearest Neighbor Search workflow:

  1. Point Data Preparation (this principle)
  2. Spatial Index Selection
  3. Spatial Index Construction
  4. Nearest Neighbor Query
  5. Neighbor Result Processing

Related

Categories

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment