Principle:Haifengl Smile Point Data Preparation
Overview
Point Data Preparation is the foundational step in any nearest neighbor search workflow using the Smile library. All spatial index structures in the smile.neighbor package operate on points embedded in metric spaces. Before constructing an index, data must be formatted as uniform-dimensional numeric vectors -- specifically as double[][] arrays for Euclidean-space structures, or as typed arrays with an associated Metric for general metric-space structures.
Each row in the double[][] array represents a single point, and each column represents a dimension (feature). Every point must have the same number of dimensions. Violating this invariant will produce undefined behavior or runtime exceptions during index construction.
Theoretical Basis
Metric Spaces and Vector Representation
Spatial index structures are designed to accelerate proximity queries in metric spaces -- mathematical spaces equipped with a distance function d(x, y) satisfying:
- Non-negativity: d(x, y) >= 0
- Identity of indiscernibles: d(x, y) = 0 if and only if x = y
- Symmetry: d(x, y) = d(y, x)
- Triangle inequality: d(x, z) <= d(x, y) + d(y, z)
The most common embedding is Euclidean space (R^k), where points are represented as k-dimensional real-valued vectors. Structures such as KDTree, LSH, MPLSH, and RandomProjectionForest all assume Euclidean space and require double[][] input.
General metric-space structures (CoverTree, BKTree, LinearSearch) accept arbitrary key types with user-supplied distance functions, but the data must still be organized as arrays or lists of uniform type.
The Curse of Dimensionality
The effectiveness of spatial index structures degrades as dimensionality increases. In high-dimensional spaces:
- The volume of space grows exponentially, making data increasingly sparse.
- Distance concentration occurs: the ratio of the distance to the nearest neighbor versus the farthest neighbor approaches 1.
- Tree-based exact methods (KD-tree) degenerate to linear scan when dimensionality exceeds roughly 20.
This means the dimensionality of your prepared data directly determines which index structures are viable. As noted in the Smile KDTree Javadoc: "if the dimensionality is D, then number of points in the dataset, N, should be N >> 2^D".
Feature Normalization
When features have different scales (e.g., age in [0, 100] vs. income in [0, 1000000]), Euclidean distance will be dominated by the high-magnitude features. Normalization strategies include:
- Min-max scaling: Transform each feature to [0, 1].
- Z-score standardization: Transform each feature to zero mean and unit variance.
- Unit vector normalization: Scale each point to unit length (useful for angular distance).
Normalization should be applied before constructing the spatial index, as the index captures the geometric structure of the data at construction time.
Data Format Requirements
Euclidean-Space Structures
| Structure | Key Type | Data Format | Notes |
|---|---|---|---|
KDTree |
double[] |
double[][] |
All rows must have identical length |
LSH |
double[] |
double[][] |
Dimensionality d >= 2 enforced |
MPLSH |
double[] |
double[][] |
Extends LSH; same constraints |
RandomProjectionForest |
double[] |
double[][] |
Supports Euclidean and angular distance |
General Metric-Space Structures
| Structure | Key Type | Distance Requirement | Notes |
|---|---|---|---|
CoverTree |
Generic K |
Metric<K> |
Works with any metric space |
BKTree |
Generic K |
Metric<K> (discrete) |
Requires integer-valued distances (edit distance, Hamming, etc.) |
LinearSearch |
Generic K |
Distance<K> |
Most general; any distance function |
Key Considerations
- Consistent dimensionality: Every point in a
double[][]array must have the same number of elements. The library infers dimensionality fromkeys[0].lengthduring construction. - No null entries: Null elements within the data array will cause
NullPointerExceptionduring index construction. - Matching key and data arrays: When using constructors that accept separate key and data arrays, their lengths must match exactly. The library validates this with an
IllegalArgumentException. - Reference equality for exclusion: Smile uses reference equality (
==) to exclude the query object from results. When searching with an object from the indexed dataset, the query itself will be automatically excluded.
Workflow Position
Point Data Preparation is Step 1 of the Nearest Neighbor Search workflow:
- Point Data Preparation (this principle)
- Spatial Index Selection
- Spatial Index Construction
- Nearest Neighbor Query
- Neighbor Result Processing