Principle: Smile Spatial Index Selection
Overview
Spatial Index Selection is the process of choosing the right spatial index structure for a nearest neighbor search task based on data characteristics (dimensionality, size, metric type) and application requirements (exact vs. approximate, query type, latency constraints). The smile.neighbor package provides seven index structures, each with distinct algorithmic properties, complexity bounds, and applicability constraints.
Selecting the wrong index structure can result in orders-of-magnitude performance degradation or incorrect results. This principle provides a systematic decision framework grounded in the theoretical properties of each data structure.
Decision Framework
Primary Decision Factors
| Factor | Impact | Key Thresholds |
|---|---|---|
| Dimensionality (d) | Determines viability of tree-based structures | d < 20 for KD-tree; d > 20 favors LSH/MPLSH/RPForest |
| Dataset size (n) | Affects build time and query amortization | Small n (< 1000) may not justify complex structures |
| Distance metric | Constrains available structures | Euclidean: all structures; general metric: CoverTree/LinearSearch; discrete: BKTree |
| Exact vs. approximate | Determines if LSH-family is acceptable | Approximate methods trade accuracy for speed in high dimensions |
| Query type | KNN vs. range (RNN) vs. both | BKTree supports only RNN; RPForest supports only KNN |
| Update frequency | Static vs. dynamic dataset | LSH/MPLSH support incremental insertion via put() |
Decision Matrix
| Structure | Dimensionality | Exactness | Distance | KNN | RNN | Build Complexity | Query Complexity |
|---|---|---|---|---|---|---|---|
| KDTree | Low (d < 20) | Exact | Euclidean | Yes | Yes | O(n log n) | O(n^(1-1/d) + k) |
| CoverTree | Any | Exact | Any metric | Yes | Yes | O(c^6 n log n) | O(c^12 log n + k) |
| LSH | High | Approximate | Euclidean | Yes | Yes | O(nL) | O(L) |
| MPLSH | High | Approximate | Euclidean | Yes | Yes | O(nL) | O(L) with fewer tables |
| RandomProjectionForest | High | Approximate | Euclidean/Angular | Yes | No | O(n T log n) | O(T log n) |
| BKTree | Any | Exact | Discrete metric only | No | Yes | O(n) | Data-dependent |
| LinearSearch | Any | Exact | Any distance | Yes | Yes | O(1) | O(n) |
Where: n = dataset size, d = dimensionality, k = neighbors requested, L = number of hash tables, T = number of trees, c = expansion constant (intrinsic dimensionality measure).
Detailed Structure Analysis
KDTree
The KD-tree is a binary space-partitioning tree that recursively splits the space along the dimension with the greatest spread. It is the optimal choice for low-dimensional exact nearest neighbor search.
When to use:
- Dimensionality d < 20 (ideally d < 10)
- Dataset size n >> 2^d (necessary condition for efficiency)
- Euclidean distance metric
- Both KNN and RNN queries needed
When to avoid:
- High-dimensional data (d > 20) -- degrades to linear scan
- Non-Euclidean metrics -- KDTree is hardcoded to use MathEx.distance() (Euclidean)
Smile specifics: The KDTree class splits on the dimension with maximum range at each node, using midpoint splitting. It implements both KNNSearch and RNNSearch.
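The split rule described above can be sketched in a few lines. This is an illustrative reimplementation of the max-spread/midpoint idea, not Smile's KDTree code:

```java
// Hypothetical sketch of the KD-tree split rule: pick the dimension with the
// greatest spread and split at the midpoint of its range. Illustration only.
public class SplitRule {
    /** Returns {splitDim, splitValue} for one node's point set. */
    public static double[] choose(double[][] points) {
        int d = points[0].length;
        int bestDim = 0;
        double bestSpread = -1, lo = 0, hi = 0;
        for (int j = 0; j < d; j++) {
            double min = Double.POSITIVE_INFINITY, max = Double.NEGATIVE_INFINITY;
            for (double[] p : points) {
                min = Math.min(min, p[j]);
                max = Math.max(max, p[j]);
            }
            if (max - min > bestSpread) {
                bestSpread = max - min;
                bestDim = j;
                lo = min;
                hi = max;
            }
        }
        // Midpoint split: the middle of the range, not the median point.
        return new double[] { bestDim, (lo + hi) / 2.0 };
    }
}
```

Midpoint splitting keeps cells geometrically balanced even when the point distribution is skewed, at the cost of possibly uneven point counts per child.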
CoverTree
The Cover Tree provides exact nearest neighbor search in any metric space. Its complexity depends on the data's expansion constant (c), a measure of intrinsic dimensionality.
When to use:
- Custom or non-Euclidean metric (e.g., geodesic distance, Mahalanobis distance)
- Data with low intrinsic dimensionality (small expansion constant c)
- Exact search is required
When to avoid:
- Data with high intrinsic dimensionality (large c makes c^12 term dominate)
- When only Euclidean distance is needed and d < 20 (KDTree is simpler and faster)
Smile specifics: The CoverTree class accepts generic key types with a Metric<K> distance function. The base of the expansion constant defaults to 1.3 but is configurable.
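To make the "any metric" point concrete, here is the shape of the distance function a cover tree needs. The `Metric` interface below is a hypothetical local stand-in for Smile's `Metric<K>`, shown with Manhattan (L1) distance, which satisfies the metric axioms:

```java
// Hypothetical stand-in for Smile's Metric<K>: any function satisfying
// non-negativity, symmetry, and the triangle inequality will do.
interface Metric<K> {
    double d(K x, K y);
}

public class ManhattanExample {
    // Manhattan (L1) distance is a valid metric on double[].
    static final Metric<double[]> MANHATTAN = (x, y) -> {
        double sum = 0;
        for (int i = 0; i < x.length; i++) sum += Math.abs(x[i] - y[i]);
        return sum;
    };
}
```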
LSH (Locality-Sensitive Hashing)
LSH provides approximate nearest neighbor search by hashing points with random projections so that nearby points collide with high probability.
When to use:
- High-dimensional data (d > 20)
- Approximate results are acceptable
- Euclidean distance metric
- Large dataset where exact search is prohibitively slow
When to avoid:
- Exact results are required
- Non-Euclidean metrics
- Small datasets (the overhead of maintaining hash tables is not justified)
Smile specifics: The projection width w controls the bucket interval. The number of hash tables L is auto-computed as max(50, n^0.25) and projections per table k as max(3, log10(n)) when using the data constructor. Supports incremental insertion via put().
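A single Euclidean LSH hash function has the form h(x) = floor((a . x + b) / w), where a is a random Gaussian vector, b is uniform in [0, w), and w is the projection width mentioned above. The sketch below illustrates just one such function; Smile's LSH composes k of them per table:

```java
import java.util.Random;

// Minimal sketch of one Euclidean LSH hash function. Illustrative only:
// a real LSH index concatenates k such hashes per table across L tables.
public class LshHash {
    final double[] a;   // random Gaussian projection direction
    final double b, w;  // random offset and projection width

    public LshHash(int dim, double w, long seed) {
        Random rng = new Random(seed);
        this.a = new double[dim];
        for (int i = 0; i < dim; i++) a[i] = rng.nextGaussian();
        this.b = rng.nextDouble() * w;
        this.w = w;
    }

    // Bucket index: nearby points land in the same bucket with high probability.
    public int hash(double[] x) {
        double dot = 0;
        for (int i = 0; i < x.length; i++) dot += a[i] * x[i];
        return (int) Math.floor((dot + b) / w);
    }
}
```

A larger w makes collisions more likely (coarser buckets, more candidates to verify); a smaller w makes buckets finer but risks missing true neighbors.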
MPLSH (Multi-Probe LSH)
MPLSH extends LSH with intelligent multi-probe: instead of only checking the exact bucket a query hashes to, it probes nearby buckets likely to contain true neighbors.
When to use:
- Same scenarios as LSH, but when fewer hash tables are desired (reduced memory)
- When a training set is available to fit the posteriori probe model
When to avoid:
- No training data available for the fit() step (falls back to standard LSH)
- Exact results are required
Smile specifics: After construction, call fit(range, samples, radius) with a range search structure and training samples to build the probe model. Without calling fit(), MPLSH behaves identically to LSH.
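The multi-probe idea can be illustrated with a drastically simplified probe sequence: besides the query's own bucket h, also check the adjacent buckets h-1 and h+1, where near points that hashed just over a bucket boundary would land. Smile's MPLSH learns a much better probe order from training data via fit(); this fixed perturbation is a hypothetical sketch of the concept only:

```java
import java.util.ArrayList;
import java.util.List;

// Simplified illustration of multi-probe: probe the exact bucket first,
// then its two neighbors. A real MPLSH ranks perturbations by the learned
// probability that they contain true neighbors.
public class MultiProbe {
    /** Buckets to probe for a single-projection hash value h. */
    public static List<Integer> probeSequence(int h) {
        List<Integer> probes = new ArrayList<>();
        probes.add(h);      // exact bucket first
        probes.add(h - 1);  // then the neighbors where near points
        probes.add(h + 1);  // may have hashed just outside
        return probes;
    }
}
```

Probing several buckets per table is what lets MPLSH achieve comparable recall with far fewer hash tables than plain LSH.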
RandomProjectionForest
A Random Projection Forest builds multiple random projection trees that partition data using random hyperplanes. Queries traverse all trees and compute exact distances only on the candidate set.
When to use:
- High-dimensional data
- Approximate KNN search
- Need k-nearest neighbor graph construction (built-in toGraph() method)
- Angular distance is appropriate (cosine similarity use cases)
When to avoid:
- Range (radius) search needed -- only implements KNNSearch, not RNNSearch
- Exact results are required
Smile specifics: The of() factory method builds trees in parallel. The angular parameter switches between Euclidean and angular distance.
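The node-level split of a random projection tree is simple: project each point onto a random direction and route it by comparison against a threshold (often the median projection). A minimal sketch of that routing step, not Smile's implementation:

```java
// Sketch of a random projection tree's split: project a point onto a random
// direction r and route it left or right of a threshold. Illustration only.
public class RpSplit {
    public static double project(double[] r, double[] x) {
        double dot = 0;
        for (int i = 0; i < x.length; i++) dot += r[i] * x[i];
        return dot;
    }

    /** true = right child (projection at or above the threshold). */
    public static boolean goesRight(double[] r, double[] x, double threshold) {
        return project(r, x) >= threshold;
    }
}
```

Because each tree uses independent random directions, points separated in one tree tend to share a leaf in another, which is why querying all T trees recovers most true neighbors.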
BKTree
The BK-tree is designed specifically for discrete metric spaces where distances are integers (edit distance, Hamming distance, Lee distance, Jaccard distance).
When to use:
- Discrete/integer-valued distance metric
- Range (radius) search for approximate string matching
- Dictionary lookup with edit distance threshold
When to avoid:
- Continuous distance metrics (will throw IllegalArgumentException)
- KNN search needed -- BKTree only implements RNNSearch, not KNNSearch
Smile specifics: The search() method validates that the radius is a positive integer. BKTree supports incremental insertion via add().
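The integer-metric requirement becomes obvious in code: a BK-tree indexes each child by its exact distance to the parent, and range queries use the triangle inequality to prune children. Below is a self-contained sketch of the classic structure over edit distance, not Smile's BKTree:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Classic BK-tree over Levenshtein distance. Children are keyed by their
// integer distance to the parent, which is why the metric must be discrete.
public class BkTree {
    static int levenshtein(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++)
            for (int j = 1; j <= b.length(); j++)
                d[i][j] = Math.min(
                    Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                    d[i - 1][j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1));
        return d[a.length()][b.length()];
    }

    String word;
    Map<Integer, BkTree> children = new HashMap<>();

    BkTree(String word) { this.word = word; }

    void add(String w) {
        int dist = levenshtein(word, w);
        if (dist == 0) return;  // already present
        BkTree child = children.get(dist);
        if (child == null) children.put(dist, new BkTree(w));
        else child.add(w);
    }

    /** Range (RNN) query: all words within the given edit-distance radius. */
    List<String> search(String q, int radius) {
        List<String> out = new ArrayList<>();
        int dist = levenshtein(word, q);
        if (dist <= radius) out.add(word);
        // Triangle inequality: only children in [dist - radius, dist + radius]
        // can contain matches, so the rest are pruned.
        for (int d = dist - radius; d <= dist + radius; d++) {
            BkTree child = children.get(d);
            if (child != null) out.addAll(child.search(q, radius));
        }
        return out;
    }
}
```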
LinearSearch
Linear search (brute force) computes the distance from the query to every point in the dataset.
When to use:
- Small datasets (n < 1000)
- As a correctness baseline for testing other structures
- Any distance function (most general)
- When build time must be zero (no preprocessing)
- High-dimensional data where even approximate methods are not significantly faster
When to avoid:
- Large datasets with latency requirements -- O(n) per query
Smile specifics: LinearSearch uses parallel streams for distance computation (keys.parallelStream().mapToDouble()), providing multi-core acceleration on large datasets. Accepts any Distance<K> (not limited to Metric).
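A brute-force search in the same spirit, using a parallel stream over indices, fits in a few lines. This is an illustrative sketch that accepts any distance function, not Smile's LinearSearch:

```java
import java.util.Comparator;
import java.util.function.ToDoubleBiFunction;
import java.util.stream.IntStream;

// Brute-force 1-NN: compute the distance from the query to every point,
// in parallel, and keep the index of the minimum. No preprocessing needed.
public class BruteForce {
    /** Index of the nearest point to q under the given distance. */
    public static int nearest(double[][] data, double[] q,
                              ToDoubleBiFunction<double[], double[]> dist) {
        return IntStream.range(0, data.length)
                .parallel()
                .boxed()
                .min(Comparator.comparingDouble(
                        (Integer i) -> dist.applyAsDouble(data[i], q)))
                .get();
    }

    public static double euclidean(double[] x, double[] y) {
        double s = 0;
        for (int i = 0; i < x.length; i++) s += (x[i] - y[i]) * (x[i] - y[i]);
        return Math.sqrt(s);
    }
}
```

Because the distance parameter is just a function, swapping in Manhattan, cosine, or any custom distance requires no structural change, which is exactly the generality LinearSearch offers.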
Quick Selection Guide
The following flowchart summarizes the selection process:
1. Is the distance metric discrete (integer-valued)?
   - YES --> BKTree (RNN only) or LinearSearch (KNN+RNN)
   - NO --> continue
2. Is dimensionality low (d < 20)?
   - YES --> KDTree (if Euclidean) or CoverTree (if custom metric)
   - NO --> continue
3. Are approximate results acceptable?
   - YES --> LSH, MPLSH, or RandomProjectionForest
   - NO --> CoverTree (if metric) or LinearSearch (any distance)
4. Is memory a concern with many hash tables?
   - YES --> MPLSH (fewer tables with multi-probe) or RandomProjectionForest
   - NO --> LSH
5. Is angular distance preferred?
   - YES --> RandomProjectionForest (angular=true)
   - NO --> LSH or MPLSH
6. Is the dataset very small (n < 1000)?
   - YES --> LinearSearch (simplest, no overhead)
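The selection flow above can also be encoded as a small decision function. The enum names below mirror the Smile classes, but this is a hypothetical decision aid, not part of any API:

```java
// Hypothetical helper encoding the selection flowchart. Each branch follows
// one question from the guide above; ties are resolved as the guide suggests.
public class IndexSelector {
    public enum Index { BK_TREE, KD_TREE, COVER_TREE, LSH, MPLSH, RP_FOREST, LINEAR }

    public static Index select(boolean discreteMetric, boolean euclidean,
                               int d, long n, boolean approximateOk,
                               boolean memoryTight, boolean angular) {
        if (n < 1000) return Index.LINEAR;          // overhead not justified
        if (discreteMetric) return Index.BK_TREE;   // RNN only
        if (d < 20) return euclidean ? Index.KD_TREE : Index.COVER_TREE;
        if (!approximateOk) return Index.COVER_TREE; // or LINEAR for non-metric distances
        if (angular) return Index.RP_FOREST;
        return memoryTight ? Index.MPLSH : Index.LSH;
    }
}
```

Real selection may also weigh build time, update frequency, and whether both KNN and RNN queries are needed, per the decision matrix above.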
Workflow Position
Spatial Index Selection is Step 2 of the Nearest Neighbor Search workflow:
- Point Data Preparation
- Spatial Index Selection (this principle)
- Spatial Index Construction
- Nearest Neighbor Query
- Neighbor Result Processing