Principle: Smile Spatial Index Selection
Overview
Spatial Index Selection is the process of choosing the right spatial index structure for a nearest neighbor search task based on data characteristics (dimensionality, size, metric type) and application requirements (exact vs. approximate, query type, latency constraints). The smile.neighbor package provides seven index structures, each with distinct algorithmic properties, complexity bounds, and applicability constraints.
Selecting the wrong index structure can result in orders-of-magnitude performance degradation or incorrect results. This principle provides a systematic decision framework grounded in the theoretical properties of each data structure.
Decision Framework
Primary Decision Factors
| Factor | Impact | Key Thresholds |
|---|---|---|
| Dimensionality (d) | Determines viability of tree-based structures | d < 20 for KD-tree; d > 20 favors LSH/MPLSH/RPForest |
| Dataset size (n) | Affects build time and query amortization | Small n (< 1000) may not justify complex structures |
| Distance metric | Constrains available structures | Euclidean: all structures; general metric: CoverTree/LinearSearch; discrete: BKTree |
| Exact vs. approximate | Determines if LSH-family is acceptable | Approximate methods trade accuracy for speed in high dimensions |
| Query type | KNN vs. range (RNN) vs. both | BKTree supports only RNN; RPForest supports only KNN |
| Update frequency | Static vs. dynamic dataset | LSH/MPLSH support incremental insertion via put() |
Decision Matrix
| Structure | Dimensionality | Exactness | Distance | KNN | RNN | Build Complexity | Query Complexity |
|---|---|---|---|---|---|---|---|
| KDTree | Low (d < 20) | Exact | Euclidean | Yes | Yes | O(n log n) | O(n^(1-1/d) + k) |
| CoverTree | Any | Exact | Any metric | Yes | Yes | O(c^6 n log n) | O(c^12 log n + k) |
| LSH | High | Approximate | Euclidean | Yes | Yes | O(nL) | O(L) |
| MPLSH | High | Approximate | Euclidean | Yes | Yes | O(nL) | O(L) with fewer tables |
| RandomProjectionForest | High | Approximate | Euclidean/Angular | Yes | No | O(n T log n) | O(T log n) |
| BKTree | Any | Exact | Discrete metric only | No | Yes | O(n) | Data-dependent |
| LinearSearch | Any | Exact | Any distance | Yes | Yes | O(1) | O(n) |
Where: n = dataset size, d = dimensionality, k = neighbors requested, L = number of hash tables, T = number of trees, c = expansion constant (intrinsic dimensionality measure).
Detailed Structure Analysis
KDTree
The KD-tree is a binary space-partitioning tree that recursively splits the space along the dimension with the greatest spread. It is the optimal choice for low-dimensional exact nearest neighbor search.
When to use:
- Dimensionality d < 20 (ideally d < 10)
- Dataset size n >> 2^d (necessary condition for efficiency)
- Euclidean distance metric
- Both KNN and RNN queries needed
When to avoid:
- High-dimensional data (d > 20) -- degrades to linear scan
- Non-Euclidean metrics -- KDTree is hardcoded to use MathEx.distance() (Euclidean)
Smile specifics: The KDTree class splits on the dimension with maximum range at each node, using midpoint splitting. It implements both KNNSearch and RNNSearch.
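The split rule described above can be sketched in a few lines. This is an illustrative reimplementation of the max-spread/midpoint idea, not Smile's KDTree code:

```java
// Hypothetical sketch of the KD-tree split rule: pick the dimension with the
// greatest spread and split at the midpoint of its range. Illustration only.
public class SplitRule {
    /** Returns {splitDim, splitValue} for one node's point set. */
    public static double[] choose(double[][] points) {
        int d = points[0].length;
        int bestDim = 0;
        double bestSpread = -1, lo = 0, hi = 0;
        for (int j = 0; j < d; j++) {
            double min = Double.POSITIVE_INFINITY, max = Double.NEGATIVE_INFINITY;
            for (double[] p : points) {
                min = Math.min(min, p[j]);
                max = Math.max(max, p[j]);
            }
            if (max - min > bestSpread) {
                bestSpread = max - min;
                bestDim = j;
                lo = min;
                hi = max;
            }
        }
        // Midpoint split: the middle of the range, not the median point.
        return new double[] { bestDim, (lo + hi) / 2.0 };
    }
}
```

Midpoint splitting keeps cells geometrically balanced even when the point distribution is skewed, at the cost of possibly uneven point counts per child.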
CoverTree
The Cover Tree provides exact nearest neighbor search in any metric space. Its complexity depends on the data's expansion constant (c), a measure of intrinsic dimensionality.
When to use:
- Custom or non-Euclidean metric (e.g., geodesic distance, Mahalanobis distance)
- Data with low intrinsic dimensionality (small expansion constant c)
- Exact search is required
When to avoid:
- Data with high intrinsic dimensionality (large c makes c^12 term dominate)
- When only Euclidean distance is needed and d < 20 (KDTree is simpler and faster)
Smile specifics: The CoverTree class accepts generic key types with a Metric<K> distance function. The base of the expansion constant defaults to 1.3 but is configurable.
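To make the "any metric" point concrete, here is the shape of the distance function a cover tree needs. The `Metric` interface below is a hypothetical local stand-in for Smile's `Metric<K>`, shown with Manhattan (L1) distance, which satisfies the metric axioms:

```java
// Hypothetical stand-in for Smile's Metric<K>: any function satisfying
// non-negativity, symmetry, and the triangle inequality will do.
interface Metric<K> {
    double d(K x, K y);
}

public class ManhattanExample {
    // Manhattan (L1) distance is a valid metric on double[].
    static final Metric<double[]> MANHATTAN = (x, y) -> {
        double sum = 0;
        for (int i = 0; i < x.length; i++) sum += Math.abs(x[i] - y[i]);
        return sum;
    };
}
```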
LSH (Locality-Sensitive Hashing)
LSH provides approximate nearest neighbor search by hashing points with random projections so that nearby points collide with high probability.
When to use:
- High-dimensional data (d > 20)
- Approximate results are acceptable
- Euclidean distance metric
- Large dataset where exact search is prohibitively slow
When to avoid:
- Exact results are required
- Non-Euclidean metrics
- Small datasets (the overhead of maintaining hash tables is not justified)
Smile specifics: The projection width w controls the bucket interval. The number of hash tables L is auto-computed as max(50, n^0.25) and projections per table k as max(3, log10(n)) when using the data constructor. Supports incremental insertion via put().
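A single Euclidean LSH hash function has the form h(x) = floor((a . x + b) / w), where a is a random Gaussian vector, b is uniform in [0, w), and w is the projection width mentioned above. The sketch below illustrates just one such function; Smile's LSH composes k of them per table:

```java
import java.util.Random;

// Minimal sketch of one Euclidean LSH hash function. Illustrative only:
// a real LSH index concatenates k such hashes per table across L tables.
public class LshHash {
    final double[] a;   // random Gaussian projection direction
    final double b, w;  // random offset and projection width

    public LshHash(int dim, double w, long seed) {
        Random rng = new Random(seed);
        this.a = new double[dim];
        for (int i = 0; i < dim; i++) a[i] = rng.nextGaussian();
        this.b = rng.nextDouble() * w;
        this.w = w;
    }

    // Bucket index: nearby points land in the same bucket with high probability.
    public int hash(double[] x) {
        double dot = 0;
        for (int i = 0; i < x.length; i++) dot += a[i] * x[i];
        return (int) Math.floor((dot + b) / w);
    }
}
```

A larger w makes collisions more likely (coarser buckets, more candidates to verify); a smaller w makes buckets finer but risks missing true neighbors.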
MPLSH (Multi-Probe LSH)
MPLSH extends LSH with intelligent multi-probe: instead of only checking the exact bucket a query hashes to, it probes nearby buckets likely to contain true neighbors.
When to use:
- Same scenarios as LSH, but when fewer hash tables are desired (reduced memory)
- When a training set is available to fit the posteriori probe model
When to avoid:
- No training data available for the fit() step (falls back to standard LSH)
- Exact results are required
Smile specifics: After construction, call fit(range, samples, radius) with a range search structure and training samples to build the probe model. Without calling fit(), MPLSH behaves identically to LSH.
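The multi-probe idea can be illustrated with a drastically simplified probe sequence: besides the query's own bucket h, also check the adjacent buckets h-1 and h+1, where near points that hashed just over a bucket boundary would land. Smile's MPLSH learns a much better probe order from training data via fit(); this fixed perturbation is a hypothetical sketch of the concept only:

```java
import java.util.ArrayList;
import java.util.List;

// Simplified illustration of multi-probe: probe the exact bucket first,
// then its two neighbors. A real MPLSH ranks perturbations by the learned
// probability that they contain true neighbors.
public class MultiProbe {
    /** Buckets to probe for a single-projection hash value h. */
    public static List<Integer> probeSequence(int h) {
        List<Integer> probes = new ArrayList<>();
        probes.add(h);      // exact bucket first
        probes.add(h - 1);  // then the neighbors where near points
        probes.add(h + 1);  // may have hashed just outside
        return probes;
    }
}
```

Probing several buckets per table is what lets MPLSH achieve comparable recall with far fewer hash tables than plain LSH.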
RandomProjectionForest
A Random Projection Forest builds multiple random projection trees that partition data using random hyperplanes. Queries traverse all trees and compute exact distances only on the candidate set.
When to use:
- High-dimensional data
- Approximate KNN search
- Need k-nearest neighbor graph construction (built-in toGraph() method)
- Angular distance is appropriate (cosine similarity use cases)
When to avoid:
- Range (radius) search needed -- only implements KNNSearch, not RNNSearch
- Exact results are required
Smile specifics: The of() factory method builds trees in parallel. The angular parameter switches between Euclidean and angular distance.
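The node-level split of a random projection tree is simple: project each point onto a random direction and route it by comparison against a threshold (often the median projection). A minimal sketch of that routing step, not Smile's implementation:

```java
// Sketch of a random projection tree's split: project a point onto a random
// direction r and route it left or right of a threshold. Illustration only.
public class RpSplit {
    public static double project(double[] r, double[] x) {
        double dot = 0;
        for (int i = 0; i < x.length; i++) dot += r[i] * x[i];
        return dot;
    }

    /** true = right child (projection at or above the threshold). */
    public static boolean goesRight(double[] r, double[] x, double threshold) {
        return project(r, x) >= threshold;
    }
}
```

Because each tree uses independent random directions, points separated in one tree tend to share a leaf in another, which is why querying all T trees recovers most true neighbors.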
BKTree
The BK-tree is designed specifically for discrete metric spaces where distances are integers (edit distance, Hamming distance, Lee distance, Jaccard distance).
When to use:
- Discrete/integer-valued distance metric
- Range (radius) search for approximate string matching
- Dictionary lookup with edit distance threshold
When to avoid:
- Continuous distance metrics (will throw IllegalArgumentException)
- KNN search needed -- BKTree only implements RNNSearch, not KNNSearch
Smile specifics: The search() method validates that the radius is a positive integer. BKTree supports incremental insertion via add().
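The integer-metric requirement becomes obvious in code: a BK-tree indexes each child by its exact distance to the parent, and range queries use the triangle inequality to prune children. Below is a self-contained sketch of the classic structure over edit distance, not Smile's BKTree:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Classic BK-tree over Levenshtein distance. Children are keyed by their
// integer distance to the parent, which is why the metric must be discrete.
public class BkTree {
    static int levenshtein(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++)
            for (int j = 1; j <= b.length(); j++)
                d[i][j] = Math.min(
                    Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                    d[i - 1][j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1));
        return d[a.length()][b.length()];
    }

    String word;
    Map<Integer, BkTree> children = new HashMap<>();

    BkTree(String word) { this.word = word; }

    void add(String w) {
        int dist = levenshtein(word, w);
        if (dist == 0) return;  // already present
        BkTree child = children.get(dist);
        if (child == null) children.put(dist, new BkTree(w));
        else child.add(w);
    }

    /** Range (RNN) query: all words within the given edit-distance radius. */
    List<String> search(String q, int radius) {
        List<String> out = new ArrayList<>();
        int dist = levenshtein(word, q);
        if (dist <= radius) out.add(word);
        // Triangle inequality: only children in [dist - radius, dist + radius]
        // can contain matches, so the rest are pruned.
        for (int d = dist - radius; d <= dist + radius; d++) {
            BkTree child = children.get(d);
            if (child != null) out.addAll(child.search(q, radius));
        }
        return out;
    }
}
```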
LinearSearch
Linear search (brute force) computes the distance from the query to every point in the dataset.
When to use:
- Small datasets (n < 1000)
- As a correctness baseline for testing other structures
- Any distance function (most general)
- When build time must be zero (no preprocessing)
- High-dimensional data where even approximate methods are not significantly faster
When to avoid:
- Large datasets with latency requirements -- O(n) per query
Smile specifics: LinearSearch uses parallel streams for distance computation (keys.parallelStream().mapToDouble()), providing multi-core acceleration on large datasets. Accepts any Distance<K> (not limited to Metric).
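A brute-force search in the same spirit, using a parallel stream over indices, fits in a few lines. This is an illustrative sketch that accepts any distance function, not Smile's LinearSearch:

```java
import java.util.Comparator;
import java.util.function.ToDoubleBiFunction;
import java.util.stream.IntStream;

// Brute-force 1-NN: compute the distance from the query to every point,
// in parallel, and keep the index of the minimum. No preprocessing needed.
public class BruteForce {
    /** Index of the nearest point to q under the given distance. */
    public static int nearest(double[][] data, double[] q,
                              ToDoubleBiFunction<double[], double[]> dist) {
        return IntStream.range(0, data.length)
                .parallel()
                .boxed()
                .min(Comparator.comparingDouble(
                        (Integer i) -> dist.applyAsDouble(data[i], q)))
                .get();
    }

    public static double euclidean(double[] x, double[] y) {
        double s = 0;
        for (int i = 0; i < x.length; i++) s += (x[i] - y[i]) * (x[i] - y[i]);
        return Math.sqrt(s);
    }
}
```

Because the distance parameter is just a function, swapping in Manhattan, cosine, or any custom distance requires no structural change, which is exactly the generality LinearSearch offers.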
Quick Selection Guide
The following flowchart summarizes the selection process:
1. Is the distance metric discrete (integer-valued)?
   - YES --> BKTree (RNN only) or LinearSearch (KNN+RNN)
   - NO --> continue
2. Is dimensionality low (d < 20)?
   - YES --> KDTree (if Euclidean) or CoverTree (if custom metric)
   - NO --> continue
3. Are approximate results acceptable?
   - YES --> LSH, MPLSH, or RandomProjectionForest
   - NO --> CoverTree (if metric) or LinearSearch (any distance)
4. Is memory a concern with many hash tables?
   - YES --> MPLSH (fewer tables with multi-probe) or RandomProjectionForest
   - NO --> LSH
5. Is angular distance preferred?
   - YES --> RandomProjectionForest (angular=true)
   - NO --> LSH or MPLSH
6. Is the dataset very small (n < 1000)?
   - YES --> LinearSearch (simplest, no overhead)
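The selection flow above can also be encoded as a small decision function. The enum names below mirror the Smile classes, but this is a hypothetical decision aid, not part of any API:

```java
// Hypothetical helper encoding the selection flowchart. Each branch follows
// one question from the guide above; ties are resolved as the guide suggests.
public class IndexSelector {
    public enum Index { BK_TREE, KD_TREE, COVER_TREE, LSH, MPLSH, RP_FOREST, LINEAR }

    public static Index select(boolean discreteMetric, boolean euclidean,
                               int d, long n, boolean approximateOk,
                               boolean memoryTight, boolean angular) {
        if (n < 1000) return Index.LINEAR;          // overhead not justified
        if (discreteMetric) return Index.BK_TREE;   // RNN only
        if (d < 20) return euclidean ? Index.KD_TREE : Index.COVER_TREE;
        if (!approximateOk) return Index.COVER_TREE; // or LINEAR for non-metric distances
        if (angular) return Index.RP_FOREST;
        return memoryTight ? Index.MPLSH : Index.LSH;
    }
}
```

Real selection may also weigh build time, update frequency, and whether both KNN and RNN queries are needed, per the decision matrix above.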
Workflow Position
Spatial Index Selection is Step 2 of the Nearest Neighbor Search workflow:
- Point Data Preparation
- Spatial Index Selection (this principle)
- Spatial Index Construction
- Nearest Neighbor Query
- Neighbor Result Processing