Principle: Spatial Index Construction in Smile (haifengl/smile)
Overview
Spatial Index Construction is the process of building a spatial index data structure from a set of input points. During construction, the algorithm organizes data into a tree, hash table, or forest structure that enables efficient subsequent queries. The construction phase is typically a one-time cost that is amortized over many queries, making its time-space trade-offs critical for overall system performance.
Each index structure in smile.neighbor uses a different algorithmic strategy for organizing data, resulting in different build times, memory footprints, and downstream query characteristics.
Theoretical Basis
KD-Tree: Recursive Binary Space Partitioning
The KD-tree construction algorithm recursively partitions points by splitting along the dimension with the greatest spread (range). At each node:
- Compute bounding box: Find the min and max value along each dimension.
- Select split dimension: Choose the dimension with the largest range (max - min).
- Compute midpoint cutoff: Split at the midpoint of the bounding box along the chosen dimension.
- Partition data: Move points with coordinate < cutoff to the left child, and points >= cutoff to the right child.
- Recurse: Build left and right subtrees.
The tree becomes a leaf when either the bounding box has zero spread (all points are identical along all dimensions) or one side of the partition is empty.
Complexity: O(n log n) average build time, O(n) space. The split strategy (midpoint of range) differs from the traditional median-based split, trading perfect balance for faster construction.
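The recursion above can be sketched in a few lines of Java. This is a minimal illustration with hypothetical names (KdSketch, build), not Smile's actual KDTree implementation:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Minimal KD-tree build sketch using midpoint-of-range splits.
// Illustrative only; names are hypothetical, not smile.neighbor.KDTree.
class KdSketch {
    static final class Node {
        int splitDim;
        double cutoff;
        Node left, right;
        double[] point;   // non-null only for leaves
    }

    static Node build(List<double[]> points) {
        int d = points.get(0).length;
        double[] lo = new double[d], hi = new double[d];
        Arrays.fill(lo, Double.POSITIVE_INFINITY);
        Arrays.fill(hi, Double.NEGATIVE_INFINITY);
        for (double[] p : points)
            for (int i = 0; i < d; i++) {
                lo[i] = Math.min(lo[i], p[i]);
                hi[i] = Math.max(hi[i], p[i]);
            }
        // Select the split dimension with the largest spread (max - min).
        int dim = 0;
        for (int i = 1; i < d; i++)
            if (hi[i] - lo[i] > hi[dim] - lo[dim]) dim = i;

        Node node = new Node();
        if (hi[dim] - lo[dim] == 0.0) {        // zero spread: leaf
            node.point = points.get(0);
            return node;
        }
        double cutoff = (lo[dim] + hi[dim]) / 2;  // midpoint, not median
        List<double[]> left = new ArrayList<>(), right = new ArrayList<>();
        for (double[] p : points)
            (p[dim] < cutoff ? left : right).add(p);
        if (left.isEmpty() || right.isEmpty()) {  // degenerate split: leaf
            node.point = points.get(0);
            return node;
        }
        node.splitDim = dim;
        node.cutoff = cutoff;
        node.left = build(left);
        node.right = build(right);
        return node;
    }
}
```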
Cover Tree: Hierarchical Metric Ball Covering
The Cover Tree is built using a batch insert algorithm that maintains two invariants:
- Covering invariant: Every point is within distance base^i of at least one node at scale i.
- Separation invariant: All nodes at scale i are at least base^i apart from each other.
The algorithm (batchInsert) works recursively:
- Compute max distance from the reference point to all other points.
- Determine the scale (level) based on the max distance.
- Split points into near (within base^scale) and far (beyond base^scale).
- Recursively build the child node from near points.
- Assign remaining near points as children at the current scale.
The base parameter (default 1.3 in Smile) controls the granularity of levels. The resulting tree guarantees O(c^12 log n) query time where c is the expansion constant of the dataset.
Complexity: O(c^6 n log n) build time (theory), O(n) space.
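The scale arithmetic behind the two invariants can be illustrated as follows, assuming the default base of 1.3; the class and method names here are hypothetical, not Smile's CoverTree internals:

```java
// Sketch of the scale/level bookkeeping in a cover tree batch insert.
// Illustrative only; BASE matches Smile's default of 1.3.
class CoverScale {
    static final double BASE = 1.3;

    // Smallest integer scale i such that BASE^i >= d: the level at which
    // a ball of radius BASE^i can cover a point at distance d.
    static int getScale(double d) {
        return (int) Math.ceil(Math.log(d) / Math.log(BASE));
    }

    // Covering invariant: a point at distance dist is covered at scale i
    // when dist <= BASE^i.
    static boolean covers(double dist, int scale) {
        return dist <= Math.pow(BASE, scale);
    }
}
```

The near/far split in batchInsert is then just a test of covers(dist, scale) against each candidate point.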
LSH: Random Projection Hashing
Locality-Sensitive Hashing constructs L independent hash tables, each using k random projection hash functions:
- Generate random projections: For each of L hash tables, generate k random vectors a_i sampled from a stable distribution (Gaussian) and random offsets b_i from Uniform(0, w).
- Hash function: h(x) = floor((a . x + b) / w), where w is the projection width.
- Composite hash: Each hash table combines k individual hash values into a single bucket index using universal hashing.
- Insert all points: For each point, compute its hash in all L tables and insert its index into the corresponding bucket.
The probability of two points hashing to the same bucket is a monotonically decreasing function of their distance, providing the locality-sensitivity guarantee.
Complexity: O(nL) build time, O(nL) space. Parameters: L = number of hash tables, k = projections per table, w = projection width.
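A single p-stable hash function of the form h(x) = floor((a . x + b) / w) might be sketched as follows; this is an illustration, not smile.neighbor's LSH internals:

```java
import java.util.Random;

// Sketch of one p-stable LSH hash function h(x) = floor((a.x + b) / w).
// Illustrative only; field and class names are hypothetical.
class LshHash {
    final double[] a;   // Gaussian random projection vector
    final double b;     // offset sampled from Uniform(0, w)
    final double w;     // projection width

    LshHash(int dim, double w, Random rng) {
        this.w = w;
        this.a = new double[dim];
        for (int i = 0; i < dim; i++) a[i] = rng.nextGaussian();
        this.b = rng.nextDouble() * w;
    }

    // Direct constructor, convenient for deterministic testing.
    LshHash(double[] a, double b, double w) {
        this.a = a; this.b = b; this.w = w;
    }

    int hash(double[] x) {
        double dot = 0;
        for (int i = 0; i < x.length; i++) dot += a[i] * x[i];
        return (int) Math.floor((dot + b) / w);
    }
}
```

A full table would hold k such functions and combine their outputs into one bucket index; L independent tables repeat this with fresh random draws.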
MPLSH: Multi-Probe Extension
MPLSH extends LSH by using MultiProbeHash instead of Hash for its hash tables. Construction is identical to standard LSH; the multi-probe capability is added after construction through the fit() method, which trains lookup tables a posteriori for generating probe sequences.
Complexity: Same as LSH for construction. The fit() step adds an additional training phase.
Random Projection Forest: Ensemble of RP-Trees
The Random Projection Forest constructs multiple random projection trees in parallel:
- For each tree: Build a RandomProjectionTree by recursively partitioning data with random hyperplanes.
- At each node: Select a random hyperplane (either from a random direction or from the difference of two random points), compute the offset, and split points into left/right children.
- Leaf nodes: Stop splitting when the leaf size threshold is reached.
- Flatten: Convert each tree to a FlatTree representation for cache-efficient query traversal.
The Johnson-Lindenstrauss lemma provides the theoretical foundation: random projections approximately preserve pairwise distances, so nearby points tend to end up in the same leaf node.
Complexity: O(n T log n) build time where T is the number of trees. Trees are built in parallel using Java's parallel streams.
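The hyperplane-from-two-points variant of the node split can be sketched as follows; the helper names are hypothetical and this is not Smile's RandomProjectionTree code:

```java
// Sketch of one random-hyperplane split in an RP-tree node. The normal is
// the difference of two sampled points, and the offset places the plane at
// their midpoint. Illustrative only.
class RpSplit {
    // Builds a hyperplane from points p and q. Returns an array of length
    // p.length + 1: the first p.length entries are the normal (p - q), and
    // the last entry is the offset so the plane passes through the midpoint.
    static double[] hyperplane(double[] p, double[] q) {
        double[] h = new double[p.length + 1];
        double offset = 0;
        for (int i = 0; i < p.length; i++) {
            h[i] = p[i] - q[i];
            offset += h[i] * (p[i] + q[i]) / 2;
        }
        h[p.length] = offset;
        return h;
    }

    // True if x falls on the "right" side of the hyperplane
    // (dot(normal, x) >= offset); only the first x.length normal entries
    // are read, so the packed array from hyperplane() can be passed in.
    static boolean right(double[] x, double[] normal, double offset) {
        double dot = 0;
        for (int i = 0; i < x.length; i++) dot += normal[i] * x[i];
        return dot >= offset;
    }
}
```

By construction the two defining points land on opposite sides, and points near one of them tend to follow it, which is what makes the leaves locality-preserving.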
BK-Tree: Discrete Metric Tree
The BK-tree is built incrementally by inserting elements one at a time:
- First element becomes the root.
- For each subsequent element: Compute the integer distance d to the current node. If a child at distance d already exists, recurse into that subtree; otherwise create a new child at distance d holding the element.
This construction leverages the fact that in a discrete metric, the triangle inequality constrains which subtrees need to be explored during search.
Complexity: n insertions, each costing O(depth) distance computations; space is O(n).
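The insertion procedure can be sketched over a simple integer metric (absolute difference here); the class is illustrative, not Smile's BKTree:

```java
import java.util.HashMap;
import java.util.Map;

// Minimal BK-tree insertion sketch over an integer metric (|x - value|).
// Illustrative only; names are hypothetical.
class BkSketch {
    final int value;
    final Map<Integer, BkSketch> children = new HashMap<>();

    BkSketch(int value) { this.value = value; }

    void insert(int x) {
        int d = Math.abs(x - value);               // discrete metric distance
        BkSketch child = children.get(d);
        if (child == null) children.put(d, new BkSketch(x));
        else child.insert(x);                      // recurse into the d-th subtree
    }
}
```

Because the metric is integer-valued, each node needs only a sparse map from distance to child, and the triangle inequality later bounds which of those children a search must visit.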
LinearSearch: No Preprocessing
LinearSearch performs no construction -- it simply stores references to the data array and distance function. All work is deferred to query time.
Complexity: O(1) build time, O(n) space (reference only).
Construction Properties Summary
| Structure | Algorithm | Build Time | Space | Parallelized | Serializable |
|---|---|---|---|---|---|
| KDTree | Midpoint binary partitioning | O(n log n) | O(n) | No | Yes |
| CoverTree | Batch insert with metric balls | O(c^6 n log n) | O(n) | No | Yes |
| LSH | Random projection hash tables | O(nL) | O(nL) | No | Yes |
| MPLSH | Same as LSH + fit() training | O(nL) | O(nL) | No | Yes |
| RPForest | Parallel random projection trees | O(nT log n) | O(nT) | Yes | No |
| BKTree | Incremental insertion | O(n * depth) | O(n) | No | Yes |
| LinearSearch | None (stores references) | O(1) | O(n) | No | Yes |
Key Design Decisions in Smile
KDTree Midpoint Splitting
Smile's KDTree uses midpoint splitting rather than median splitting. The cutoff is computed as (upperBound[i] + lowerBound[i]) / 2 for the dimension with maximum spread. This avoids the O(n) median-finding step at each level, resulting in faster construction at the cost of potentially unbalanced trees. The in-place partitioning (similar to quicksort's partition) avoids additional memory allocation.
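The in-place partition step might look like this (a hypothetical helper shown for illustration, in the spirit of quicksort's partition; no auxiliary arrays are allocated):

```java
// In-place partition of points[lo..hi) around a cutoff along one dimension.
// Points with coordinate < cutoff are moved to the front; returns the index
// of the first point >= cutoff. Illustrative helper, not Smile's code.
class MidpointPartition {
    static int partition(double[][] points, int lo, int hi, int dim, double cutoff) {
        int i = lo;
        for (int j = lo; j < hi; j++) {
            if (points[j][dim] < cutoff) {
                double[] tmp = points[i];
                points[i] = points[j];
                points[j] = tmp;
                i++;
            }
        }
        return i;
    }
}
```

The returned index splits the array into the left and right child ranges, so the recursion can proceed on subranges of the same backing array.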
CoverTree Base Parameter
The default base of 1.3 (rather than the theoretical default of 2) results in a finer-grained level structure with more levels but better search pruning. The base is configurable through the constructor.
LSH Auto-Parameterization
When using the data constructor LSH(double[][] keys, E[] data, double w), the number of hash tables L and projections k are automatically computed as:
L = max(50, n^0.25)
k = max(3, log10(n))
This provides reasonable defaults that scale with dataset size.
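The two formulas can be reproduced directly; the helper class below is hypothetical, but the arithmetic is exactly what is stated above:

```java
// Sketch of the LSH auto-parameterization formulas described above.
// Hypothetical helper; not Smile's LSH constructor code.
class LshParams {
    // L = max(50, n^0.25): number of hash tables.
    static int tables(int n) {
        return (int) Math.max(50, Math.pow(n, 0.25));
    }

    // k = max(3, log10(n)): projections per table.
    static int projections(int n) {
        return (int) Math.max(3, Math.log10(n));
    }
}
```

For a million points this yields L = 50 (the floor dominates until n is very large) and k = 6, i.e. the table count grows very slowly while hash precision tracks the order of magnitude of n.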
RPForest Parallel Construction
The RandomProjectionForest.of() factory method builds trees using IntStream.range(0, numTrees).parallel(), automatically utilizing available CPU cores. Each tree is constructed independently and then flattened to a cache-friendly array representation.
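The parallel fan-out pattern looks roughly like this, with a stand-in for the per-tree build (not the actual RandomProjectionForest code):

```java
import java.util.stream.IntStream;

// Sketch of building independent trees in parallel with IntStream,
// mirroring the pattern described above. buildOne is a stand-in for
// constructing one random projection tree.
class ParallelBuild {
    static int[] buildAll(int numTrees, int n) {
        // Each tree is built independently; parallel() fans the work
        // out across the common fork-join pool, and toArray() preserves
        // the encounter order of the stream.
        return IntStream.range(0, numTrees)
                        .parallel()
                        .map(t -> buildOne(t, n))
                        .toArray();
    }

    // Stand-in for building one tree; returns a dummy per-tree value.
    static int buildOne(int seed, int n) {
        return n + seed;
    }
}
```

Because the trees share no mutable state, no synchronization is needed beyond the stream's own fork-join coordination.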
Workflow Position
Spatial Index Construction is Step 3 of the Nearest Neighbor Search workflow:
- Point Data Preparation
- Spatial Index Selection
- Spatial Index Construction (this principle)
- Nearest Neighbor Query
- Neighbor Result Processing