Principle: Spatial Index Construction in Smile (haifengl/smile)
Overview
Spatial Index Construction is the process of building a spatial index data structure from a set of input points. During construction, the algorithm organizes data into a tree, hash table, or forest structure that enables efficient subsequent queries. The construction phase is typically a one-time cost that is amortized over many queries, making its time-space trade-offs critical for overall system performance.
Each index structure in smile.neighbor uses a different algorithmic strategy for organizing data, resulting in different build times, memory footprints, and downstream query characteristics.
Theoretical Basis
KD-Tree: Recursive Binary Space Partitioning
The KD-tree construction algorithm recursively partitions points by splitting along the dimension with the greatest spread (range). At each node:
- Compute bounding box: Find the min and max value along each dimension.
- Select split dimension: Choose the dimension with the largest range (max - min).
- Compute midpoint cutoff: Split at the midpoint of the bounding box along the chosen dimension.
- Partition data: Move points with coordinate < cutoff to the left child, and points >= cutoff to the right child.
- Recurse: Build left and right subtrees.
The tree becomes a leaf when either the bounding box has zero spread (all points are identical along all dimensions) or one side of the partition is empty.
Complexity: O(n log n) average build time, O(n) space. The split strategy (midpoint of range) differs from the traditional median-based split, trading perfect balance for faster construction.
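The recursion above can be sketched in a few lines of Java. This is a minimal illustration with hypothetical names (KdSketch, build), not Smile's actual KDTree implementation:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Minimal KD-tree build sketch using midpoint-of-range splits.
// Illustrative only; names are hypothetical, not smile.neighbor.KDTree.
class KdSketch {
    static final class Node {
        int splitDim;
        double cutoff;
        Node left, right;
        double[] point;   // non-null only for leaves
    }

    static Node build(List<double[]> points) {
        int d = points.get(0).length;
        double[] lo = new double[d], hi = new double[d];
        Arrays.fill(lo, Double.POSITIVE_INFINITY);
        Arrays.fill(hi, Double.NEGATIVE_INFINITY);
        for (double[] p : points)
            for (int i = 0; i < d; i++) {
                lo[i] = Math.min(lo[i], p[i]);
                hi[i] = Math.max(hi[i], p[i]);
            }
        // Select the split dimension with the largest spread (max - min).
        int dim = 0;
        for (int i = 1; i < d; i++)
            if (hi[i] - lo[i] > hi[dim] - lo[dim]) dim = i;

        Node node = new Node();
        if (hi[dim] - lo[dim] == 0.0) {        // zero spread: leaf
            node.point = points.get(0);
            return node;
        }
        double cutoff = (lo[dim] + hi[dim]) / 2;  // midpoint, not median
        List<double[]> left = new ArrayList<>(), right = new ArrayList<>();
        for (double[] p : points)
            (p[dim] < cutoff ? left : right).add(p);
        if (left.isEmpty() || right.isEmpty()) {  // degenerate split: leaf
            node.point = points.get(0);
            return node;
        }
        node.splitDim = dim;
        node.cutoff = cutoff;
        node.left = build(left);
        node.right = build(right);
        return node;
    }
}
```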
Cover Tree: Hierarchical Metric Ball Covering
The Cover Tree is built using a batch insert algorithm that maintains two invariants:
- Covering invariant: Every point is within distance base^i of at least one node at scale i.
- Separation invariant: All nodes at scale i are at least base^i apart from each other.
The algorithm (batchInsert) works recursively:
- Compute max distance from the reference point to all other points.
- Determine the scale (level) based on the max distance.
- Split points into near (within base^scale) and far (beyond base^scale).
- Recursively build the child node from near points.
- Assign remaining near points as children at the current scale.
The base parameter (default 1.3 in Smile) controls the granularity of levels. The resulting tree guarantees O(c^12 log n) query time where c is the expansion constant of the dataset.
Complexity: O(c^6 n log n) build time (theory), O(n) space.
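The scale arithmetic behind the two invariants can be illustrated as follows, assuming the default base of 1.3; the class and method names here are hypothetical, not Smile's CoverTree internals:

```java
// Sketch of the scale/level bookkeeping in a cover tree batch insert.
// Illustrative only; BASE matches Smile's default of 1.3.
class CoverScale {
    static final double BASE = 1.3;

    // Smallest integer scale i such that BASE^i >= d: the level at which
    // a ball of radius BASE^i can cover a point at distance d.
    static int getScale(double d) {
        return (int) Math.ceil(Math.log(d) / Math.log(BASE));
    }

    // Covering invariant: a point at distance dist is covered at scale i
    // when dist <= BASE^i.
    static boolean covers(double dist, int scale) {
        return dist <= Math.pow(BASE, scale);
    }
}
```

The near/far split in batchInsert is then just a test of covers(dist, scale) against each candidate point.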
LSH: Random Projection Hashing
Locality-Sensitive Hashing constructs L independent hash tables, each using k random projection hash functions:
- Generate random projections: For each of L hash tables, generate k random vectors a_i sampled from a stable distribution (Gaussian) and random offsets b_i from Uniform(0, w).
- Hash function: h(x) = floor((a . x + b) / w), where w is the projection width.
- Composite hash: Each hash table combines k individual hash values into a single bucket index using universal hashing.
- Insert all points: For each point, compute its hash in all L tables and insert its index into the corresponding bucket.
The probability of two points hashing to the same bucket is a monotonically decreasing function of their distance, providing the locality-sensitivity guarantee.
Complexity: O(nL) build time, O(nL) space. Parameters: L = number of hash tables, k = projections per table, w = projection width.
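A single p-stable hash function of the form h(x) = floor((a . x + b) / w) might be sketched as follows; this is an illustration, not smile.neighbor's LSH internals:

```java
import java.util.Random;

// Sketch of one p-stable LSH hash function h(x) = floor((a.x + b) / w).
// Illustrative only; field and class names are hypothetical.
class LshHash {
    final double[] a;   // Gaussian random projection vector
    final double b;     // offset sampled from Uniform(0, w)
    final double w;     // projection width

    LshHash(int dim, double w, Random rng) {
        this.w = w;
        this.a = new double[dim];
        for (int i = 0; i < dim; i++) a[i] = rng.nextGaussian();
        this.b = rng.nextDouble() * w;
    }

    // Direct constructor, convenient for deterministic testing.
    LshHash(double[] a, double b, double w) {
        this.a = a; this.b = b; this.w = w;
    }

    int hash(double[] x) {
        double dot = 0;
        for (int i = 0; i < x.length; i++) dot += a[i] * x[i];
        return (int) Math.floor((dot + b) / w);
    }
}
```

A full table would hold k such functions and combine their outputs into one bucket index; L independent tables repeat this with fresh random draws.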
MPLSH: Multi-Probe Extension
MPLSH extends LSH by using MultiProbeHash instead of Hash for its hash tables. Construction is identical to standard LSH; the multi-probe capability is added after construction through the fit() method, which trains lookup tables a posteriori for generating probe sequences.
Complexity: Same as LSH for construction. The fit() step adds an additional training phase.
Random Projection Forest: Ensemble of RP-Trees
The Random Projection Forest constructs multiple random projection trees in parallel:
- For each tree: Build a RandomProjectionTree by recursively partitioning data with random hyperplanes.
- At each node: Select a random hyperplane (either from a random direction or from the difference of two random points), compute the offset, and split points into left/right children.
- Leaf nodes: Stop splitting when the leaf size threshold is reached.
- Flatten: Convert each tree to a FlatTree representation for cache-efficient query traversal.
The Johnson-Lindenstrauss lemma provides the theoretical foundation: random projections approximately preserve pairwise distances, so nearby points tend to end up in the same leaf node.
Complexity: O(n T log n) build time where T is the number of trees. Trees are built in parallel using Java's parallel streams.
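The hyperplane-from-two-points variant of the node split can be sketched as follows; the helper names are hypothetical and this is not Smile's RandomProjectionTree code:

```java
// Sketch of one random-hyperplane split in an RP-tree node. The normal is
// the difference of two sampled points, and the offset places the plane at
// their midpoint. Illustrative only.
class RpSplit {
    // Builds a hyperplane from points p and q. Returns an array of length
    // p.length + 1: the first p.length entries are the normal (p - q), and
    // the last entry is the offset so the plane passes through the midpoint.
    static double[] hyperplane(double[] p, double[] q) {
        double[] h = new double[p.length + 1];
        double offset = 0;
        for (int i = 0; i < p.length; i++) {
            h[i] = p[i] - q[i];
            offset += h[i] * (p[i] + q[i]) / 2;
        }
        h[p.length] = offset;
        return h;
    }

    // True if x falls on the "right" side of the hyperplane
    // (dot(normal, x) >= offset); only the first x.length normal entries
    // are read, so the packed array from hyperplane() can be passed in.
    static boolean right(double[] x, double[] normal, double offset) {
        double dot = 0;
        for (int i = 0; i < x.length; i++) dot += normal[i] * x[i];
        return dot >= offset;
    }
}
```

By construction the two defining points land on opposite sides, and points near one of them tend to follow it, which is what makes the leaves locality-preserving.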
BK-Tree: Discrete Metric Tree
The BK-tree is built incrementally by inserting elements one at a time:
- First element becomes the root.
- For each subsequent element: Compute the integer distance d to the current node. If a child at distance d already exists, recurse into that subtree; otherwise create a new child at distance d holding the element.
This construction leverages the fact that in a discrete metric, the triangle inequality constrains which subtrees need to be explored during search.
Complexity: n insertions, each costing O(depth) distance computations; space is O(n).
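The insertion procedure can be sketched over a simple integer metric (absolute difference here); the class is illustrative, not Smile's BKTree:

```java
import java.util.HashMap;
import java.util.Map;

// Minimal BK-tree insertion sketch over an integer metric (|x - value|).
// Illustrative only; names are hypothetical.
class BkSketch {
    final int value;
    final Map<Integer, BkSketch> children = new HashMap<>();

    BkSketch(int value) { this.value = value; }

    void insert(int x) {
        int d = Math.abs(x - value);               // discrete metric distance
        BkSketch child = children.get(d);
        if (child == null) children.put(d, new BkSketch(x));
        else child.insert(x);                      // recurse into the d-th subtree
    }
}
```

Because the metric is integer-valued, each node needs only a sparse map from distance to child, and the triangle inequality later bounds which of those children a search must visit.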
LinearSearch: No Preprocessing
LinearSearch performs no construction -- it simply stores references to the data array and distance function. All work is deferred to query time.
Complexity: O(1) build time, O(n) space (reference only).
Construction Properties Summary
| Structure | Algorithm | Build Time | Space | Parallelized | Serializable |
|---|---|---|---|---|---|
| KDTree | Midpoint binary partitioning | O(n log n) | O(n) | No | Yes |
| CoverTree | Batch insert with metric balls | O(c^6 n log n) | O(n) | No | Yes |
| LSH | Random projection hash tables | O(nL) | O(nL) | No | Yes |
| MPLSH | Same as LSH + fit() training | O(nL) | O(nL) | No | Yes |
| RPForest | Parallel random projection trees | O(nT log n) | O(nT) | Yes | No |
| BKTree | Incremental insertion | O(n * depth) | O(n) | No | Yes |
| LinearSearch | None (stores references) | O(1) | O(n) | No | Yes |
Key Design Decisions in Smile
KDTree Midpoint Splitting
Smile's KDTree uses midpoint splitting rather than median splitting. The cutoff is computed as (upperBound[i] + lowerBound[i]) / 2 for the dimension with maximum spread. This avoids the O(n) median-finding step at each level, resulting in faster construction at the cost of potentially unbalanced trees. The in-place partitioning (similar to quicksort's partition) avoids additional memory allocation.
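The in-place partition step might look like this (a hypothetical helper shown for illustration, in the spirit of quicksort's partition; no auxiliary arrays are allocated):

```java
// In-place partition of points[lo..hi) around a cutoff along one dimension.
// Points with coordinate < cutoff are moved to the front; returns the index
// of the first point >= cutoff. Illustrative helper, not Smile's code.
class MidpointPartition {
    static int partition(double[][] points, int lo, int hi, int dim, double cutoff) {
        int i = lo;
        for (int j = lo; j < hi; j++) {
            if (points[j][dim] < cutoff) {
                double[] tmp = points[i];
                points[i] = points[j];
                points[j] = tmp;
                i++;
            }
        }
        return i;
    }
}
```

The returned index splits the array into the left and right child ranges, so the recursion can proceed on subranges of the same backing array.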
CoverTree Base Parameter
The default base of 1.3 (rather than the theoretical default of 2) results in a finer-grained level structure with more levels but better search pruning. The base is configurable through the constructor.
LSH Auto-Parameterization
When using the data constructor LSH(double[][] keys, E[] data, double w), the number of hash tables L and projections k are automatically computed as:
L = max(50, n^0.25)
k = max(3, log10(n))
This provides reasonable defaults that scale with dataset size.
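The two formulas can be reproduced directly; the helper class below is hypothetical, but the arithmetic is exactly what is stated above:

```java
// Sketch of the LSH auto-parameterization formulas described above.
// Hypothetical helper; not Smile's LSH constructor code.
class LshParams {
    // L = max(50, n^0.25): number of hash tables.
    static int tables(int n) {
        return (int) Math.max(50, Math.pow(n, 0.25));
    }

    // k = max(3, log10(n)): projections per table.
    static int projections(int n) {
        return (int) Math.max(3, Math.log10(n));
    }
}
```

For a million points this yields L = 50 (the floor dominates until n is very large) and k = 6, i.e. the table count grows very slowly while hash precision tracks the order of magnitude of n.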
RPForest Parallel Construction
The RandomProjectionForest.of() factory method builds trees using IntStream.range(0, numTrees).parallel(), automatically utilizing available CPU cores. Each tree is constructed independently and then flattened to a cache-friendly array representation.
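The parallel fan-out pattern looks roughly like this, with a stand-in for the per-tree build (not the actual RandomProjectionForest code):

```java
import java.util.stream.IntStream;

// Sketch of building independent trees in parallel with IntStream,
// mirroring the pattern described above. buildOne is a stand-in for
// constructing one random projection tree.
class ParallelBuild {
    static int[] buildAll(int numTrees, int n) {
        // Each tree is built independently; parallel() fans the work
        // out across the common fork-join pool, and toArray() preserves
        // the encounter order of the stream.
        return IntStream.range(0, numTrees)
                        .parallel()
                        .map(t -> buildOne(t, n))
                        .toArray();
    }

    // Stand-in for building one tree; returns a dummy per-tree value.
    static int buildOne(int seed, int n) {
        return n + seed;
    }
}
```

Because the trees share no mutable state, no synchronization is needed beyond the stream's own fork-join coordination.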
Workflow Position
Spatial Index Construction is Step 3 of the Nearest Neighbor Search workflow:
- Point Data Preparation
- Spatial Index Selection
- Spatial Index Construction (this principle)
- Nearest Neighbor Query
- Neighbor Result Processing