
Principle:Haifengl Smile Spatial Index Selection

From Leeroopedia


Overview

Spatial Index Selection is the process of choosing the right spatial index structure for a nearest neighbor search task based on data characteristics (dimensionality, size, metric type) and application requirements (exact vs. approximate, query type, latency constraints). The smile.neighbor package provides seven index structures, each with distinct algorithmic properties, complexity bounds, and applicability constraints.

Selecting the wrong index structure can result in orders-of-magnitude performance degradation or incorrect results. This principle provides a systematic decision framework grounded in the theoretical properties of each data structure.

Decision Framework

Primary Decision Factors

Factor | Impact | Key Thresholds
Dimensionality (d) | Determines viability of tree-based structures | d < 20 for KD-tree; d > 20 favors LSH/MPLSH/RPForest
Dataset size (n) | Affects build time and query amortization | Small n (< 1000) may not justify complex structures
Distance metric | Constrains available structures | Euclidean: all structures; general metric: CoverTree/LinearSearch; discrete: BKTree
Exact vs. approximate | Determines if LSH-family is acceptable | Approximate methods trade accuracy for speed in high dimensions
Query type | KNN vs. range (RNN) vs. both | BKTree supports only RNN; RPForest supports only KNN
Update frequency | Static vs. dynamic dataset | LSH/MPLSH support incremental insertion via put()

Decision Matrix

Structure | Dimensionality | Exactness | Distance | KNN | RNN | Build Complexity | Query Complexity
KDTree | Low (d < 20) | Exact | Euclidean | Yes | Yes | O(n log n) | O(n^(1-1/d) + k)
CoverTree | Any | Exact | Any metric | Yes | Yes | O(c^6 n log n) | O(c^12 log n + k)
LSH | High | Approximate | Euclidean | Yes | Yes | O(nL) | O(L)
MPLSH | High | Approximate | Euclidean | Yes | Yes | O(nL) | O(L) with fewer tables
RandomProjectionForest | High | Approximate | Euclidean/Angular | Yes | No | O(n T log n) | O(T log n)
BKTree | Any | Exact | Discrete metric only | No | Yes | O(n) | Data-dependent
LinearSearch | Any | Exact | Any distance | Yes | Yes | O(1) | O(n)

Where: n = dataset size, d = dimensionality, k = neighbors requested, L = number of hash tables, T = number of trees, c = expansion constant (intrinsic dimensionality measure).

Detailed Structure Analysis

KDTree

The KD-tree is a binary space-partitioning tree that recursively splits the space along the dimension with the greatest spread. It is the optimal choice for low-dimensional exact nearest neighbor search.

When to use:

  • Dimensionality d < 20 (ideally d < 10)
  • Dataset size n >> 2^d (necessary condition for efficiency)
  • Euclidean distance metric
  • Both KNN and RNN queries needed

When to avoid:

  • High-dimensional data (d > 20) -- degrades to linear scan
  • Non-Euclidean metrics -- KDTree is hardcoded to use MathEx.distance() (Euclidean)

Smile specifics: The KDTree class splits on the dimension with maximum range at each node, using midpoint splitting. It implements both KNNSearch and RNNSearch.
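The split rule described above can be sketched in a few lines of plain Java. This is a standalone illustration of the idea (not Smile's actual implementation): at each node, find the dimension with the greatest spread and split at its midpoint.

```java
import java.util.Arrays;

public class MaxSpreadSplit {
    /** Returns {splitDim, midpoint}: the dimension with the greatest spread and its midpoint. */
    static double[] chooseSplit(double[][] points) {
        int d = points[0].length;
        int splitDim = 0;
        double bestSpread = -1, mid = 0;
        for (int j = 0; j < d; j++) {
            double lo = Double.POSITIVE_INFINITY, hi = Double.NEGATIVE_INFINITY;
            for (double[] p : points) {
                lo = Math.min(lo, p[j]);
                hi = Math.max(hi, p[j]);
            }
            if (hi - lo > bestSpread) {
                bestSpread = hi - lo;
                splitDim = j;
                mid = (lo + hi) / 2;   // midpoint splitting, as described above
            }
        }
        return new double[] {splitDim, mid};
    }

    public static void main(String[] args) {
        // Dimension 0 spans [0, 2]; dimension 1 spans [-10, 10], so dimension 1 wins.
        double[][] pts = {{0, 0}, {1, 10}, {2, -10}};
        System.out.println(Arrays.toString(chooseSplit(pts))); // [1.0, 0.0]
    }
}
```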

CoverTree

The Cover Tree provides exact nearest neighbor search in any metric space. Its complexity depends on the data's expansion constant (c), a measure of intrinsic dimensionality.

When to use:

  • Custom or non-Euclidean metric (e.g., geodesic distance, Mahalanobis distance)
  • Data with low intrinsic dimensionality (small expansion constant c)
  • Exact search is required

When to avoid:

  • Data with high intrinsic dimensionality (large c makes c^12 term dominate)
  • When only Euclidean distance is needed and d < 20 (KDTree is simpler and faster)

Smile specifics: The CoverTree class accepts generic key types with a Metric<K> distance function. The base of the expansion constant defaults to 1.3 but is configurable.
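To see how sharply the expansion constant matters, compare the c^12 log n query term against a plain O(n) scan. This is illustrative arithmetic only (constant factors are ignored; real crossover points differ):

```java
public class CoverTreeCost {
    /** Illustrative: the c^12 * log2(n) query term from the complexity bound above. */
    static double queryTerm(double c, int n) {
        return Math.pow(c, 12) * (Math.log(n) / Math.log(2));
    }

    public static void main(String[] args) {
        int n = 1_000_000;
        // Low intrinsic dimensionality (c = 1.3): roughly 23 * 20 = ~465, far below n.
        System.out.println(queryTerm(1.3, n) < n);  // true
        // High intrinsic dimensionality (c = 4): 4^12 = ~16.8M, and the term swamps a linear scan.
        System.out.println(queryTerm(4.0, n) > n);  // true
    }
}
```

This is why the "when to avoid" guidance above singles out data with a large expansion constant.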

LSH (Locality-Sensitive Hashing)

LSH provides approximate nearest neighbor search by hashing points with random projections so that nearby points collide with high probability.

When to use:

  • High-dimensional data (d > 20)
  • Approximate results are acceptable
  • Euclidean distance metric
  • Large dataset where exact search is prohibitively slow

When to avoid:

  • Exact results are required
  • Non-Euclidean metrics
  • Small datasets (overhead of hash tables not justified)

Smile specifics: The projection width w controls the bucket interval. The number of hash tables L is auto-computed as max(50, n^0.25) and projections per table k as max(3, log10(n)) when using the data constructor. Supports incremental insertion via put().
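The auto-computed defaults quoted above can be reproduced directly. The formulas are taken from the text; the exact rounding is an assumption (integer truncation here), not a transcription of Smile's source:

```java
public class LshDefaults {
    /** Number of hash tables: L = max(50, n^0.25), per the description above. */
    static int tables(int n) {
        return Math.max(50, (int) Math.pow(n, 0.25));
    }

    /** Projections per table: k = max(3, log10(n)), per the description above. */
    static int projections(int n) {
        return Math.max(3, (int) Math.log10(n));
    }

    public static void main(String[] args) {
        System.out.println(tables(1_000_000));      // n^0.25 = ~31.6, so the floor of 50 applies
        System.out.println(projections(1_000_000)); // log10(1e6) = 6
    }
}
```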

MPLSH (Multi-Probe LSH)

MPLSH extends LSH with intelligent multi-probe: instead of only checking the exact bucket a query hashes to, it probes nearby buckets likely to contain true neighbors.

When to use:

  • Same scenarios as LSH, but when fewer hash tables are desired (reduced memory)
  • When a training set is available to fit the posterior probe model

When to avoid:

  • No training data available for the fit() step (falls back to standard LSH)
  • Exact results required

Smile specifics: After construction, call fit(range, samples, radius) with a range search structure and training samples to build the probe model. Without calling fit(), MPLSH behaves identically to LSH.

RandomProjectionForest

A Random Projection Forest builds multiple random projection trees that partition data using random hyperplanes. Queries traverse all trees and compute exact distances only on the candidate set.

When to use:

  • High-dimensional data
  • Approximate KNN search
  • Need k-nearest neighbor graph construction (built-in toGraph() method)
  • Angular distance is appropriate (cosine similarity use cases)

When to avoid:

  • Range (radius) search needed -- only implements KNNSearch, not RNNSearch
  • Exact results required

Smile specifics: The of() factory method builds trees in parallel. The angular parameter switches between Euclidean and angular distance.
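The core partitioning step — assigning points to either side of a random hyperplane by the sign of their projection — can be sketched as follows. This is a standalone illustration of the technique, not Smile's code; the fixed direction vector stands in for a randomly drawn one:

```java
public class RandomProjectionSplit {
    /** Signed projection of a point onto a hyperplane normal (dot product). */
    static double project(double[] point, double[] direction) {
        double dot = 0;
        for (int i = 0; i < point.length; i++) dot += point[i] * direction[i];
        return dot;
    }

    public static void main(String[] args) {
        // A fixed "random" normal for the demo: the hyperplane is x = 0.
        double[] direction = {1, 0};
        double[] a = {2, 5}, b = {-3, 5};
        // a and b fall on opposite sides, so they land in different children.
        System.out.println(project(a, direction) > 0);  // true
        System.out.println(project(b, direction) > 0);  // false
    }
}
```

Each tree in the forest draws different directions, so two points separated in one tree often share a leaf in another; the union of leaf candidates is then re-ranked by exact distance.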

BKTree

The BK-tree is designed specifically for discrete metric spaces where distances are integers (edit distance, Hamming distance, Lee distance, Jaccard distance).

When to use:

  • Discrete/integer-valued distance metric
  • Range (radius) search for approximate string matching
  • Dictionary lookup with edit distance threshold

When to avoid:

  • Continuous distance metrics (will throw IllegalArgumentException)
  • KNN search needed -- BKTree only implements RNNSearch, not KNNSearch

Smile specifics: The search() method validates that the radius is a positive integer. BKTree supports incremental insertion via add().
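A minimal BK-tree can be written in a few dozen lines. The sketch below uses Hamming distance over equal-length strings for brevity (the same structure works for edit distance); it is an illustration of the data structure, not Smile's BKTree:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Minimal BK-tree sketch over Hamming distance (equal-length strings). */
public class MiniBKTree {
    private final String word;
    private final Map<Integer, MiniBKTree> children = new HashMap<>();

    MiniBKTree(String word) { this.word = word; }

    static int hamming(String a, String b) {
        int d = 0;
        for (int i = 0; i < a.length(); i++) if (a.charAt(i) != b.charAt(i)) d++;
        return d;  // always a non-negative integer, as BK-trees require
    }

    void add(String w) {
        int d = hamming(word, w);
        if (d == 0) return;  // already present
        MiniBKTree child = children.get(d);
        if (child == null) children.put(d, new MiniBKTree(w));
        else child.add(w);
    }

    /** Range (RNN) search: collect all words within radius of the query. */
    void search(String query, int radius, List<String> out) {
        int d = hamming(word, query);
        if (d <= radius) out.add(word);
        // The triangle inequality prunes children outside [d - radius, d + radius].
        for (int i = Math.max(0, d - radius); i <= d + radius; i++) {
            MiniBKTree child = children.get(i);
            if (child != null) child.search(query, radius, out);
        }
    }

    public static void main(String[] args) {
        MiniBKTree tree = new MiniBKTree("book");
        for (String w : new String[] {"back", "cook", "boon", "cake"}) tree.add(w);
        List<String> hits = new ArrayList<>();
        tree.search("bool", 1, hits);
        System.out.println(hits);  // [book, boon]
    }
}
```

Note that the structure answers only range queries, which is exactly why the table above lists BKTree as RNN-only.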

LinearSearch

Linear search (brute force) computes the distance from the query to every point in the dataset.

When to use:

  • Small datasets (n < 1000)
  • As a correctness baseline for testing other structures
  • Any distance function (most general)
  • When build time must be zero (no preprocessing)
  • High-dimensional data where even approximate methods are not significantly faster

When to avoid:

  • Large datasets with latency requirements -- O(n) per query

Smile specifics: LinearSearch uses parallel streams for distance computation (keys.parallelStream().mapToDouble()), providing multi-core acceleration on large datasets. Accepts any Distance<K> (not limited to Metric).
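The same parallel-stream pattern can be sketched with only the standard library (this is an illustration of the approach, not Smile's LinearSearch class):

```java
import java.util.Arrays;
import java.util.Comparator;

public class ParallelLinearSearch {
    static double distance(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(s);
    }

    /** Brute-force nearest neighbor via a parallel stream over all points. */
    static double[] nearest(double[][] data, double[] query) {
        return Arrays.stream(data)
                .parallel()  // multi-core scan, analogous to keys.parallelStream() in Smile
                .min(Comparator.comparingDouble((double[] p) -> distance(p, query)))
                .orElseThrow();
    }

    public static void main(String[] args) {
        double[][] data = {{0, 0}, {5, 5}, {1, 1}};
        System.out.println(Arrays.toString(nearest(data, new double[] {0.9, 0.9})));  // [1.0, 1.0]
    }
}
```

Because there is no index, any Distance function can be dropped in, including non-metric ones.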

Quick Selection Guide

The following flowchart summarizes the selection process:

Is the distance metric discrete (integer-valued)?
  YES --> BKTree (RNN only) or LinearSearch (KNN+RNN)
  NO  --> Continue

Is dimensionality low (d < 20)?
  YES --> KDTree (if Euclidean) or CoverTree (if custom metric)
  NO  --> Continue

Are approximate results acceptable?
  YES --> LSH, MPLSH, or RandomProjectionForest
  NO  --> CoverTree (if metric) or LinearSearch (any distance)

Is memory a concern with many hash tables?
  YES --> MPLSH (fewer tables with multi-probe) or RandomProjectionForest
  NO  --> LSH

Is angular distance preferred?
  YES --> RandomProjectionForest (angular=true)
  NO  --> LSH or MPLSH

Is the dataset very small (n < 1000)?
  YES --> LinearSearch (simplest, no overhead)
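The flowchart above can be encoded as a small helper. The method and parameter names are illustrative (not part of Smile), and the small-dataset check is applied first since it short-circuits everything else:

```java
public class IndexChooser {
    /** Illustrative encoding of the selection flowchart above. */
    static String choose(int n, int d, boolean discreteMetric, boolean euclidean,
                         boolean isMetric, boolean approximateOk) {
        if (n < 1000) return "LinearSearch";          // tiny data: index overhead not justified
        if (discreteMetric) return "BKTree";          // RNN only; pair with LinearSearch for KNN
        if (d < 20) return euclidean ? "KDTree" : "CoverTree";
        if (approximateOk) return "LSH/MPLSH/RandomProjectionForest";
        return isMetric ? "CoverTree" : "LinearSearch";
    }

    public static void main(String[] args) {
        System.out.println(choose(1_000_000, 3, false, true, true, false));  // KDTree
        System.out.println(choose(1_000_000, 128, false, true, true, true)); // LSH family
        System.out.println(choose(500, 128, false, true, true, true));       // LinearSearch
    }
}
```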

Workflow Position

Spatial Index Selection is Step 2 of the Nearest Neighbor Search workflow:

  1. Point Data Preparation
  2. Spatial Index Selection (this principle)
  3. Spatial Index Construction
  4. Nearest Neighbor Query
  5. Neighbor Result Processing
