Implementation:Haifengl Smile Index Structure Selection

Overview

Index Structure Selection is a Pattern Doc documenting the decision process for choosing among Smile's seven spatial index structures. This page catalogs the available classes in the smile.neighbor package, their interface implementations, constructor signatures, and practical trade-offs to guide the selection decision described in the corresponding principle.

Available Structures

All index structures reside in the smile.neighbor package within the base module:

base/src/main/java/smile/neighbor/
    KDTree.java
    CoverTree.java
    LSH.java
    MPLSH.java
    RandomProjectionForest.java
    BKTree.java
    LinearSearch.java

Import statement for all structures:

import smile.neighbor.*;

Interface Implementation Summary

The two core interfaces that define search capabilities are KNNSearch<K, V> and RNNSearch<K, V>. Each structure implements one or both:

Class	`KNNSearch`	`RNNSearch`	`Serializable`	Key Type
`KDTree<E>`	Yes	Yes	Yes	`double[]`
`CoverTree<K, V>`	Yes	Yes	Yes	Generic `K`
`LSH<E>`	Yes	Yes	Yes	`double[]`
`MPLSH<E>`	Yes	Yes	Yes	`double[]`
`RandomProjectionForest`	Yes	No	No	`double[]`
`BKTree<K, V>`	No	Yes	Yes	Generic `K`
`LinearSearch<K, V>`	Yes	Yes	Yes	Generic `K`

Selection by Use Case

Use Case 1: Low-Dimensional Exact Search (Euclidean)

import smile.neighbor.KDTree;

// Best choice for d < 20, exact results, Euclidean metric
double[][] data = loadLowDimData(); // e.g., 3D spatial coordinates
KDTree<double[]> index = KDTree.of(data);

Why KDTree: O(n log n) build, O(n^(1-1/d) + k) expected query time. The binary space partitioning exploits low dimensionality for efficient pruning. As stated in the KDTree Javadoc: "if the dimensionality is D, then number of points in the dataset, N, should be N >> 2^D".

Use Case 2: Custom Metric Exact Search

import smile.neighbor.CoverTree;
import smile.math.distance.Metric;

// Best choice for custom metrics with exact results
double[][] data = loadData();
Metric<double[]> customMetric = (a, b) -> {
    // e.g., Manhattan distance, Mahalanobis, etc.
    double sum = 0;
    for (int i = 0; i < a.length; i++)
        sum += Math.abs(a[i] - b[i]);
    return sum;
};
CoverTree<double[], double[]> index = CoverTree.of(data, customMetric);

Why CoverTree: Only tree-based exact structure that accepts arbitrary Metric<K>. Query time O(c^12 log n) depends on the data's expansion constant, not ambient dimensionality.

Use Case 3: High-Dimensional Approximate Search

import smile.neighbor.LSH;

// Best choice for high-dim approximate search with Euclidean distance
double[][] data = loadHighDimData(); // e.g., 256-dim feature vectors
double w = 4.0; // projection width -- tune for your data scale
LSH<double[]> index = new LSH<>(data, data, w);

Why LSH: Hash-based structure that avoids the curse of dimensionality. Query time O(L) is independent of dataset size. The projection width w is the main tuning parameter.

Use Case 4: High-Dimensional with Memory Constraints

import smile.neighbor.MPLSH;
import smile.neighbor.LinearSearch;
import smile.math.MathEx;

// MPLSH needs fewer hash tables than standard LSH for same recall
int d = 256;                         // dimensionality
int L = 10;                          // fewer hash tables than LSH
int k = (int) Math.log10(data.length); // projections per hash
double w = 4.0;

MPLSH<double[]> index = new MPLSH<>(d, L, k, w);
for (int i = 0; i < data.length; i++) {
    index.put(data[i], data[i]);
}

// Train the multi-probe model for better recall with fewer tables
LinearSearch<double[], double[]> baseline = LinearSearch.of(data, MathEx::distance);
index.fit(baseline, sampleQueries, radius);

Why MPLSH: Multi-probe strategy explores nearby buckets, achieving higher recall with fewer hash tables than standard LSH, reducing memory usage.

Use Case 5: Approximate KNN with Angular Distance

import smile.neighbor.RandomProjectionForest;

// Best choice for angular/cosine similarity in high dimensions
double[][] data = loadEmbeddings(); // e.g., word embeddings
int numTrees = 10;
int leafSize = 15;
boolean angular = true;

RandomProjectionForest index = RandomProjectionForest.of(data, numTrees, leafSize, angular);

Why RandomProjectionForest: Native support for angular distance. Parallel tree construction. Built-in toGraph() method for constructing k-NN graphs. Note: does not support range search.

Use Case 6: Approximate String Matching

import smile.neighbor.BKTree;
import smile.math.distance.EditDistance;

// Best choice for dictionary lookup with edit distance threshold
String[] dictionary = loadDictionary();
BKTree<String, String> index = BKTree.of(dictionary, new EditDistance(true));

Why BKTree: Specifically designed for discrete (integer-valued) metrics. Efficiently prunes search space using the triangle inequality on integer distances. Note: supports only range search, not KNN.

Use Case 7: Small Dataset or Correctness Baseline

import smile.neighbor.LinearSearch;
import smile.math.MathEx;

// Brute force -- simple, correct, no build overhead
double[][] data = loadSmallDataset(); // n < 1000
LinearSearch<double[], double[]> index = LinearSearch.of(data, MathEx::distance);

Why LinearSearch: Zero build time, guaranteed exact results, accepts any distance function. Uses parallel streams internally for multi-core acceleration. The correct choice when n is small or as a validation baseline.

Structure Comparison Table

Criterion	KDTree	CoverTree	LSH	MPLSH	RPForest	BKTree	LinearSearch
Build Time	O(n log n)	O(c^6 n log n)	O(nL)	O(nL)	O(nT log n)	O(n)	O(1)
Query Time	O(n^(1-1/d))	O(c^12 log n)	O(L)	O(L)	O(T log n)	Data-dep.	O(n)
Space	O(n)	O(n)	O(nL)	O(nL)	O(nT)	O(n)	O(n)
Exact?	Yes	Yes	No	No	No	Yes	Yes
Incremental?	No	No	Yes	Yes	No	Yes	No
Max Practical d	~20	Metric-dep.	Unlimited	Unlimited	Unlimited	N/A	Unlimited

Source

base/src/main/java/smile/neighbor/
KDTree.java (414 lines), CoverTree.java (696 lines), LSH.java (292 lines), MPLSH.java (290 lines)
RandomProjectionForest.java (173 lines), BKTree.java (258 lines), LinearSearch.java (212 lines)

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment