Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Implementation:Haifengl Smile Index Structure Selection

From Leeroopedia


Overview

Index Structure Selection is a Pattern Doc documenting the decision process for choosing among Smile's seven spatial index structures. This page catalogs the available classes in the smile.neighbor package, their interface implementations, constructor signatures, and practical trade-offs to guide the selection decision described in the corresponding principle.

Available Structures

All index structures reside in the smile.neighbor package within the base module:

base/src/main/java/smile/neighbor/
    KDTree.java
    CoverTree.java
    LSH.java
    MPLSH.java
    RandomProjectionForest.java
    BKTree.java
    LinearSearch.java

Import statement for all structures:

import smile.neighbor.*;

Interface Implementation Summary

The two core interfaces that define search capabilities are KNNSearch<K, V> and RNNSearch<K, V>. Each structure implements one or both:

Class KNNSearch RNNSearch Serializable Key Type
KDTree<E> Yes Yes Yes double[]
CoverTree<K, V> Yes Yes Yes Generic K
LSH<E> Yes Yes Yes double[]
MPLSH<E> Yes Yes Yes double[]
RandomProjectionForest Yes No No double[]
BKTree<K, V> No Yes Yes Generic K
LinearSearch<K, V> Yes Yes Yes Generic K

Selection by Use Case

Use Case 1: Low-Dimensional Exact Search (Euclidean)

import smile.neighbor.KDTree;

// Best choice for d < 20, exact results, Euclidean metric
double[][] data = loadLowDimData(); // e.g., 3D spatial coordinates
KDTree<double[]> index = KDTree.of(data);

Why KDTree: O(n log n) build, O(n^(1-1/d) + k) expected query time. The binary space partitioning exploits low dimensionality for efficient pruning. As stated in the KDTree Javadoc: "if the dimensionality is D, then number of points in the dataset, N, should be N >> 2^D".

Use Case 2: Custom Metric Exact Search

import smile.neighbor.CoverTree;
import smile.math.distance.Metric;

// Best choice for custom metrics with exact results
double[][] data = loadData();
Metric<double[]> customMetric = (a, b) -> {
    // e.g., Manhattan distance, Mahalanobis, etc.
    double sum = 0;
    for (int i = 0; i < a.length; i++)
        sum += Math.abs(a[i] - b[i]);
    return sum;
};
CoverTree<double[], double[]> index = CoverTree.of(data, customMetric);

Why CoverTree: Only tree-based exact structure that accepts arbitrary Metric<K>. Query time O(c^12 log n) depends on the data's expansion constant, not ambient dimensionality.

Use Case 3: High-Dimensional Approximate Search

import smile.neighbor.LSH;

// Best choice for high-dim approximate search with Euclidean distance
double[][] data = loadHighDimData(); // e.g., 256-dim feature vectors
double w = 4.0; // projection width -- tune for your data scale
LSH<double[]> index = new LSH<>(data, data, w);

Why LSH: Hash-based structure that avoids the curse of dimensionality. Query time O(L) is independent of dataset size. The projection width w is the main tuning parameter.

Use Case 4: High-Dimensional with Memory Constraints

import smile.neighbor.MPLSH;
import smile.neighbor.LinearSearch;
import smile.math.MathEx;

// MPLSH needs fewer hash tables than standard LSH for same recall
int d = 256;                         // dimensionality
int L = 10;                          // fewer hash tables than LSH
int k = (int) Math.log10(data.length); // projections per hash
double w = 4.0;

MPLSH<double[]> index = new MPLSH<>(d, L, k, w);
for (int i = 0; i < data.length; i++) {
    index.put(data[i], data[i]);
}

// Train the multi-probe model for better recall with fewer tables
LinearSearch<double[], double[]> baseline = LinearSearch.of(data, MathEx::distance);
index.fit(baseline, sampleQueries, radius);

Why MPLSH: Multi-probe strategy explores nearby buckets, achieving higher recall with fewer hash tables than standard LSH, reducing memory usage.

Use Case 5: Approximate KNN with Angular Distance

import smile.neighbor.RandomProjectionForest;

// Best choice for angular/cosine similarity in high dimensions
double[][] data = loadEmbeddings(); // e.g., word embeddings
int numTrees = 10;
int leafSize = 15;
boolean angular = true;

RandomProjectionForest index = RandomProjectionForest.of(data, numTrees, leafSize, angular);

Why RandomProjectionForest: Native support for angular distance. Parallel tree construction. Built-in toGraph() method for constructing k-NN graphs. Note: does not support range search.

Use Case 6: Approximate String Matching

import smile.neighbor.BKTree;
import smile.math.distance.EditDistance;

// Best choice for dictionary lookup with edit distance threshold
String[] dictionary = loadDictionary();
BKTree<String, String> index = BKTree.of(dictionary, new EditDistance(true));

Why BKTree: Specifically designed for discrete (integer-valued) metrics. Efficiently prunes search space using the triangle inequality on integer distances. Note: supports only range search, not KNN.

Use Case 7: Small Dataset or Correctness Baseline

import smile.neighbor.LinearSearch;
import smile.math.MathEx;

// Brute force -- simple, correct, no build overhead
double[][] data = loadSmallDataset(); // n < 1000
LinearSearch<double[], double[]> index = LinearSearch.of(data, MathEx::distance);

Why LinearSearch: Zero build time, guaranteed exact results, accepts any distance function. Uses parallel streams internally for multi-core acceleration. The correct choice when n is small or as a validation baseline.

Structure Comparison Table

Criterion KDTree CoverTree LSH MPLSH RPForest BKTree LinearSearch
Build Time O(n log n) O(c^6 n log n) O(nL) O(nL) O(nT log n) O(n) O(1)
Query Time O(n^(1-1/d)) O(c^12 log n) O(L) O(L) O(T log n) Data-dep. O(n)
Space O(n) O(n) O(nL) O(nL) O(nT) O(n) O(n)
Exact? Yes Yes No No No Yes Yes
Incremental? No No Yes Yes No Yes No
Max Practical d ~20 Metric-dep. Unlimited Unlimited Unlimited N/A Unlimited

Source

  • base/src/main/java/smile/neighbor/
  • KDTree.java (414 lines), CoverTree.java (696 lines), LSH.java (292 lines), MPLSH.java (290 lines)
  • RandomProjectionForest.java (173 lines), BKTree.java (258 lines), LinearSearch.java (212 lines)

Related

Categories

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment