Workflow:Haifengl Smile Nearest Neighbor Search

Knowledge Sources	Smile Smile Docs
Domains	Machine_Learning, Data_Structures
Last Updated	2026-02-08 21:00 GMT

Overview

End-to-end process for building spatial index structures and performing efficient k-nearest neighbor (KNN) and range searches on point datasets using Smile's neighbor search module.

Description

This workflow covers the construction and use of spatial index structures in the smile.neighbor package. It starts with selecting the appropriate index structure based on data dimensionality and size: KDTree for low-dimensional exact search, CoverTree for metric space search, LSH and MPLSH for approximate high-dimensional search, RandomProjectionForest for approximate nearest neighbors, and BKTree for discrete metric spaces. Once the index is built, it supports efficient k-nearest neighbor queries and radius-based range queries.

Usage

Execute this workflow when you need to find similar items in a dataset, such as recommendation systems, anomaly detection, classification (k-NN), clustering preprocessing, or duplicate detection. Choose the index structure based on your data dimensionality and accuracy requirements.

Execution Steps

Step 1: Prepare Point Data

Organize data points as double arrays or key-value pairs where keys are the coordinate vectors and values are associated data objects. Ensure all points have the same dimensionality. For discrete metric spaces (e.g., string similarity), prepare data with an appropriate distance function.

Key considerations:

Data should be in double[][] format for Euclidean-space structures
Key-value pairs allow associating metadata with each point
Normalize or scale features if dimensions have different ranges
Consider dimensionality: KDTree works well up to ~20 dimensions

Step 2: Select the Index Structure

Choose the spatial index based on data characteristics. For low-dimensional dense data, use KDTree. For general metric spaces, use CoverTree. For high-dimensional approximate search, use LSH, MPLSH, or RandomProjectionForest. For discrete metrics like edit distance, use BKTree.

Key considerations:

KDTree: Exact search, best for dimensions < 20, O(log n) average query
CoverTree: Exact search in any metric space, good for intrinsically low-dimensional data
LSH: Approximate search for high dimensions using locality-sensitive hashing
MPLSH: Multi-probe LSH improves recall over basic LSH with fewer hash tables
RandomProjectionForest: Approximate search using random projection trees
BKTree: For discrete metric spaces (e.g., Hamming, edit distance)
LinearSearch: Brute-force baseline for correctness verification

Step 3: Build the Index

Construct the spatial index by inserting all data points. The construction algorithm partitions the space (KDTree), builds a hierarchical cover (CoverTree), or creates hash tables (LSH). Construction is typically O(n log n) for tree structures.

Key considerations:

KDTree construction splits on the dimension with greatest spread
CoverTree construction is based on a hierarchical nesting property
LSH requires specifying the hash table size, number of hash functions, and bucket width
MPLSH requires a training step to learn optimal probe sequences
RandomProjectionForest requires specifying the number of trees and leaf size

Step 4: Execute Queries

Perform k-nearest neighbor (knn) or range (rnn) searches against the built index. KNN queries return the k closest points to a query point. Range queries return all points within a specified distance radius.

Key considerations:

knn(q, k) returns the k nearest neighbors of query point q
search(q, radius) returns all points within the given radius
By default, the query object itself is excluded from results (by reference equality)
Results are returned as Neighbor objects containing the point, key, index, and distance
For approximate methods (LSH), results may not include all true nearest neighbors

Step 5: Process Results

Extract and use the query results. Neighbor objects provide the matched point, its associated data, the index in the original dataset, and the computed distance. Use results for classification voting, anomaly scoring, or similarity analysis.

Key considerations:

Neighbor.index provides the position in the original dataset
Neighbor.distance provides the computed distance
Neighbor.key provides the coordinate vector
Neighbor.value provides the associated data object
For k-NN classification, use majority voting among the k neighbors

Execution Diagram

GitHub URL

Workflow Repository