Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Workflow:Haifengl Smile Nearest Neighbor Search

From Leeroopedia


Knowledge Sources
Domains Machine_Learning, Data_Structures
Last Updated 2026-02-08 21:00 GMT

Overview

End-to-end process for building spatial index structures and performing efficient k-nearest neighbor (KNN) and range searches on point datasets using Smile's neighbor search module.

Description

This workflow covers the construction and use of spatial index structures in the smile.neighbor package. It starts with selecting the appropriate index structure based on data dimensionality and size: KDTree for low-dimensional exact search, CoverTree for metric space search, LSH and MPLSH for approximate high-dimensional search, RandomProjectionForest for approximate nearest neighbors, and BKTree for discrete metric spaces. Once the index is built, it supports efficient k-nearest neighbor queries and radius-based range queries.

Usage

Execute this workflow when you need to find similar items in a dataset, such as recommendation systems, anomaly detection, classification (k-NN), clustering preprocessing, or duplicate detection. Choose the index structure based on your data dimensionality and accuracy requirements.

Execution Steps

Step 1: Prepare Point Data

Organize data points as double arrays or key-value pairs where keys are the coordinate vectors and values are associated data objects. Ensure all points have the same dimensionality. For discrete metric spaces (e.g., string similarity), prepare data with an appropriate distance function.

Key considerations:

  • Data should be in double[][] format for Euclidean-space structures
  • Key-value pairs allow associating metadata with each point
  • Normalize or scale features if dimensions have different ranges
  • Consider dimensionality: KDTree works well up to ~20 dimensions

Step 2: Select the Index Structure

Choose the spatial index based on data characteristics. For low-dimensional dense data, use KDTree. For general metric spaces, use CoverTree. For high-dimensional approximate search, use LSH, MPLSH, or RandomProjectionForest. For discrete metrics like edit distance, use BKTree.

Key considerations:

  • KDTree: Exact search, best for dimensions < 20, O(log n) average query
  • CoverTree: Exact search in any metric space, good for intrinsically low-dimensional data
  • LSH: Approximate search for high dimensions using locality-sensitive hashing
  • MPLSH: Multi-probe LSH improves recall over basic LSH with fewer hash tables
  • RandomProjectionForest: Approximate search using random projection trees
  • BKTree: For discrete metric spaces (e.g., Hamming, edit distance)
  • LinearSearch: Brute-force baseline for correctness verification

Step 3: Build the Index

Construct the spatial index by inserting all data points. The construction algorithm partitions the space (KDTree), builds a hierarchical cover (CoverTree), or creates hash tables (LSH). Construction is typically O(n log n) for tree structures.

Key considerations:

  • KDTree construction splits on the dimension with greatest spread
  • CoverTree construction is based on a hierarchical nesting property
  • LSH requires specifying the hash table size, number of hash functions, and bucket width
  • MPLSH requires a training step to learn optimal probe sequences
  • RandomProjectionForest requires specifying the number of trees and leaf size

Step 4: Execute Queries

Perform k-nearest neighbor (knn) or range (rnn) searches against the built index. KNN queries return the k closest points to a query point. Range queries return all points within a specified distance radius.

Key considerations:

  • knn(q, k) returns the k nearest neighbors of query point q
  • search(q, radius) returns all points within the given radius
  • By default, the query object itself is excluded from results (by reference equality)
  • Results are returned as Neighbor objects containing the point, key, index, and distance
  • For approximate methods (LSH), results may not include all true nearest neighbors

Step 5: Process Results

Extract and use the query results. Neighbor objects provide the matched point, its associated data, the index in the original dataset, and the computed distance. Use results for classification voting, anomaly scoring, or similarity analysis.

Key considerations:

  • Neighbor.index provides the position in the original dataset
  • Neighbor.distance provides the computed distance
  • Neighbor.key provides the coordinate vector
  • Neighbor.value provides the associated data object
  • For k-NN classification, use majority voting among the k neighbors

Execution Diagram

GitHub URL

Workflow Repository