Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Principle:Haifengl Smile Nearest Neighbor Query

From Leeroopedia


Overview

Nearest Neighbor Query is the core operation of the spatial index workflow: given a query point q and a pre-built spatial index, retrieve the closest data points according to a distance measure. Smile's smile.neighbor package defines two fundamental query types through the KNNSearch and RNNSearch interfaces:

  • KNN (k-Nearest Neighbor) search: Find the k closest points to the query.
  • RNN (Range / Radius Nearest Neighbor) search: Find all points within a given radius of the query.

These operations are the primary reason for building spatial indexes -- the index structure enables sub-linear query times that would otherwise require O(n) brute-force scanning.

Theoretical Basis

k-Nearest Neighbor Search (KNN)

Definition: Given a query point q, a dataset D, a distance function d, and an integer k, find the set S of k points from D that minimizes the maximum distance to q:

S = argmin_{S' subset D, |S'|=k} max_{x in S'} d(q, x)

Equivalently, find the k points with the smallest distances to q, returned in sorted order from nearest to farthest.

Properties:

  • The result is a fixed-size array of exactly k neighbors.
  • Results are sorted by distance in ascending order.
  • If k > |D|, an IllegalArgumentException is thrown.
  • The query object itself is excluded from results when it exists in the dataset (using reference equality).

Applications:

  • k-NN classification: Majority vote among the k nearest labeled points.
  • k-NN regression: Average (or distance-weighted average) of the k nearest values.
  • Anomaly detection: Points whose k-th nearest neighbor distance is large are outliers.
  • k-NN graph construction: Connect each point to its k nearest neighbors.
  • Manifold learning: Algorithms like UMAP and t-SNE use k-NN graphs as input.

Nearest Neighbor Search (1-NN)

A special case of KNN where k=1. The KNNSearch interface provides a dedicated nearest() method for this case, which has a default implementation calling search(q, 1)[0]. Some structures (notably KDTree) override this with a more efficient single-neighbor search that avoids heap allocation.

Range Nearest Neighbor Search (RNN)

Definition: Given a query point q, a dataset D, a distance function d, and a radius r, find all points within distance r:

RNN(q, r) = { x in D : d(q, x) <= r }

Properties:

  • The result is a variable-size list (may be empty).
  • Results are not guaranteed to be sorted (caller must sort if needed).
  • The query object itself is excluded from results (reference equality).
  • The radius must be positive (> 0), enforced by IllegalArgumentException.

Applications:

  • Density estimation: Count neighbors within a fixed radius.
  • DBSCAN clustering: Core points are those with >= minPts neighbors within epsilon radius.
  • Collision detection: Find all objects within a proximity threshold.
  • Approximate string matching: Find all dictionary words within edit distance k (BKTree).

Query Behavior by Structure

Exact Structures

Structure KNN Algorithm RNN Algorithm Query Exclusion
KDTree Branch-and-bound with heap; prunes branches where split distance exceeds current k-th best Recursive traversal pruning branches beyond radius Reference equality (q != keys[i])
CoverTree Top-down traversal with upper bound pruning based on maxDist of children Same traversal with fixed radius bound Reference equality (key != q)
LinearSearch Parallel distance computation + heap selection Parallel distance computation + threshold filter Reference equality (q != keys.get(i))
BKTree Not supported Recursive traversal using triangle inequality to prune child subtrees at distance d +/- k Reference equality (key != q)

Approximate Structures

Structure KNN Algorithm RNN Algorithm Approximation Source
LSH Hash query in all L tables, collect candidate set, compute exact distances on candidates Same candidate set with radius filter Only candidates in matching hash buckets are examined
MPLSH Multi-probe hash (explore nearby buckets), then exact distance on candidates Multi-probe candidate set with radius filter Trained probe model determines which additional buckets to check
RPForest Traverse all trees to leaf, union candidate sets, exact distance on candidates Not supported Only points sharing a leaf node with the query are candidates

Query Exclusion Mechanism

All Smile spatial index structures use reference equality (==) to exclude the query object from search results. This is important because in many machine learning workflows, the dataset used to build the index is also used for querying (e.g., leave-one-out evaluation). The query point itself would trivially be its own nearest neighbor at distance 0, which is generally not useful.

Implications:

  • When querying with an object from the indexed dataset, the query is automatically excluded.
  • When querying with a new object (not in the index), no exclusion occurs.
  • With String objects, JVM string pooling can cause unexpected exclusions if literal strings are used. In production, strings are typically read from storage and have distinct references.

Query Complexity Summary

Structure KNN Query RNN Query Notes
KDTree O(n^(1-1/d) + k) expected result|) Degrades to O(n) in high dimensions
CoverTree O(c^12 log n + k) result|) c = expansion constant
LSH candidates|) candidates|) Approximate; may miss neighbors
MPLSH candidates|) candidates|) Better recall per table than LSH
RPForest candidates|) N/A Only KNN supported
BKTree Data-dependent Data-dependent Effective for sparse discrete metrics
LinearSearch O(n) O(n) Parallelized with Java streams

Error Conditions

Method Condition Exception
search(q, k) k <= 0 IllegalArgumentException("Invalid k: " + k)
search(q, k) k > dataset size IllegalArgumentException("Neighbor array length is larger than the dataset size")
search(q, radius, neighbors) radius <= 0 IllegalArgumentException("Invalid radius: " + radius)
search(q, radius, neighbors) (BKTree) radius is not a positive integer IllegalArgumentException("The parameter radius has to be an integer: " + radius)

Workflow Position

Nearest Neighbor Query is Step 4 of the Nearest Neighbor Search workflow:

  1. Point Data Preparation
  2. Spatial Index Selection
  3. Spatial Index Construction
  4. Nearest Neighbor Query (this principle)
  5. Neighbor Result Processing

Related

Categories

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment