Principle:Haifengl Smile Nearest Neighbor Query
Overview
Nearest Neighbor Query is the core operation of the spatial index workflow: given a query point q and a pre-built spatial index, retrieve the closest data points according to a distance measure. Smile's smile.neighbor package defines two fundamental query types through the KNNSearch and RNNSearch interfaces:
- KNN (k-Nearest Neighbor) search: Find the k closest points to the query.
- RNN (Range / Radius Nearest Neighbor) search: Find all points within a given radius of the query.
These operations are the primary reason for building spatial indexes -- the index structure enables sub-linear query times that would otherwise require O(n) brute-force scanning.
Theoretical Basis
k-Nearest Neighbor Search (KNN)
Definition: Given a query point q, a dataset D, a distance function d, and an integer k, find the set S of k points from D that minimizes the maximum distance to q:
S = argmin_{S' subset D, |S'|=k} max_{x in S'} d(q, x)
Equivalently, find the k points with the smallest distances to q, returned in sorted order from nearest to farthest.
Properties:
- The result is a fixed-size array of exactly k neighbors.
- Results are sorted by distance in ascending order.
- If k > |D|, an
IllegalArgumentExceptionis thrown. - The query object itself is excluded from results when it exists in the dataset (using reference equality).
Applications:
- k-NN classification: Majority vote among the k nearest labeled points.
- k-NN regression: Average (or distance-weighted average) of the k nearest values.
- Anomaly detection: Points whose k-th nearest neighbor distance is large are outliers.
- k-NN graph construction: Connect each point to its k nearest neighbors.
- Manifold learning: Algorithms like UMAP and t-SNE use k-NN graphs as input.
Nearest Neighbor Search (1-NN)
A special case of KNN where k=1. The KNNSearch interface provides a dedicated nearest() method for this case, which has a default implementation calling search(q, 1)[0]. Some structures (notably KDTree) override this with a more efficient single-neighbor search that avoids heap allocation.
Range Nearest Neighbor Search (RNN)
Definition: Given a query point q, a dataset D, a distance function d, and a radius r, find all points within distance r:
RNN(q, r) = { x in D : d(q, x) <= r }
Properties:
- The result is a variable-size list (may be empty).
- Results are not guaranteed to be sorted (caller must sort if needed).
- The query object itself is excluded from results (reference equality).
- The radius must be positive (> 0), enforced by
IllegalArgumentException.
Applications:
- Density estimation: Count neighbors within a fixed radius.
- DBSCAN clustering: Core points are those with >= minPts neighbors within epsilon radius.
- Collision detection: Find all objects within a proximity threshold.
- Approximate string matching: Find all dictionary words within edit distance k (BKTree).
Query Behavior by Structure
Exact Structures
| Structure | KNN Algorithm | RNN Algorithm | Query Exclusion |
|---|---|---|---|
| KDTree | Branch-and-bound with heap; prunes branches where split distance exceeds current k-th best | Recursive traversal pruning branches beyond radius | Reference equality (q != keys[i])
|
| CoverTree | Top-down traversal with upper bound pruning based on maxDist of children |
Same traversal with fixed radius bound | Reference equality (key != q)
|
| LinearSearch | Parallel distance computation + heap selection | Parallel distance computation + threshold filter | Reference equality (q != keys.get(i))
|
| BKTree | Not supported | Recursive traversal using triangle inequality to prune child subtrees at distance d +/- k | Reference equality (key != q)
|
Approximate Structures
| Structure | KNN Algorithm | RNN Algorithm | Approximation Source |
|---|---|---|---|
| LSH | Hash query in all L tables, collect candidate set, compute exact distances on candidates | Same candidate set with radius filter | Only candidates in matching hash buckets are examined |
| MPLSH | Multi-probe hash (explore nearby buckets), then exact distance on candidates | Multi-probe candidate set with radius filter | Trained probe model determines which additional buckets to check |
| RPForest | Traverse all trees to leaf, union candidate sets, exact distance on candidates | Not supported | Only points sharing a leaf node with the query are candidates |
Query Exclusion Mechanism
All Smile spatial index structures use reference equality (==) to exclude the query object from search results. This is important because in many machine learning workflows, the dataset used to build the index is also used for querying (e.g., leave-one-out evaluation). The query point itself would trivially be its own nearest neighbor at distance 0, which is generally not useful.
Implications:
- When querying with an object from the indexed dataset, the query is automatically excluded.
- When querying with a new object (not in the index), no exclusion occurs.
- With
Stringobjects, JVM string pooling can cause unexpected exclusions if literal strings are used. In production, strings are typically read from storage and have distinct references.
Query Complexity Summary
| Structure | KNN Query | RNN Query | Notes |
|---|---|---|---|
| KDTree | O(n^(1-1/d) + k) expected | result|) | Degrades to O(n) in high dimensions |
| CoverTree | O(c^12 log n + k) | result|) | c = expansion constant |
| LSH | candidates|) | candidates|) | Approximate; may miss neighbors |
| MPLSH | candidates|) | candidates|) | Better recall per table than LSH |
| RPForest | candidates|) | N/A | Only KNN supported |
| BKTree | Data-dependent | Data-dependent | Effective for sparse discrete metrics |
| LinearSearch | O(n) | O(n) | Parallelized with Java streams |
Error Conditions
| Method | Condition | Exception |
|---|---|---|
search(q, k) |
k <= 0 | IllegalArgumentException("Invalid k: " + k)
|
search(q, k) |
k > dataset size | IllegalArgumentException("Neighbor array length is larger than the dataset size")
|
search(q, radius, neighbors) |
radius <= 0 | IllegalArgumentException("Invalid radius: " + radius)
|
search(q, radius, neighbors) (BKTree) |
radius is not a positive integer | IllegalArgumentException("The parameter radius has to be an integer: " + radius)
|
Workflow Position
Nearest Neighbor Query is Step 4 of the Nearest Neighbor Search workflow:
- Point Data Preparation
- Spatial Index Selection
- Spatial Index Construction
- Nearest Neighbor Query (this principle)
- Neighbor Result Processing