Principle:Haifengl Smile Nearest Neighbor Query

Overview

Nearest Neighbor Query is the core operation of the spatial index workflow: given a query point q and a pre-built spatial index, retrieve the closest data points according to a distance measure. Smile's smile.neighbor package defines two fundamental query types through the KNNSearch and RNNSearch interfaces:

KNN (k-Nearest Neighbor) search: Find the k closest points to the query.
RNN (Range / Radius Nearest Neighbor) search: Find all points within a given radius of the query.

These operations are the primary reason for building spatial indexes -- the index structure enables sub-linear query times that would otherwise require O(n) brute-force scanning.

Theoretical Basis

k-Nearest Neighbor Search (KNN)

Definition: Given a query point q, a dataset D, a distance function d, and an integer k, find the set S of k points from D that minimizes the maximum distance to q:

S = argmin_{S' subset D, |S'|=k} max_{x in S'} d(q, x)

Equivalently, find the k points with the smallest distances to q, returned in sorted order from nearest to farthest.

Properties:

The result is a fixed-size array of exactly k neighbors.
Results are sorted by distance in ascending order.
If k > |D|, an IllegalArgumentException is thrown.
The query object itself is excluded from results when it exists in the dataset (using reference equality).

Applications:

k-NN classification: Majority vote among the k nearest labeled points.
k-NN regression: Average (or distance-weighted average) of the k nearest values.
Anomaly detection: Points whose k-th nearest neighbor distance is large are outliers.
k-NN graph construction: Connect each point to its k nearest neighbors.
Manifold learning: Algorithms like UMAP and t-SNE use k-NN graphs as input.

Nearest Neighbor Search (1-NN)

A special case of KNN where k=1. The KNNSearch interface provides a dedicated nearest() method for this case, which has a default implementation calling search(q, 1)[0]. Some structures (notably KDTree) override this with a more efficient single-neighbor search that avoids heap allocation.

Range Nearest Neighbor Search (RNN)

Definition: Given a query point q, a dataset D, a distance function d, and a radius r, find all points within distance r:

RNN(q, r) = { x in D : d(q, x) <= r }

Properties:

The result is a variable-size list (may be empty).
Results are not guaranteed to be sorted (caller must sort if needed).
The query object itself is excluded from results (reference equality).
The radius must be positive (> 0), enforced by IllegalArgumentException.

Applications:

Density estimation: Count neighbors within a fixed radius.
DBSCAN clustering: Core points are those with >= minPts neighbors within epsilon radius.
Collision detection: Find all objects within a proximity threshold.
Approximate string matching: Find all dictionary words within edit distance k (BKTree).

Query Behavior by Structure

Exact Structures

Structure	KNN Algorithm	RNN Algorithm	Query Exclusion
KDTree	Branch-and-bound with heap; prunes branches where split distance exceeds current k-th best	Recursive traversal pruning branches beyond radius	Reference equality (`q != keys[i]`)
CoverTree	Top-down traversal with upper bound pruning based on `maxDist` of children	Same traversal with fixed radius bound	Reference equality (`key != q`)
LinearSearch	Parallel distance computation + heap selection	Parallel distance computation + threshold filter	Reference equality (`q != keys.get(i)`)
BKTree	Not supported	Recursive traversal using triangle inequality to prune child subtrees at distance d +/- k	Reference equality (`key != q`)

Approximate Structures

Structure	KNN Algorithm	RNN Algorithm	Approximation Source
LSH	Hash query in all L tables, collect candidate set, compute exact distances on candidates	Same candidate set with radius filter	Only candidates in matching hash buckets are examined
MPLSH	Multi-probe hash (explore nearby buckets), then exact distance on candidates	Multi-probe candidate set with radius filter	Trained probe model determines which additional buckets to check
RPForest	Traverse all trees to leaf, union candidate sets, exact distance on candidates	Not supported	Only points sharing a leaf node with the query are candidates

Query Exclusion Mechanism

All Smile spatial index structures use reference equality (==) to exclude the query object from search results. This is important because in many machine learning workflows, the dataset used to build the index is also used for querying (e.g., leave-one-out evaluation). The query point itself would trivially be its own nearest neighbor at distance 0, which is generally not useful.

Implications:

When querying with an object from the indexed dataset, the query is automatically excluded.
When querying with a new object (not in the index), no exclusion occurs.
With String objects, JVM string pooling can cause unexpected exclusions if literal strings are used. In production, strings are typically read from storage and have distinct references.

Query Complexity Summary

Structure	KNN Query	RNN Query	Notes
KDTree	O(n^(1-1/d) + k) expected	result\|)	Degrades to O(n) in high dimensions
CoverTree	O(c^12 log n + k)	result\|)	c = expansion constant
LSH	candidates\|)	candidates\|)	Approximate; may miss neighbors
MPLSH	candidates\|)	candidates\|)	Better recall per table than LSH
RPForest	candidates\|)	N/A	Only KNN supported
BKTree	Data-dependent	Data-dependent	Effective for sparse discrete metrics
LinearSearch	O(n)	O(n)	Parallelized with Java streams

Error Conditions

Method	Condition	Exception
`search(q, k)`	k <= 0	`IllegalArgumentException("Invalid k: " + k)`
`search(q, k)`	k > dataset size	`IllegalArgumentException("Neighbor array length is larger than the dataset size")`
`search(q, radius, neighbors)`	radius <= 0	`IllegalArgumentException("Invalid radius: " + radius)`
`search(q, radius, neighbors)` (BKTree)	radius is not a positive integer	`IllegalArgumentException("The parameter radius has to be an integer: " + radius)`

Workflow Position

Nearest Neighbor Query is Step 4 of the Nearest Neighbor Search workflow:

Point Data Preparation
Spatial Index Selection
Spatial Index Construction
Nearest Neighbor Query (this principle)
Neighbor Result Processing

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment