Principle: Smile Neighbor Result Processing
Overview
Neighbor Result Processing is the final step of the nearest neighbor search workflow: extracting, interpreting, and applying the results returned by KNN or RNN queries. Each result is encapsulated in a Neighbor record that contains the neighbor's key, associated value, index in the original dataset, and distance from the query point. Understanding how to process these results is essential for building downstream algorithms such as k-NN classification, density estimation, anomaly detection, and graph construction.
Theoretical Basis
The Neighbor Record as an Information Carrier
Each Neighbor<K, V> record carries four pieces of information:
| Field | Type | Semantic Meaning |
|---|---|---|
| key | K | The spatial key of the neighbor (e.g., its coordinate vector double[]). This is the object used for distance computation. |
| value | V | The data payload associated with the neighbor. May be the same object as the key, or may carry additional information (e.g., a class label, a full data record). |
| index | int | The position of this neighbor in the original dataset. Enables cross-referencing with external data structures (label arrays, metadata tables). |
| distance | double | The computed distance from the query point to this neighbor. For approximate methods, this is the exact distance (computed during the candidate verification step), not an approximation. |
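To make the four fields concrete, here is a minimal sketch of the record's shape, including the Comparable behavior described later in this section. This is an illustrative stand-in, not Smile's actual source:

```java
// Illustrative sketch of the shape of a Neighbor record (not Smile's source).
// Holds the spatial key, the data payload, the dataset index, and the exact
// distance from the query point; sorts by distance, then by index.
record Neighbor<K, V>(K key, V value, int index, double distance)
        implements Comparable<Neighbor<K, V>> {
    @Override
    public int compareTo(Neighbor<K, V> o) {
        int d = Double.compare(distance, o.distance); // primary: distance ascending
        return d != 0 ? d : Integer.compare(index, o.index); // secondary: index ascending
    }
}
```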
Downstream Algorithms
k-NN Classification (Majority Vote)
Given k neighbors with associated class labels, predict the query's class as the most frequent label among the neighbors:
predicted_class = mode({ neighbor[i].value for i in 0..k-1 })
This is the simplest and most common use of KNN results. Ties can be broken by distance (prefer the class of the closer neighbor) or by reducing k.
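A sketch of majority voting with the distance-based tiebreak. For simplicity it takes the neighbors' labels and distances as parallel arrays (a stand-in for iterating over a Neighbor[] result); the class and method names are illustrative, not part of Smile's API:

```java
import java.util.HashMap;
import java.util.Map;

// Majority vote over k neighbor labels; ties go to the class whose
// representative neighbor is closest to the query.
class MajorityVote {
    static int predict(int[] labels, double[] distances) {
        Map<Integer, Integer> votes = new HashMap<>();
        Map<Integer, Double> closest = new HashMap<>(); // nearest distance seen per class
        for (int i = 0; i < labels.length; i++) {
            votes.merge(labels[i], 1, Integer::sum);
            closest.merge(labels[i], distances[i], Math::min);
        }
        int best = labels[0];
        for (int c : votes.keySet()) {
            int cmp = Integer.compare(votes.get(c), votes.get(best));
            // Strictly more votes wins; on a vote tie, prefer the closer class.
            if (cmp > 0 || (cmp == 0 && closest.get(c) < closest.get(best))) best = c;
        }
        return best;
    }
}
```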
Weighted k-NN (Inverse Distance Weighting)
Assign weights inversely proportional to distance, giving closer neighbors more influence:
weight[i] = 1 / (distance[i] + epsilon)
predicted_class = argmax_c sum({ weight[i] : neighbor[i].value == c })
The epsilon term prevents division by zero when a neighbor has zero distance. This weighting scheme reduces the sensitivity to the choice of k.
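The same idea in code, again over parallel label/distance arrays (illustrative names, not Smile's API): each neighbor contributes 1/(distance + epsilon) to its class score, and the highest-scoring class wins.

```java
import java.util.HashMap;
import java.util.Map;

// Inverse-distance-weighted vote: closer neighbors contribute larger weights.
class WeightedVote {
    static int predict(int[] labels, double[] distances, double eps) {
        Map<Integer, Double> score = new HashMap<>();
        for (int i = 0; i < labels.length; i++) {
            score.merge(labels[i], 1.0 / (distances[i] + eps), Double::sum);
        }
        int best = labels[0];
        for (Map.Entry<Integer, Double> e : score.entrySet()) {
            if (e.getValue() > score.get(best)) best = e.getKey();
        }
        return best;
    }
}
```

Note how a single very close neighbor can outweigh several distant ones, which is exactly the reduced sensitivity to k mentioned above.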
k-NN Regression
For continuous target values, predict the query's value as the (optionally weighted) average of the k nearest neighbors' values:
predicted_value = sum({ neighbor[i].value * weight[i] }) / sum({ weight[i] })
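A direct translation of this formula, with inverse-distance weights (pass eps = 0 and uniform distances for a plain average). Names are illustrative:

```java
// Weighted k-NN regression: inverse-distance-weighted mean of neighbor targets.
class KnnRegression {
    static double predict(double[] targets, double[] distances, double eps) {
        double num = 0.0, den = 0.0;
        for (int i = 0; i < targets.length; i++) {
            double w = 1.0 / (distances[i] + eps);
            num += w * targets[i];   // sum of weight[i] * value[i]
            den += w;                // sum of weight[i]
        }
        return num / den;
    }
}
```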
Local Density Estimation
The distance to the k-th nearest neighbor (the farthest in a KNN result) provides an estimate of the local density around the query point:
density estimate ~ k / volume(ball of radius = distance[k-1])
Points with large k-th neighbor distances are in sparse regions and may be anomalies.
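A sketch of this estimator. The standard k-NN density estimate also divides by the dataset size n to yield a proper probability density; the proportionality above drops that constant factor. The ball-volume recurrence V_d = V_(d-2) * 2*pi/d avoids needing a gamma function:

```java
// k-NN density estimate: density ~ k / (n * V_d(r_k)), where r_k is the
// distance to the k-th nearest neighbor (distance[k-1] in a KNN result).
class KnnDensity {
    // Volume of a d-dimensional ball of radius r, via V_d = V_(d-2) * 2*pi/d.
    static double ballVolume(int d, double r) {
        double v = (d % 2 == 0) ? 1.0 : 2.0; // unit-ball volumes: V_0 = 1, V_1 = 2
        for (int i = (d % 2 == 0) ? 2 : 3; i <= d; i += 2) v *= 2.0 * Math.PI / i;
        return v * Math.pow(r, d);
    }

    static double estimate(int k, int n, int d, double kthDistance) {
        return k / (n * ballVolume(d, kthDistance));
    }
}
```

A point whose k-th neighbor distance is twice that of its peers has, in d dimensions, a density estimate 2^d times smaller, which is why this quantity is a useful anomaly score.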
k-Nearest Neighbor Graph
By performing KNN search for every point in the dataset, you can construct a k-NN graph where edges connect each point to its k nearest neighbors. The index field of each neighbor is used to create the graph adjacency structure.
Smile's RandomProjectionForest provides a built-in toGraph() method that constructs this graph efficiently without per-point queries.
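The per-point construction can be sketched as follows. Brute-force search stands in for index queries here; with a spatial index, each adjacency row would instead come from the index fields of search(q, k). This is a generic illustration, not RandomProjectionForest.toGraph():

```java
import java.util.Arrays;
import java.util.Comparator;

// Sketch: k-NN graph as an adjacency array. Row i holds the dataset indices
// of point i's k nearest neighbors, excluding the point itself.
class KnnGraph {
    static int[][] build(double[][] points, int k) {
        int n = points.length;
        int[][] adj = new int[n][k];
        for (int i = 0; i < n; i++) {
            final int q = i;
            Integer[] order = new Integer[n];
            for (int j = 0; j < n; j++) order[j] = j;
            // Sort all indices by distance to the query point.
            Arrays.sort(order, Comparator.comparingDouble(j -> dist(points[q], points[j])));
            int m = 0;
            for (int j = 0; j < n && m < k; j++) {
                if (order[j] != q) adj[i][m++] = order[j]; // skip the query point itself
            }
        }
        return adj;
    }

    static double dist(double[] a, double[] b) {
        double s = 0.0;
        for (int t = 0; t < a.length; t++) s += (a[t] - b[t]) * (a[t] - b[t]);
        return Math.sqrt(s);
    }
}
```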
Result Ordering and Sorting
KNN Results
Results from KNNSearch.search(q, k) are returned as a Neighbor[] array sorted by distance in ascending order (nearest first). The ordering is guaranteed by all implementations -- they either sort the heap output or use a sorted selection.
RNN Results
Results from RNNSearch.search(q, radius, neighbors) are appended to the provided list and are not sorted. The caller must explicitly sort if ordered results are needed:
Collections.sort(results); // sorts by distance (ascending) then by index
The Neighbor record implements Comparable, sorting primarily by distance and secondarily by index (to provide stable ordering for duplicate distances).
Comparison Semantics
The Neighbor.compareTo() method sorts by:
- Primary: distance (ascending) via Double.compare(distance, o.distance)
- Secondary: index (ascending) via Integer.compare(index, o.index)
This secondary sort ensures deterministic ordering when the dataset contains duplicate points at the same distance.
Result Set Properties
| Query Type | Result Type | Size | Sorted | Contains Query |
|---|---|---|---|---|
| nearest(q) | Neighbor&lt;K,V&gt; | 1 | N/A | No (excluded by reference equality) |
| search(q, k) | Neighbor&lt;K,V&gt;[] | Exactly k | Yes (ascending distance) | No (excluded by reference equality) |
| search(q, r, list) | appended to List&lt;Neighbor&lt;K,V&gt;&gt; | Variable (0 to n) | No (must sort manually) | No (excluded by reference equality) |
Special Cases
Empty Results
- KNN: Always returns exactly k results if k <= dataset size. Cannot be empty (the dataset must have at least k points different from the query).
- RNN: May return zero results if no points are within the radius. The caller should check results.isEmpty().
- LSH nearest(): May return null if no candidates are found in any hash bucket.
Approximate Results
For approximate structures (LSH, MPLSH, RPForest):
- The distance field is the exact distance (computed during candidate verification), not an approximation.
- The result set may be incomplete -- some true nearest neighbors may be missing because they were not in any candidate bucket.
- The returned neighbors are the true nearest among the candidate set, but may not be the globally nearest.
Duplicate Distances
When multiple points have the same distance to the query (common with discrete metrics or coincident points), the Comparable implementation uses the index field as a tiebreaker, ensuring stable and deterministic ordering.
Workflow Position
Neighbor Result Processing is Step 5 (final step) of the Nearest Neighbor Search workflow:
1. Point Data Preparation
2. Spatial Index Selection
3. Spatial Index Construction
4. Nearest Neighbor Query
5. Neighbor Result Processing (this principle)