Principle: Smile Neighbor Result Processing
Overview
Neighbor Result Processing is the final step of the nearest neighbor search workflow: extracting, interpreting, and applying the results returned by KNN or RNN queries. Each result is encapsulated in a Neighbor record that contains the neighbor's key, associated value, index in the original dataset, and distance from the query point. Understanding how to process these results is essential for building downstream algorithms such as k-NN classification, density estimation, anomaly detection, and graph construction.
Theoretical Basis
The Neighbor Record as an Information Carrier
Each Neighbor<K, V> record carries four pieces of information:
| Field | Type | Semantic Meaning |
|---|---|---|
| key | K | The spatial key of the neighbor (e.g., its coordinate vector double[]). This is the object used for distance computation. |
| value | V | The data payload associated with the neighbor. May be the same object as the key, or may carry additional information (e.g., a class label, a full data record). |
| index | int | The position of this neighbor in the original dataset. Enables cross-referencing with external data structures (label arrays, metadata tables). |
| distance | double | The computed distance from the query point to this neighbor. For approximate methods, this is the exact distance (computed during the candidate verification step), not an approximation. |
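To make the four fields concrete, here is a minimal sketch of the record's shape, including the Comparable behavior described later in this section. This is an illustrative stand-in, not Smile's actual source:

```java
// Illustrative sketch of the shape of a Neighbor record (not Smile's source).
// Holds the spatial key, the data payload, the dataset index, and the exact
// distance from the query point; sorts by distance, then by index.
record Neighbor<K, V>(K key, V value, int index, double distance)
        implements Comparable<Neighbor<K, V>> {
    @Override
    public int compareTo(Neighbor<K, V> o) {
        int d = Double.compare(distance, o.distance); // primary: distance ascending
        return d != 0 ? d : Integer.compare(index, o.index); // secondary: index ascending
    }
}
```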
Downstream Algorithms
k-NN Classification (Majority Vote)
Given k neighbors with associated class labels, predict the query's class as the most frequent label among the neighbors:
predicted_class = mode({ neighbor[i].value for i in 0..k-1 })
This is the simplest and most common use of KNN results. Ties can be broken by distance (prefer the class of the closer neighbor) or by reducing k.
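A sketch of majority voting with the distance-based tiebreak. For simplicity it takes the neighbors' labels and distances as parallel arrays (a stand-in for iterating over a Neighbor[] result); the class and method names are illustrative, not part of Smile's API:

```java
import java.util.HashMap;
import java.util.Map;

// Majority vote over k neighbor labels; ties go to the class whose
// representative neighbor is closest to the query.
class MajorityVote {
    static int predict(int[] labels, double[] distances) {
        Map<Integer, Integer> votes = new HashMap<>();
        Map<Integer, Double> closest = new HashMap<>(); // nearest distance seen per class
        for (int i = 0; i < labels.length; i++) {
            votes.merge(labels[i], 1, Integer::sum);
            closest.merge(labels[i], distances[i], Math::min);
        }
        int best = labels[0];
        for (int c : votes.keySet()) {
            int cmp = Integer.compare(votes.get(c), votes.get(best));
            // Strictly more votes wins; on a vote tie, prefer the closer class.
            if (cmp > 0 || (cmp == 0 && closest.get(c) < closest.get(best))) best = c;
        }
        return best;
    }
}
```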
Weighted k-NN (Inverse Distance Weighting)
Assign weights inversely proportional to distance, giving closer neighbors more influence:
weight[i] = 1 / (distance[i] + epsilon)
predicted_class = argmax_c sum({ weight[i] : neighbor[i].value == c })
The epsilon term prevents division by zero when a neighbor has zero distance. This weighting scheme reduces the sensitivity to the choice of k.
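The same idea in code, again over parallel label/distance arrays (illustrative names, not Smile's API): each neighbor contributes 1/(distance + epsilon) to its class score, and the highest-scoring class wins.

```java
import java.util.HashMap;
import java.util.Map;

// Inverse-distance-weighted vote: closer neighbors contribute larger weights.
class WeightedVote {
    static int predict(int[] labels, double[] distances, double eps) {
        Map<Integer, Double> score = new HashMap<>();
        for (int i = 0; i < labels.length; i++) {
            score.merge(labels[i], 1.0 / (distances[i] + eps), Double::sum);
        }
        int best = labels[0];
        for (Map.Entry<Integer, Double> e : score.entrySet()) {
            if (e.getValue() > score.get(best)) best = e.getKey();
        }
        return best;
    }
}
```

Note how a single very close neighbor can outweigh several distant ones, which is exactly the reduced sensitivity to k mentioned above.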
k-NN Regression
For continuous target values, predict the query's value as the (optionally weighted) average of the k nearest neighbors' values:
predicted_value = sum({ neighbor[i].value * weight[i] }) / sum({ weight[i] })
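A direct translation of this formula, with inverse-distance weights (pass eps = 0 and uniform distances for a plain average). Names are illustrative:

```java
// Weighted k-NN regression: inverse-distance-weighted mean of neighbor targets.
class KnnRegression {
    static double predict(double[] targets, double[] distances, double eps) {
        double num = 0.0, den = 0.0;
        for (int i = 0; i < targets.length; i++) {
            double w = 1.0 / (distances[i] + eps);
            num += w * targets[i];   // sum of weight[i] * value[i]
            den += w;                // sum of weight[i]
        }
        return num / den;
    }
}
```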
Local Density Estimation
The distance to the k-th nearest neighbor (the farthest in a KNN result) provides an estimate of the local density around the query point:
density estimate ~ k / volume(ball of radius = distance[k-1])
Points with large k-th neighbor distances are in sparse regions and may be anomalies.
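A sketch of this estimator. The standard k-NN density estimate also divides by the dataset size n to yield a proper probability density; the proportionality above drops that constant factor. The ball-volume recurrence V_d = V_(d-2) * 2*pi/d avoids needing a gamma function:

```java
// k-NN density estimate: density ~ k / (n * V_d(r_k)), where r_k is the
// distance to the k-th nearest neighbor (distance[k-1] in a KNN result).
class KnnDensity {
    // Volume of a d-dimensional ball of radius r, via V_d = V_(d-2) * 2*pi/d.
    static double ballVolume(int d, double r) {
        double v = (d % 2 == 0) ? 1.0 : 2.0; // unit-ball volumes: V_0 = 1, V_1 = 2
        for (int i = (d % 2 == 0) ? 2 : 3; i <= d; i += 2) v *= 2.0 * Math.PI / i;
        return v * Math.pow(r, d);
    }

    static double estimate(int k, int n, int d, double kthDistance) {
        return k / (n * ballVolume(d, kthDistance));
    }
}
```

A point whose k-th neighbor distance is twice that of its peers has, in d dimensions, a density estimate 2^d times smaller, which is why this quantity is a useful anomaly score.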
k-Nearest Neighbor Graph
By performing KNN search for every point in the dataset, you can construct a k-NN graph where edges connect each point to its k nearest neighbors. The index field of each neighbor is used to create the graph adjacency structure.
Smile's RandomProjectionForest provides a built-in toGraph() method that constructs this graph efficiently without per-point queries.
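The per-point construction can be sketched as follows. Brute-force search stands in for index queries here; with a spatial index, each adjacency row would instead come from the index fields of search(q, k). This is a generic illustration, not RandomProjectionForest.toGraph():

```java
import java.util.Arrays;
import java.util.Comparator;

// Sketch: k-NN graph as an adjacency array. Row i holds the dataset indices
// of point i's k nearest neighbors, excluding the point itself.
class KnnGraph {
    static int[][] build(double[][] points, int k) {
        int n = points.length;
        int[][] adj = new int[n][k];
        for (int i = 0; i < n; i++) {
            final int q = i;
            Integer[] order = new Integer[n];
            for (int j = 0; j < n; j++) order[j] = j;
            // Sort all indices by distance to the query point.
            Arrays.sort(order, Comparator.comparingDouble(j -> dist(points[q], points[j])));
            int m = 0;
            for (int j = 0; j < n && m < k; j++) {
                if (order[j] != q) adj[i][m++] = order[j]; // skip the query point itself
            }
        }
        return adj;
    }

    static double dist(double[] a, double[] b) {
        double s = 0.0;
        for (int t = 0; t < a.length; t++) s += (a[t] - b[t]) * (a[t] - b[t]);
        return Math.sqrt(s);
    }
}
```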
Result Ordering and Sorting
KNN Results
Results from KNNSearch.search(q, k) are returned as a Neighbor[] array sorted by distance in ascending order (nearest first). The ordering is guaranteed by all implementations -- they either sort the heap output or use a sorted selection.
RNN Results
Results from RNNSearch.search(q, radius, neighbors) are appended to the provided list and are not sorted. The caller must explicitly sort if ordered results are needed:
Collections.sort(results); // sorts by distance (ascending) then by index
The Neighbor record implements Comparable, sorting primarily by distance and secondarily by index (to provide stable ordering for duplicate distances).
Comparison Semantics
The Neighbor.compareTo() method sorts by:
- Primary: distance (ascending) via Double.compare(distance, o.distance)
- Secondary: index (ascending) via Integer.compare(index, o.index)
This secondary sort ensures deterministic ordering when the dataset contains duplicate points at the same distance.
Result Set Properties
| Query Type | Result Type | Size | Sorted | Contains Query |
|---|---|---|---|---|
| nearest(q) | Neighbor&lt;K,V&gt; | 1 | N/A | No (excluded by reference equality) |
| search(q, k) | Neighbor&lt;K,V&gt;[] | Exactly k | Yes (ascending distance) | No (excluded by reference equality) |
| search(q, r, list) | appended to List&lt;Neighbor&lt;K,V&gt;&gt; | Variable (0 to n) | No (must sort manually) | No (excluded by reference equality) |
Special Cases
Empty Results
- KNN: Always returns exactly k results if k <= dataset size. Cannot be empty (the dataset must have at least k points different from the query).
- RNN: May return zero results if no points are within the radius. The caller should check results.isEmpty().
- LSH nearest(): May return null if no candidates are found in any hash bucket.
Approximate Results
For approximate structures (LSH, MPLSH, RPForest):
- The distance field is the exact distance (computed during candidate verification), not an approximation.
- The result set may be incomplete -- some true nearest neighbors may be missing because they were not in any candidate bucket.
- The returned neighbors are the true nearest among the candidate set, but may not be the globally nearest.
Duplicate Distances
When multiple points have the same distance to the query (common with discrete metrics or coincident points), the Comparable implementation uses the index field as a tiebreaker, ensuring stable and deterministic ordering.
Workflow Position
Neighbor Result Processing is Step 5 (final step) of the Nearest Neighbor Search workflow:
1. Point Data Preparation
2. Spatial Index Selection
3. Spatial Index Construction
4. Nearest Neighbor Query
5. Neighbor Result Processing (this principle)