Principle:Rapidsai Cuml Nearest Neighbor Search
| Knowledge Sources | |
|---|---|
| Domains | Machine_Learning, Classification, Regression |
| Last Updated | 2026-02-08 12:00 GMT |
Overview
Nearest neighbor search is a non-parametric technique that classifies or regresses a query point by examining the labels or values of the k closest training points in feature space, using distance metrics to define proximity across dense and sparse data representations.
Description
The k-Nearest Neighbors (kNN) algorithm is one of the simplest and most intuitive machine learning methods. It stores the entire training dataset and defers all computation to prediction time (lazy learning). When a new query point arrives, the algorithm:
- Computes the distance from the query point to every training point (or uses an index structure to find candidates efficiently).
- Selects the k nearest training points.
- Aggregates their labels (majority vote for classification) or values (mean or weighted mean for regression) to produce the prediction.
Brute Force Search: The simplest approach computes all pairwise distances between query points and training points. For n training points and m query points in d dimensions, this requires O(n * m * d) operations. On GPU hardware, this computation maps naturally to highly parallel matrix operations (the distance matrix is an n-by-m matrix), making brute force search competitive with more sophisticated index structures for moderate dataset sizes.
Distance Metrics: The choice of distance metric fundamentally affects which neighbors are considered "nearest." Common metrics include:
- Euclidean (L2): The straight-line distance, . The default choice for continuous features on similar scales.
- Manhattan (L1): The sum of absolute differences, . More robust to outliers and high-dimensional data than Euclidean.
- Minkowski (Lp): A generalization parameterized by p, with L1 (p=1) and L2 (p=2) as special cases.
- Cosine: Measures the angle between vectors rather than their magnitude, suitable for text and high-dimensional sparse data.
Sparse Data Support: Many real-world datasets (text corpora, user-item matrices, biological feature vectors) are naturally sparse, where most feature values are zero. Sparse kNN implementations store only non-zero values and compute distances using sparse linear algebra operations, avoiding the O(n * m * d) cost of dense computation when d is large but most entries are zero.
Usage
Nearest neighbor search is the right choice when:
- No assumptions about the functional form of the decision boundary are appropriate (non-parametric setting).
- The dataset is small to moderate and prediction latency is acceptable.
- Local patterns in the data are more informative than global structure.
- A simple, interpretable baseline is needed before exploring more complex models.
- The task involves multi-class classification where class boundaries are complex or overlapping.
kNN is less suitable when the dataset is very large (query time scales linearly with dataset size in brute force mode), when features are on very different scales (requires preprocessing), or when the feature space is very high-dimensional (the curse of dimensionality degrades distance-based methods).
Theoretical Basis
kNN Classification:
where is the set of k nearest neighbors of query point .
kNN Regression:
or with distance weighting:
Brute Force Distance Matrix Computation:
Given training set X (n x d) and query set Q (m x d):
For Euclidean distance:
D^2 = ||Q||^2 * 1^T + 1 * ||X||^2^T - 2 * Q * X^T
D = sqrt(D^2)
This decomposes into:
1. Row norms of Q: O(m * d) (GPU parallel)
2. Row norms of X: O(n * d) (GPU parallel)
3. Matrix product Q * X^T: O(n * m * d) (GPU GEMM)
4. Element-wise sqrt: O(n * m) (GPU parallel)
For each query q in Q:
neighbors[q] = indices of k smallest values in D[q, :]
(GPU top-k selection or partial sort)
Sparse Distance Computation:
For sparse vectors x and y with non-zero index sets I_x and I_y:
Euclidean: d(x, y) = sqrt(sum_{j in I_x union I_y} (x_j - y_j)^2)
Cosine: d(x, y) = 1 - (x . y) / (||x|| * ||y||)
Only overlapping and non-zero indices need to be visited,
reducing computation from O(d) to O(|I_x| + |I_y|) per pair.