Principle:Rapidsai Cuml Nearest Neighbor Search

Knowledge Sources	Cover and Hart 1967 - Nearest Neighbor Pattern Classification Altman 1992 - An Introduction to Kernel and Nearest-Neighbor Nonparametric Regression cuML Docs
Domains	Machine_Learning, Classification, Regression
Last Updated	2026-02-08 12:00 GMT

Overview

Nearest neighbor search is a non-parametric technique that classifies or regresses a query point by examining the labels or values of the k closest training points in feature space, using distance metrics to define proximity across dense and sparse data representations.

Description

The k-Nearest Neighbors (kNN) algorithm is one of the simplest and most intuitive machine learning methods. It stores the entire training dataset and defers all computation to prediction time (lazy learning). When a new query point arrives, the algorithm:

Computes the distance from the query point to every training point (or uses an index structure to find candidates efficiently).
Selects the k nearest training points.
Aggregates their labels (majority vote for classification) or values (mean or weighted mean for regression) to produce the prediction.

Brute Force Search: The simplest approach computes all pairwise distances between query points and training points. For n training points and m query points in d dimensions, this requires O(n * m * d) operations. On GPU hardware, this computation maps naturally to highly parallel matrix operations (the distance matrix is an n-by-m matrix), making brute force search competitive with more sophisticated index structures for moderate dataset sizes.

Distance Metrics: The choice of distance metric fundamentally affects which neighbors are considered "nearest." Common metrics include:

Euclidean (L2): The straight-line distance, $\sqrt{\sum_{j} (x_{j} - y_{j})^{2}}$ . The default choice for continuous features on similar scales.
Manhattan (L1): The sum of absolute differences, $\sum_{j} | x_{j} - y_{j} |$ . More robust to outliers and high-dimensional data than Euclidean.
Minkowski (Lp): A generalization parameterized by p, with L1 (p=1) and L2 (p=2) as special cases.
Cosine: Measures the angle between vectors rather than their magnitude, suitable for text and high-dimensional sparse data.

Sparse Data Support: Many real-world datasets (text corpora, user-item matrices, biological feature vectors) are naturally sparse, where most feature values are zero. Sparse kNN implementations store only non-zero values and compute distances using sparse linear algebra operations, avoiding the O(n * m * d) cost of dense computation when d is large but most entries are zero.

Usage

Nearest neighbor search is the right choice when:

No assumptions about the functional form of the decision boundary are appropriate (non-parametric setting).
The dataset is small to moderate and prediction latency is acceptable.
Local patterns in the data are more informative than global structure.
A simple, interpretable baseline is needed before exploring more complex models.
The task involves multi-class classification where class boundaries are complex or overlapping.

kNN is less suitable when the dataset is very large (query time scales linearly with dataset size in brute force mode), when features are on very different scales (requires preprocessing), or when the feature space is very high-dimensional (the curse of dimensionality degrades distance-based methods).

Theoretical Basis

kNN Classification:

$\hat{y} = mode {y_{i} : x_{i} \in N_{k} (x_{q})}$

where $N_{k} (x_{q})$ is the set of k nearest neighbors of query point $x_{q}$ .

kNN Regression:

$\hat{y} = \frac{1}{k} \sum_{x_{i} \in N_{k} (x_{q})} y_{i}$

or with distance weighting:

$\hat{y} = \frac{\sum_{x_{i} \in N_{k} (x_{q})} w_{i} \cdot y_{i}}{\sum_{x_{i} \in N_{k} (x_{q})} w_{i}}, w_{i} = \frac{1}{d (x_{q}, x_{i})}$

Brute Force Distance Matrix Computation:

Given training set X (n x d) and query set Q (m x d):

For Euclidean distance:
    D^2 = ||Q||^2 * 1^T + 1 * ||X||^2^T - 2 * Q * X^T
    D = sqrt(D^2)

This decomposes into:
    1. Row norms of Q: O(m * d)         (GPU parallel)
    2. Row norms of X: O(n * d)         (GPU parallel)
    3. Matrix product Q * X^T: O(n * m * d)  (GPU GEMM)
    4. Element-wise sqrt: O(n * m)      (GPU parallel)

For each query q in Q:
    neighbors[q] = indices of k smallest values in D[q, :]
    (GPU top-k selection or partial sort)

Sparse Distance Computation:

For sparse vectors x and y with non-zero index sets I_x and I_y:
    Euclidean: d(x, y) = sqrt(sum_{j in I_x union I_y} (x_j - y_j)^2)
    Cosine:    d(x, y) = 1 - (x . y) / (||x|| * ||y||)

Only overlapping and non-zero indices need to be visited,
reducing computation from O(d) to O(|I_x| + |I_y|) per pair.

Related Pages

Implemented By

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment