
Principle:Lance format Approximate Nearest Neighbor Search

From Leeroopedia


Knowledge Sources
Domains Vector_Search, Query_Execution
Last Updated 2026-02-08 19:00 GMT

Overview

Approximate Nearest Neighbor (ANN) Search is the process of finding the top-k vectors in a dataset that are most similar to a given query vector, using an index to avoid exhaustive comparison.

Description

Given a query vector q and a dataset of n vectors, exact nearest neighbor search requires computing the distance between q and every vector in the dataset -- an O(n * d) operation where d is the vector dimensionality. For datasets with millions or billions of vectors, this is prohibitively expensive.
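The exact baseline described above can be sketched in a few lines of plain Python (illustrative names only, not Lance's implementation): every vector is scanned and scored, which is exactly the O(n * d) cost that ANN search avoids.

```python
# Exact (brute-force) nearest neighbor search: O(n * d) per query.
# Pure-Python sketch for illustration; not Lance's implementation.
def l2_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def exact_knn(query, vectors, k):
    """Scan every vector, compute its distance, keep the k smallest."""
    scored = [(l2_distance(query, v), i) for i, v in enumerate(vectors)]
    scored.sort()
    return scored[:k]

vectors = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [0.1, 0.0]]
print(exact_knn([0.0, 0.0], vectors, 2))  # the two closest rows
```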

ANN search trades a small amount of accuracy (recall) for dramatically faster query times. Lance's ANN search leverages the vector index built during the indexing phase to narrow the search space. The search is exposed through the Scanner API, which integrates ANN search with Lance's general-purpose query engine.

The Scanner's nearest method initiates a vector search, and optional methods configure the search behavior:

  • nprobes -- controls how many IVF partitions are searched (more probes = higher recall, higher latency)
  • ef -- controls the HNSW search beam width (higher ef = higher recall, higher latency)
  • refine_factor -- re-ranks candidates using original (uncompressed) vectors for better accuracy
  • distance_metric -- overrides the index's default distance metric at query time

The result stream includes all projected columns plus a special _distance column containing the computed distance from each result to the query vector.

Usage

Use ANN search when:

  • You need to find similar items (text, images, audio) in a large vector dataset.
  • Exact brute-force search is too slow for your latency requirements.
  • You have already built a vector index on the dataset.
  • You want to combine vector similarity with column projections and filters.

Theoretical Basis

ANN Search with IVF

For IVF-based indices with K partitions, the search proceeds as:

  1. Compute the distance from q to all K partition centroids: O(K * d)
  2. Select the nprobes closest centroids
  3. Search only the vectors in those partitions: O(nprobes * n/K * d)
  4. Return the top-k results across all searched partitions

The expected recall depends on the ratio nprobes/K. Setting nprobes = K searches every partition, giving exhaustive accuracy (up to quantization error) but no speedup; typical values range from nprobes = 1 up to about sqrt(K).
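The IVF procedure above can be sketched in pure Python (illustrative names only): rank centroids, probe the closest partitions, and take a global top-k over just those vectors.

```python
# Minimal IVF search sketch (pure Python, illustrative only).
def l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def ivf_search(query, centroids, partitions, nprobes, k):
    """partitions[i] is a list of (row_id, vector) assigned to centroid i."""
    # Steps 1-2: rank centroids by distance, keep the nprobes closest.
    probe = sorted(range(len(centroids)),
                   key=lambda i: l2(query, centroids[i]))[:nprobes]
    # Step 3: scan only the probed partitions.
    candidates = [(l2(query, v), rid)
                  for i in probe for rid, v in partitions[i]]
    # Step 4: global top-k across the probed partitions.
    return sorted(candidates)[:k]

centroids = [[0.0, 0.0], [10.0, 10.0]]
partitions = [
    [(0, [0.1, 0.0]), (1, [0.0, 0.2])],
    [(2, [9.9, 10.0]), (3, [10.0, 10.5])],
]
print(ivf_search([0.0, 0.0], centroids, partitions, nprobes=1, k=2))
```

With nprobes=1, only the first partition is scanned: half the dataset is skipped entirely.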

ANN Search with HNSW

For HNSW-based indices, within each IVF partition:

  1. Enter the graph at the top layer
  2. At each layer, greedily traverse to the nearest neighbor of q
  3. At the bottom layer, maintain a dynamic candidate list of size ef
  4. Return the k closest candidates from the list

The search complexity is approximately O(ef * log(n/K)) per partition, where K is the number of IVF partitions (so n/K is the expected partition size) and ef controls the trade-off between recall and latency.
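The bottom-layer step above is essentially a best-first graph traversal with a bounded candidate list. The following single-layer sketch (pure Python, illustrative only; real HNSW also descends through upper layers) shows how the ef-sized list drives the search:

```python
# Simplified single-layer HNSW-style search: greedy best-first traversal
# keeping a candidate list of size ef. Illustrative, not Lance's code.
import heapq

def l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def graph_search(query, vectors, neighbors, entry, ef, k):
    """neighbors[i] lists the graph edges of node i (bottom layer only)."""
    visited = {entry}
    cand = [(l2(query, vectors[entry]), entry)]   # min-heap frontier
    best = [(-cand[0][0], entry)]                 # max-heap of ef results
    while cand:
        d, node = heapq.heappop(cand)
        if d > -best[0][0] and len(best) >= ef:
            break  # nearest frontier node is worse than the worst kept result
        for nb in neighbors[node]:
            if nb in visited:
                continue
            visited.add(nb)
            dn = l2(query, vectors[nb])
            if len(best) < ef or dn < -best[0][0]:
                heapq.heappush(cand, (dn, nb))
                heapq.heappush(best, (-dn, nb))
                if len(best) > ef:
                    heapq.heappop(best)  # evict the worst candidate
    return sorted((-d, n) for d, n in best)[:k]

vectors = [[0.0], [1.0], [2.0], [3.0], [4.0]]
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
print(graph_search([3.9], vectors, neighbors, entry=0, ef=3, k=2))
```

A larger ef keeps more candidates alive during traversal, which is why it raises both recall and latency.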

Refinement

When quantization (PQ or SQ) is used, the distances computed during search are approximate. The refine step improves accuracy by:

  1. Retrieving refine_factor * k candidate results from the index
  2. Loading the original (uncompressed) vectors for these candidates
  3. Recomputing exact distances
  4. Returning the true top-k results

A refine factor of 1 still re-ranks using exact distances without fetching additional candidates. Higher factors fetch more candidates before re-ranking.
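The refine step amounts to over-fetching and re-ranking, which can be sketched as follows (pure Python, illustrative names only):

```python
# Refinement sketch: over-fetch refine_factor * k approximate candidates,
# then re-rank them with exact distances on the original vectors.
def l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def refine(query, approx_results, original_vectors, k, refine_factor):
    """approx_results: (approx_distance, row_id) pairs from the index,
    already sorted by approximate distance."""
    candidates = approx_results[: refine_factor * k]
    exact = [(l2(query, original_vectors[rid]), rid)
             for _, rid in candidates]
    return sorted(exact)[:k]

# Quantization mis-ranked rows 7 and 3; exact re-ranking repairs the order.
original = {3: [1.0, 0.0], 7: [2.0, 0.0], 9: [5.0, 0.0]}
approx = [(4.1, 7), (4.3, 3), (24.9, 9)]   # approximate, slightly wrong
print(refine([0.0, 0.0], approx, original, k=2, refine_factor=1))
```

Note how even refine_factor=1 fixes the ordering of the kept candidates, matching the behavior described above.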

Distance Metrics

Metric              | Formula                             | Use Case
L2 (Euclidean)      | sum((a_i - b_i)^2)                  | General-purpose, default for many models
Cosine              | 1 - dot(a, b) / (norm(a) * norm(b)) | Text embeddings, normalized vectors
Dot (Inner Product) | -dot(a, b)                          | Maximum inner product search, recommendation
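The three formulas translate directly to plain Python (illustrative only). Note that the dot-product metric is negated so that, like the other two, a smaller value means a closer match:

```python
# The three distance metrics from the table, as plain Python.
import math

def l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return 1 - dot / (math.sqrt(sum(x * x for x in a)) *
                      math.sqrt(sum(y * y for y in b)))

def neg_dot(a, b):
    # Negated so that smaller means more similar, like the others.
    return -sum(x * y for x, y in zip(a, b))

a, b = [1.0, 0.0], [0.0, 1.0]
print(l2(a, b), cosine(a, b))  # 2.0 1.0 (orthogonal unit vectors)
```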

Multi-vector Search

Lance also supports multi-vector queries, where the query is a list of vectors (e.g., multiple embeddings from a document). The column must be typed as DataType::List rather than FixedSizeList, and each query sub-vector must match the column's dimensionality.
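One common way to score a multi-vector query against a multi-vector row is MaxSim, the scheme used by late-interaction retrieval models such as ColBERT; the article does not specify Lance's exact multi-vector scoring, so treat this as an illustrative sketch only:

```python
# MaxSim sketch (pure Python): for each query sub-vector, take its best
# match among the row's sub-vectors, then sum. Illustrative only; the
# article does not state that Lance uses exactly this scoring.
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def maxsim(query_vectors, doc_vectors):
    """Sum over query sub-vectors of the best dot-product match."""
    return sum(max(dot(q, d) for d in doc_vectors) for q in query_vectors)

query = [[1.0, 0.0], [0.0, 1.0]]
doc = [[1.0, 0.0], [0.0, 1.0]]
print(maxsim(query, doc))  # 2.0: each query sub-vector matches perfectly
```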

Related Pages

Implemented By

Uses Heuristic
