Principle:Lance format Lance Full Text Query Execution

Knowledge Sources	Lance Lance Docs
Domains	Information_Retrieval, Full_Text_Search
Last Updated	2026-02-08 19:00 GMT

Overview

Full-text query execution is the process of taking a structured text query, matching it against an inverted index, scoring the matched documents using BM25, and returning ranked results as a stream of (row_id, score) pairs.

Description

Lance supports a rich query algebra for full-text search that goes beyond simple keyword matching. The query system is built around the FtsQuery enum, which supports leaf queries (match and phrase), compound queries (boolean, boost, multi-match), and composition of these into arbitrarily complex search expressions. Each query type is compiled into a DataFusion execution plan node that efficiently traverses the inverted index.

All queries produce an output schema of {_rowid: UInt64, _score: Float32}, where the score is computed using the BM25 ranking function. An optional limit parameter controls how many top results are returned, and the WAND (Weak AND) algorithm is used for efficient top-k retrieval.

Usage

Execute a full-text query whenever you need to search text data that has been indexed with an inverted index. The query is passed to the Scanner via the full_text_search method, which integrates the FTS results with the standard scan pipeline.

Theoretical Basis

BM25 Scoring

Lance uses the BM25 (Best Matching 25) ranking function, the industry standard for full-text relevance scoring. The score for a document d given a query q with terms t_1, t_2, ..., t_n is:

BM25(d, q) = SUM over t_i of [ IDF(t_i) * (f(t_i, d) * (K1 + 1)) / (f(t_i, d) + K1 * (1 - B + B * |d| / avgdl)) ]

Where:

IDF(t) = ln((N - n(t) + 0.5) / (n(t) + 0.5) + 1), the inverse document frequency
f(t, d) = frequency of term t in document d
|d| = number of tokens in document d
avgdl = average document length across the corpus
N = total number of documents
n(t) = number of documents containing term t
K1 = 1.2 (term frequency saturation parameter)
B = 0.75 (document length normalization parameter)

The scoring is decomposed into two components in the code:

query_weight(token) = IDF(token)
doc_weight(freq, doc_tokens) = (K1 + 1) * freq / (freq + K1 * (1 - B + B * doc_tokens / avgdl))
score(token, freq, doc_tokens) = query_weight * doc_weight

WAND Algorithm

The WAND (Weak AND) algorithm is used for efficient top-k retrieval. Rather than computing scores for all matching documents, WAND maintains a threshold score and skips documents that cannot possibly exceed it. The wand_factor parameter (default 1.0) controls the aggressiveness of this pruning. Values above 1.0 increase speed at the potential cost of recall.

Query Types

Query Type	Description	Key Parameters
Match	Tokenizes query text and matches individual terms	`terms`, `column`, `fuzziness`, `boost`, `operator` (AND/OR), `prefix_length`
Phrase	Matches exact ordered sequences of tokens	`terms`, `column`, `slop` (allowed distance between tokens)
Boost	Combines a positive query with a negative demotion query	`positive` query, `negative` query, `negative_boost` factor (default 0.5)
MultiMatch	Runs the same match query across multiple columns	List of `MatchQuery` instances, each targeting a different column
Boolean	Combines queries with must/should/must_not semantics	`must` (all required), `should` (at least one), `must_not` (excluded)

Fuzzy Matching

Match queries support fuzzy matching via the fuzziness parameter, which specifies the maximum edit distance (Levenshtein distance). When set to None, fuzziness is determined automatically:

0 for terms with length <= 2
1 for terms with length <= 5
2 for terms with length > 5

The max_expansions parameter (default: 50) limits how many fuzzy term variants are expanded during the search.

DataFusion Execution Plans

Each query type compiles to a dedicated DataFusion ExecutionPlan node:

MatchQueryExec -- for Match queries
PhraseQueryExec -- for Phrase queries
BoostQueryExec -- for Boost queries
BooleanQueryExec -- for Boolean queries
FlatMatchQueryExec -- for flat (non-indexed) match queries used in hybrid search pipelines

These plan nodes integrate with the DataFusion query engine, enabling the optimizer to compose FTS operations with other relational operations like projection, filtering, and joining.

Related Pages

Implemented By

Implementation:Lance_format_Lance_Scanner_Full_Text_Search

Uses Heuristic

Heuristic:Lance_format_Lance_BM25_FTS_Configuration

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment