Principle:Lance format Lance Full Text Query Execution
| Knowledge Sources | |
|---|---|
| Domains | Information_Retrieval, Full_Text_Search |
| Last Updated | 2026-02-08 19:00 GMT |
Overview
Full-text query execution is the process of taking a structured text query, matching it against an inverted index, scoring the matched documents using BM25, and returning ranked results as a stream of (row_id, score) pairs.
Description
Lance supports a rich query algebra for full-text search that goes beyond simple keyword matching. The query system is built around the FtsQuery enum, which supports leaf queries (match and phrase), compound queries (boolean, boost, multi-match), and composition of these into arbitrarily complex search expressions. Each query type is compiled into a DataFusion execution plan node that efficiently traverses the inverted index.
All queries produce an output schema of {_rowid: UInt64, _score: Float32}, where the score is computed using the BM25 ranking function. An optional limit parameter controls how many top results are returned, and the WAND (Weak AND) algorithm is used for efficient top-k retrieval.
Usage
Execute a full-text query whenever you need to search text data that has been indexed with an inverted index. The query is passed to the Scanner via the full_text_search method, which integrates the FTS results with the standard scan pipeline.
Theoretical Basis
BM25 Scoring
Lance uses the BM25 (Best Matching 25) ranking function, the industry standard for full-text relevance scoring. The score for a document d given a query q with terms t_1, t_2, ..., t_n is:
BM25(d, q) = SUM over t_i of [ IDF(t_i) * (f(t_i, d) * (K1 + 1)) / (f(t_i, d) + K1 * (1 - B + B * |d| / avgdl)) ]
Where:
- IDF(t) = ln((N - n(t) + 0.5) / (n(t) + 0.5) + 1), the inverse document frequency
- f(t, d) = frequency of term t in document d
- |d| = number of tokens in document d
- avgdl = average document length across the corpus
- N = total number of documents
- n(t) = number of documents containing term t
- K1 = 1.2 (term frequency saturation parameter)
- B = 0.75 (document length normalization parameter)
The scoring is decomposed into two components in the code:
- query_weight(token) = IDF(token)
- doc_weight(freq, doc_tokens) = (K1 + 1) * freq / (freq + K1 * (1 - B + B * doc_tokens / avgdl))
- score(token, freq, doc_tokens) = query_weight * doc_weight
WAND Algorithm
The WAND (Weak AND) algorithm is used for efficient top-k retrieval. Rather than computing scores for all matching documents, WAND maintains a threshold score and skips documents that cannot possibly exceed it. The wand_factor parameter (default 1.0) controls the aggressiveness of this pruning. Values above 1.0 increase speed at the potential cost of recall.
Query Types
| Query Type | Description | Key Parameters |
|---|---|---|
| Match | Tokenizes query text and matches individual terms | terms, column, fuzziness, boost, operator (AND/OR), prefix_length
|
| Phrase | Matches exact ordered sequences of tokens | terms, column, slop (allowed distance between tokens)
|
| Boost | Combines a positive query with a negative demotion query | positive query, negative query, negative_boost factor (default 0.5)
|
| MultiMatch | Runs the same match query across multiple columns | List of MatchQuery instances, each targeting a different column
|
| Boolean | Combines queries with must/should/must_not semantics | must (all required), should (at least one), must_not (excluded)
|
Fuzzy Matching
Match queries support fuzzy matching via the fuzziness parameter, which specifies the maximum edit distance (Levenshtein distance). When set to None, fuzziness is determined automatically:
- 0 for terms with length <= 2
- 1 for terms with length <= 5
- 2 for terms with length > 5
The max_expansions parameter (default: 50) limits how many fuzzy term variants are expanded during the search.
DataFusion Execution Plans
Each query type compiles to a dedicated DataFusion ExecutionPlan node:
MatchQueryExec-- for Match queriesPhraseQueryExec-- for Phrase queriesBoostQueryExec-- for Boost queriesBooleanQueryExec-- for Boolean queriesFlatMatchQueryExec-- for flat (non-indexed) match queries used in hybrid search pipelines
These plan nodes integrate with the DataFusion query engine, enabling the optimizer to compose FTS operations with other relational operations like projection, filtering, and joining.