Jump to content

Connect Leeroopedia MCP: Equip your AI agents to search best practices, build plans, verify code, diagnose failures, and look up hyperparameter defaults.

Principle:Lance format Lance Full Text Query Execution

From Leeroopedia


Knowledge Sources
Domains Information_Retrieval, Full_Text_Search
Last Updated 2026-02-08 19:00 GMT

Overview

Full-text query execution is the process of taking a structured text query, matching it against an inverted index, scoring the matched documents using BM25, and returning ranked results as a stream of (row_id, score) pairs.

Description

Lance supports a rich query algebra for full-text search that goes beyond simple keyword matching. The query system is built around the FtsQuery enum, which supports leaf queries (match and phrase), compound queries (boolean, boost, multi-match), and composition of these into arbitrarily complex search expressions. Each query type is compiled into a DataFusion execution plan node that efficiently traverses the inverted index.

All queries produce an output schema of {_rowid: UInt64, _score: Float32}, where the score is computed using the BM25 ranking function. An optional limit parameter controls how many top results are returned, and the WAND (Weak AND) algorithm is used for efficient top-k retrieval.

Usage

Execute a full-text query whenever you need to search text data that has been indexed with an inverted index. The query is passed to the Scanner via the full_text_search method, which integrates the FTS results with the standard scan pipeline.

Theoretical Basis

BM25 Scoring

Lance uses the BM25 (Best Matching 25) ranking function, the industry standard for full-text relevance scoring. The score for a document d given a query q with terms t_1, t_2, ..., t_n is:

BM25(d, q) = SUM over t_i of [ IDF(t_i) * (f(t_i, d) * (K1 + 1)) / (f(t_i, d) + K1 * (1 - B + B * |d| / avgdl)) ]

Where:

  • IDF(t) = ln((N - n(t) + 0.5) / (n(t) + 0.5) + 1), the inverse document frequency
  • f(t, d) = frequency of term t in document d
  • |d| = number of tokens in document d
  • avgdl = average document length across the corpus
  • N = total number of documents
  • n(t) = number of documents containing term t
  • K1 = 1.2 (term frequency saturation parameter)
  • B = 0.75 (document length normalization parameter)

The scoring is decomposed into two components in the code:

  • query_weight(token) = IDF(token)
  • doc_weight(freq, doc_tokens) = (K1 + 1) * freq / (freq + K1 * (1 - B + B * doc_tokens / avgdl))
  • score(token, freq, doc_tokens) = query_weight * doc_weight

WAND Algorithm

The WAND (Weak AND) algorithm is used for efficient top-k retrieval. Rather than computing scores for all matching documents, WAND maintains a threshold score and skips documents that cannot possibly exceed it. The wand_factor parameter (default 1.0) controls the aggressiveness of this pruning. Values above 1.0 increase speed at the potential cost of recall.

Query Types

Query Type Description Key Parameters
Match Tokenizes query text and matches individual terms terms, column, fuzziness, boost, operator (AND/OR), prefix_length
Phrase Matches exact ordered sequences of tokens terms, column, slop (allowed distance between tokens)
Boost Combines a positive query with a negative demotion query positive query, negative query, negative_boost factor (default 0.5)
MultiMatch Runs the same match query across multiple columns List of MatchQuery instances, each targeting a different column
Boolean Combines queries with must/should/must_not semantics must (all required), should (at least one), must_not (excluded)

Fuzzy Matching

Match queries support fuzzy matching via the fuzziness parameter, which specifies the maximum edit distance (Levenshtein distance). When set to None, fuzziness is determined automatically:

  • 0 for terms with length <= 2
  • 1 for terms with length <= 5
  • 2 for terms with length > 5

The max_expansions parameter (default: 50) limits how many fuzzy term variants are expanded during the search.

DataFusion Execution Plans

Each query type compiles to a dedicated DataFusion ExecutionPlan node:

  • MatchQueryExec -- for Match queries
  • PhraseQueryExec -- for Phrase queries
  • BoostQueryExec -- for Boost queries
  • BooleanQueryExec -- for Boolean queries
  • FlatMatchQueryExec -- for flat (non-indexed) match queries used in hybrid search pipelines

These plan nodes integrate with the DataFusion query engine, enabling the optimizer to compose FTS operations with other relational operations like projection, filtering, and joining.

Related Pages

Implemented By

Uses Heuristic

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment