Workflow:Lance format Lance Full Text Search

Knowledge Sources	Lance, Lance Tokenizer Guide, FTS Index Spec, Lance Docs
Domains	Information_Retrieval, Full_Text_Search, ML_Ops
Last Updated	2026-02-08 19:00 GMT

Overview

End-to-end process for building inverted indices on text columns of a Lance dataset and running full-text search queries against them, optionally combined with vector search for hybrid retrieval.

Description

This workflow covers the full-text search pipeline from tokenizer configuration through inverted index construction to query execution. Lance supports inverted indices with configurable tokenizers (including language-specific stemmers and n-gram tokenizers), BM25-based relevance scoring, and boolean query composition. Full-text search can be combined with vector similarity search for hybrid retrieval that leverages both semantic and lexical matching.

Usage

Execute this workflow when you have text data (documents, product descriptions, log messages) and need keyword-based search, when you want to combine lexical matching with vector similarity for hybrid retrieval, or when you need to filter datasets by text content patterns (e.g., log analysis, content moderation).

Execution Steps

Step 1: Text Data Preparation

Ensure the dataset contains one or more string columns suitable for full-text indexing. Text columns should contain the raw text content to be searched. Consider whether additional preprocessing (normalization, language detection) is needed before indexing.

Key considerations:

Text columns must be of string type (Utf8 or LargeUtf8)
Multiple text columns can each have their own inverted index
Consider text length distribution; very long documents may benefit from chunking
Ensure text encoding is consistent (UTF-8)
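
The preparation points above can be sketched in plain Python. This is a minimal illustration, not part of Lance: the helper names normalize_text and chunk_words, the choice of NFC normalization, and the window sizes are all assumptions made for the example.

```python
import unicodedata


def normalize_text(text: str) -> str:
    """Apply Unicode NFC normalization and collapse runs of whitespace."""
    normalized = unicodedata.normalize("NFC", text)
    return " ".join(normalized.split())


def chunk_words(text: str, max_words: int = 200, overlap: int = 20) -> list[str]:
    """Split text into windows of up to max_words words.

    Consecutive windows share `overlap` words so that a phrase spanning
    a chunk boundary still appears whole in at least one chunk.
    """
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must satisfy 0 <= overlap < max_words")
    words = text.split()
    if not words:
        return []
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        # Stop once the current window has reached the end of the text.
        if start + max_words >= len(words):
            break
    return chunks
```

Each chunk would then be stored as its own row in the text column, so that relevance scores reflect a passage rather than an entire long document.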

Step 2: Tokenizer Configuration

Select and configure the tokenizer that will split text into searchable terms. Lance supports multiple tokenizer types: the default tokenizer for general English text, language-specific tokenizers with stemming support, and n-gram tokenizers for substring matching. Tokenizer configuration determines how text is broken into indexable terms.

Key considerations:

The default tokenizer suits general English text and applies basic normalization such as lowercasing
Language-specific tokenizers apply stemming (e.g., "running" -> "run")
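N-gram tokenizers index overlapping character sequences, which enables partial (substring) matching and fuzzy search
Tokenizer choice significantly impacts both index size and query quality (e.g., an n-gram tokenizer emits many more terms per document than a word tokenizer)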
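
To make the difference between tokenizer types concrete, here is a small sketch of a word tokenizer and a character n-gram tokenizer. These are illustrative stand-ins written for this page, not Lance's tokenizers: the function names, the regular expression, and the handling of short words are assumptions.

```python
import re

_WORD_RE = re.compile(r"\w+")


def word_tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into runs of word characters."""
    return _WORD_RE.findall(text.lower())


def ngram_tokenize(text: str, n: int = 3) -> list[str]:
    """Emit overlapping character n-grams for every word.

    Words of length n or shorter are kept whole (an assumption of this
    sketch; real tokenizers may drop or pad them instead).
    """
    grams = []
    for word in word_tokenize(text):
        if len(word) <= n:
            grams.append(word)
        else:
            grams.extend(word[i:i + n] for i in range(len(word) - n + 1))
    return grams
```

With n = 3, the single word "search" becomes four terms ("sea", "ear", "arc", "rch"), which is why a query fragment such as "arc" can match it, and also why n-gram indices tend to be larger than word-level ones.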

Step 3: Inverted Index Building

Build the inverted index on the selected text column. The index construction process tokenizes all text values, builds the term-to-document mapping (posting lists), computes term frequency and document frequency statistics for BM25 scoring, and writes the index structure to storage.

Key considerations:

Index building reads all text data in the column
Posting lists are compressed for storage efficiency
BM25 statistics (IDF, average document length) are computed during build
Creating the index commits a new dataset version whose manifest records the index
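
The build process described above can be sketched as an in-memory toy. It shows what an inverted index contains (posting lists with term frequencies, document lengths, and the BM25 statistics) but says nothing about Lance's on-disk layout or compression. The function names and the returned dictionary shape are assumptions, and the IDF shown is one common BM25 variant.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text: str) -> list[str]:
    """Toy tokenizer: lowercase, then split into runs of word characters."""
    return re.findall(r"\w+", text.lower())


def build_inverted_index(docs: list[str]) -> dict:
    """Build posting lists and BM25 statistics for a list of documents."""
    postings = defaultdict(dict)  # term -> {doc_id: term frequency}
    doc_lengths = []              # number of tokens in each document
    for doc_id, text in enumerate(docs):
        terms = tokenize(text)
        doc_lengths.append(len(terms))
        for term, tf in Counter(terms).items():
            postings[term][doc_id] = tf

    n_docs = len(docs)
    avgdl = sum(doc_lengths) / n_docs if n_docs else 0.0
    # Document frequency is the posting-list length; rarer terms get a larger IDF.
    idf = {
        term: math.log(1 + (n_docs - len(plist) + 0.5) / (len(plist) + 0.5))
        for term, plist in postings.items()
    }
    return {
        "postings": dict(postings),
        "doc_lengths": doc_lengths,
        "avgdl": avgdl,
        "idf": idf,
    }
```

Because every document must be tokenized once, build time in this sketch grows with the total amount of text, which mirrors the first consideration above.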

Step 4: Full-Text Query Execution

Execute full-text search queries using keyword terms, boolean operators, or phrase matching. The search engine tokenizes the query using the same tokenizer as the index, looks up matching documents in the posting lists, computes BM25 relevance scores, and returns ranked results.

Key considerations:

Queries are tokenized with the same tokenizer used for indexing
Boolean queries support AND, OR, and NOT operators
Phrase queries match exact term sequences
Results are ranked by BM25 score by default
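
The query path can be sketched the same way: tokenize the query with the index's tokenizer, walk the posting list of each query term, accumulate BM25 scores, and apply a NOT clause by removing documents that contain an excluded term. This is a simplified illustration under stated assumptions, not Lance's query engine. The function names, the must_not parameter, and the parameter values k1 = 1.2 and b = 0.75 are conventional choices made for this example, and phrase matching is omitted.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text: str) -> list[str]:
    """Toy tokenizer shared by indexing and querying."""
    return re.findall(r"\w+", text.lower())


def build_index(docs: list[str]):
    """Return (postings, doc_lengths), with postings as term -> {doc_id: tf}."""
    postings = defaultdict(dict)
    doc_lengths = []
    for doc_id, text in enumerate(docs):
        terms = tokenize(text)
        doc_lengths.append(len(terms))
        for term, tf in Counter(terms).items():
            postings[term][doc_id] = tf
    return postings, doc_lengths


def bm25_search(docs, query, must_not="", k1=1.2, b=0.75):
    """Rank documents matching any query term (OR) by BM25 score.

    Documents containing any term from `must_not` are dropped (NOT).
    Returns a list of (doc_id, score) pairs, best first.
    """
    postings, doc_lengths = build_index(docs)
    n_docs = len(docs)
    avgdl = sum(doc_lengths) / n_docs if n_docs else 0.0

    scores = defaultdict(float)
    for term in set(tokenize(query)):
        plist = postings.get(term, {})
        if not plist:
            continue
        df = len(plist)
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        for doc_id, tf in plist.items():
            # Length normalization: long documents need more occurrences to score the same.
            norm = k1 * (1 - b + b * doc_lengths[doc_id] / avgdl)
            scores[doc_id] += idf * tf * (k1 + 1) / (tf + norm)

    for term in tokenize(must_not):
        for doc_id in postings.get(term, {}):
            scores.pop(doc_id, None)

    return sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))
```

An AND query could be built on the same structures by intersecting the document IDs of each term's posting list before scoring.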

Step 5: Hybrid Search Composition

Combine full-text search with vector similarity search to leverage both lexical and semantic matching. Hybrid queries execute both search modalities and merge results using configurable fusion strategies. This exploits the complementary strengths of keyword matching (exact term matches) and vector search (semantic similarity).

Key considerations:

Both a vector index and an inverted index must exist on the dataset
Results from each modality are scored and merged
Hybrid search typically improves recall compared to either modality alone
Apply SQL filters to further narrow hybrid search results
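
As an illustration of how results from the two modalities might be merged, the sketch below implements reciprocal rank fusion (RRF), one widely used fusion strategy. The source does not say which strategies Lance provides, so treating RRF as the merge step is an assumption, as are the function name and the constant k = 60.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of document IDs into one ranking.

    Each document receives 1 / (k + rank) from every list it appears in
    (rank starts at 1); documents are returned by descending fused score,
    with ties broken by document ID.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical result lists: document IDs ordered best-first by each modality.
full_text_hits = [3, 1, 2]
vector_hits = [1, 4, 3]
fused = reciprocal_rank_fusion([full_text_hits, vector_hits])
```

RRF uses only rank positions, so it sidesteps the fact that BM25 scores and vector distances are on different scales. Documents found by both modalities (1 and 3 here) rise to the top, while documents found by only one still appear in the merged list.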
