Workflow:Lance format Lance Full Text Search

From Leeroopedia
Revision as of 11:02, 16 February 2026 by Admin (talk | contribs) (Auto-imported from workflows/Lance_format_Lance_Full_Text_Search.md)


Knowledge Sources
Domains: Information_Retrieval, Full_Text_Search, ML_Ops
Last Updated: 2026-02-08 19:00 GMT

Overview

End-to-end process for building inverted indices on text columns and performing full-text search queries with optional hybrid vector-text retrieval in a Lance dataset.

Description

This workflow covers the full-text search pipeline from tokenizer configuration through inverted index construction to query execution. Lance supports inverted indices with configurable tokenizers (including language-specific stemmers and n-gram tokenizers), BM25-based relevance scoring, and boolean query composition. Full-text search can be combined with vector similarity search for hybrid retrieval that leverages both semantic and lexical matching.

Usage

Execute this workflow when you have text data (documents, product descriptions, log messages) and need keyword-based search, when you want to combine lexical matching with vector similarity for hybrid retrieval, or when you need to filter datasets by text content patterns (e.g., log analysis, content moderation).

Execution Steps

Step 1: Text Data Preparation

Ensure the dataset contains one or more string columns suitable for full-text indexing. Text columns should contain the raw text content to be searched. Consider whether additional preprocessing (normalization, language detection) is needed before indexing.

Key considerations:

  • Text columns must be of string type (Utf8 or LargeUtf8)
  • Multiple text columns can each have their own inverted index
  • Consider text length distribution; very long documents may benefit from chunking
  • Ensure text encoding is consistent (UTF-8)
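
The preprocessing and chunking considerations above can be sketched in plain Python. This is a minimal illustration using only the standard library, not part of Lance: the helper names `normalize_text` and `chunk_words`, and the default chunk size and overlap, are assumptions chosen for the example.

```python
import unicodedata


def normalize_text(text: str) -> str:
    """Apply Unicode NFC normalization and collapse runs of whitespace."""
    return " ".join(unicodedata.normalize("NFC", text).split())


def chunk_words(text: str, max_words: int = 200, overlap: int = 20) -> list[str]:
    """Split text into word windows of at most max_words, sharing `overlap` words.

    Hypothetical helper: the window size and overlap are illustrative defaults.
    """
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must satisfy 0 <= overlap < max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last window already reaches the end of the text
    return chunks
```

Each chunk would then be stored as its own row in the text column, so that a match points at a passage rather than at a whole long document.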

Step 2: Tokenizer Configuration

Select and configure the tokenizer that will split text into searchable terms. Lance supports multiple tokenizer types: the default tokenizer for general English text, language-specific tokenizers with stemming support, and n-gram tokenizers for substring matching. Tokenizer configuration determines how text is broken into indexable terms.

Key considerations:

  • The default tokenizer suits general English text and applies basic normalization such as lowercasing
  • Language-specific tokenizers apply stemming (e.g., "running" -> "run")
  • N-gram tokenizers enable partial (substring) matching and can support approximate matching of near-miss terms
  • Tokenizer choice significantly impacts both index size and query quality
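
To make the difference between word-level and n-gram tokenization concrete, here is a small standard-library sketch. It does not reproduce Lance's tokenizers: `simple_tokenize` and `char_ngrams` are hypothetical names, and keeping tokens shorter than `n` whole is a choice made for this example only. Stemming is left out because a realistic stemmer is beyond a short sketch.

```python
import re


def simple_tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into runs of word characters."""
    return re.findall(r"\w+", text.lower())


def char_ngrams(token: str, n: int = 3) -> list[str]:
    """Return the character n-grams of a token.

    Tokens shorter than n are kept whole (an assumption of this sketch).
    """
    if len(token) < n:
        return [token]
    return [token[i:i + n] for i in range(len(token) - n + 1)]


def ngram_tokenize(text: str, n: int = 3) -> list[str]:
    """Tokenize into words, then expand every word into its n-grams."""
    return [gram for tok in simple_tokenize(text) for gram in char_ngrams(tok, n)]
```

An n-gram index produces several terms per word, which is why the tokenizer choice affects index size as well as which queries can match.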

Step 3: Inverted Index Building

Build the inverted index on the selected text column. The index construction process tokenizes all text values, builds the term-to-document mapping (posting lists), computes term frequency and document frequency statistics for BM25 scoring, and writes the index structure to storage.

Key considerations:

  • Index building reads all text data in the column
  • Posting lists are compressed for storage efficiency
  • BM25 statistics (IDF, average document length) are computed during build
  • The index is recorded in the dataset manifest as a new version
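
The build process described above (tokenize, build posting lists, collect statistics) can be illustrated with a toy in-memory index. This is a conceptual sketch only: the `InvertedIndex` class and `build_index` function are hypothetical, nothing is compressed or written to storage, and Lance's on-disk layout is not shown here.

```python
import re
from collections import Counter
from dataclasses import dataclass, field


def tokenize(text: str) -> list[str]:
    """Stand-in tokenizer: lowercase, split on non-word characters."""
    return re.findall(r"\w+", text.lower())


@dataclass
class InvertedIndex:
    # term -> {doc_id: term frequency in that document}
    postings: dict[str, dict[int, int]] = field(default_factory=dict)
    # doc_id -> number of tokens in the document
    doc_lengths: dict[int, int] = field(default_factory=dict)

    @property
    def num_docs(self) -> int:
        return len(self.doc_lengths)

    @property
    def avg_doc_length(self) -> float:
        if not self.doc_lengths:
            return 0.0
        return sum(self.doc_lengths.values()) / len(self.doc_lengths)

    def doc_freq(self, term: str) -> int:
        """Number of documents that contain the term."""
        return len(self.postings.get(term, {}))


def build_index(docs: list[str]) -> InvertedIndex:
    """Read every document once and record postings plus length statistics."""
    index = InvertedIndex()
    for doc_id, text in enumerate(docs):
        tokens = tokenize(text)
        index.doc_lengths[doc_id] = len(tokens)
        for term, tf in Counter(tokens).items():
            index.postings.setdefault(term, {})[doc_id] = tf
    return index
```

The document frequency and average document length collected here are exactly the inputs that BM25 scoring needs at query time.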

Step 4: Full-Text Query Execution

Execute full-text search queries using keyword terms, boolean operators, or phrase matching. The search engine tokenizes the query using the same tokenizer as the index, looks up matching documents in the posting lists, computes BM25 relevance scores, and returns ranked results.

Key considerations:

  • Queries are tokenized with the same tokenizer used for indexing
  • Boolean queries support AND, OR, and NOT operators
  • Phrase queries match exact term sequences
  • Results are ranked by BM25 score by default
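
As a rough illustration of BM25 ranking, the sketch below scores a small list of documents against a keyword query. It is not Lance's implementation: the function name `bm25_search` is hypothetical, the parameters `k1=1.2` and `b=0.75` are commonly used defaults rather than values confirmed for Lance, and the IDF formula is the widely used non-negative variant ln(1 + (N - df + 0.5) / (df + 0.5)).

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Stand-in tokenizer shared by indexing and querying."""
    return re.findall(r"\w+", text.lower())


def bm25_search(docs: list[str], query: str, k1: float = 1.2,
                b: float = 0.75, limit: int = 10) -> list[tuple[int, float]]:
    """Return (doc_id, score) pairs sorted by descending BM25 score."""
    doc_tokens = [tokenize(d) for d in docs]
    n_docs = len(docs)
    if n_docs == 0:
        return []
    avgdl = sum(len(t) for t in doc_tokens) / n_docs
    tfs = [Counter(t) for t in doc_tokens]
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tf in tfs for term in tf)

    scores: dict[int, float] = {}
    for term in set(tokenize(query)):  # same tokenizer as for the documents
        if term not in df:
            continue
        idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
        for doc_id, tf in enumerate(tfs):
            f = tf.get(term, 0)
            if f == 0:
                continue
            # Length normalization: longer documents are penalized via b.
            norm = f + k1 * (1 - b + b * len(doc_tokens[doc_id]) / avgdl)
            scores[doc_id] = scores.get(doc_id, 0.0) + idf * f * (k1 + 1) / norm
    # Ties are broken by doc_id so the ordering is deterministic.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:limit]
```

This sketch treats the query as an OR over its terms; AND, NOT, and phrase semantics would add filtering on top of the same posting-list lookups.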

Step 5: Hybrid Search Composition

Combine full-text search with vector similarity search to leverage both lexical and semantic matching. Hybrid queries execute both search modalities and merge results using configurable fusion strategies. This exploits the complementary strengths of keyword matching (precise matching of exact terms) and vector search (semantic similarity).

Key considerations:

  • Both vector index and inverted index must exist on the dataset
  • Results from each modality are scored and merged
  • Hybrid search can improve recall compared to either modality alone, since each may retrieve documents the other misses
  • Apply SQL filters to further narrow hybrid search results
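
One common way to merge two ranked result lists is reciprocal rank fusion (RRF). The sketch below is offered as an example of a fusion strategy, not as the strategy Lance necessarily applies: the function name `reciprocal_rank_fusion` is hypothetical, and `k=60` is the constant conventionally used with RRF.

```python
from typing import Optional


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60,
                           limit: Optional[int] = None) -> list[tuple[str, float]]:
    """Fuse ranked lists of document ids with reciprocal rank fusion.

    Each document receives 1 / (k + rank) from every list it appears in,
    with rank starting at 1; the contributions are summed.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Ties are broken by document id so the ordering is deterministic.
    fused = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return fused if limit is None else fused[:limit]


# Illustrative inputs: ids ranked by a full-text query and by a vector query.
fts_ranking = ["d1", "d2", "d3"]
vector_ranking = ["d3", "d1", "d4"]
fused = reciprocal_rank_fusion([fts_ranking, vector_ranking])
```

RRF uses only ranks, so it avoids having to put BM25 scores and vector distances on a common scale; documents found by both modalities rise to the top, while documents found by only one are still retained.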
