Jump to content

Connect Leeroopedia MCP: Equip your AI agents to search best practices, build plans, verify code, diagnose failures, and look up hyperparameter defaults.

Principle:Lance format Lance Inverted Index Building

From Leeroopedia


Knowledge Sources
Domains Information_Retrieval, Full_Text_Search
Last Updated 2026-02-08 19:00 GMT

Overview

Inverted index building is the process of scanning all text data in a column, tokenizing it, and constructing a persistent on-disk inverted index that maps tokens to the documents (rows) containing them.

Description

An inverted index is the foundational data structure for full-text search. It maintains a mapping from each unique token to a posting list -- an ordered sequence of (row_id, frequency) pairs indicating which rows contain the token and how many times. Lance additionally stores per-document token counts and corpus-level statistics needed for BM25 scoring at query time.

The build process in Lance follows a sharded, partitioned architecture designed for parallelism and bounded memory usage. Data is streamed through the tokenization pipeline in partitions, each shard processes its portion independently, and then partitions are merged into larger segments controlled by size thresholds.

Usage

Build an inverted index whenever you want to enable full-text search on a text column. The index must be built after the data is written to the dataset. Re-indexing is required after significant data changes, or the replace flag can be set to rebuild.

Theoretical Basis

Inverted Index Structure

An inverted index consists of:

  1. Token Dictionary -- a sorted mapping from token strings to token IDs
  2. Posting Lists -- for each token ID, a compressed list of (row_id, term_frequency) pairs, organized in blocks of 128 elements using BitPacker4x bit-packing compression
  3. Document Table -- per-document metadata including the total token count, used for BM25 length normalization
  4. Corpus Statistics -- total document count and total token count across the entire indexed column

Sharded Build Process

The indexing pipeline operates as follows:

Dataset Column
    |
    v
[Stream RecordBatches]
    |
    v
[Shard across N workers]  -- N = LANCE_FTS_NUM_SHARDS (default: num CPUs)
    |
    v
[Per-shard tokenization & local inverted index build]
    |
    v
[Write partitions to temporary storage]  -- partition size = LANCE_FTS_PARTITION_SIZE MiB (default: 256)
    |
    v
[Size-based merge]  -- target size = LANCE_FTS_TARGET_SIZE MiB (default: 4096)
    |
    v
[Write final index to object store]

Environment Variables

Variable Default Description
LANCE_FTS_NUM_SHARDS Number of compute CPUs Number of parallel indexing workers
LANCE_FTS_PARTITION_SIZE 256 MiB Maximum uncompressed partition size before flushing
LANCE_FTS_TARGET_SIZE 4096 MiB Target partition size after merging

These variables allow tuning the trade-off between indexing speed, memory consumption, and final index layout.

Posting List Compression

Posting lists are stored in blocks of 128 elements (the BLOCK_SIZE constant, matching BitPacker4x::BLOCK_LEN). Within each block, row IDs and frequencies are delta-encoded and bit-packed using the BitPacker4x algorithm, which exploits SIMD instructions for fast compression and decompression. This approach achieves high compression ratios for the typically skewed distributions of term frequencies.

Build Flow in the Codebase

The internal call chain from the public API to the builder is:

Dataset::create_index(&["col"], IndexType::Inverted, name, &params, replace)
    -> CreateIndexBuilder::build()
        -> build_scalar_index()
            -> ScalarIndexPlugin::train_index()
                -> InvertedIndexPlugin::train_inverted_index(data_stream, store, params)
                    -> InvertedIndexBuilder::new(params)
                    -> InvertedIndexBuilder::update(data_stream, store)

Related Pages

Implemented By

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment