Principle:Lance format Lance Inverted Index Building

Knowledge Sources	Lance Lance Docs
Domains	Information_Retrieval, Full_Text_Search
Last Updated	2026-02-08 19:00 GMT

Overview

Inverted index building is the process of scanning all text data in a column, tokenizing it, and constructing a persistent on-disk inverted index that maps tokens to the documents (rows) containing them.

Description

An inverted index is the foundational data structure for full-text search. It maintains a mapping from each unique token to a posting list -- an ordered sequence of (row_id, frequency) pairs indicating which rows contain the token and how many times. Lance additionally stores per-document token counts and corpus-level statistics needed for BM25 scoring at query time.

The build process in Lance follows a sharded, partitioned architecture designed for parallelism and bounded memory usage. Data is streamed through the tokenization pipeline in partitions, each shard processes its portion independently, and then partitions are merged into larger segments controlled by size thresholds.

Usage

Build an inverted index whenever you want to enable full-text search on a text column. The index must be built after the data is written to the dataset. Re-indexing is required after significant data changes, or the replace flag can be set to rebuild.

Theoretical Basis

Inverted Index Structure

An inverted index consists of:

Token Dictionary -- a sorted mapping from token strings to token IDs
Posting Lists -- for each token ID, a compressed list of (row_id, term_frequency) pairs, organized in blocks of 128 elements using BitPacker4x bit-packing compression
Document Table -- per-document metadata including the total token count, used for BM25 length normalization
Corpus Statistics -- total document count and total token count across the entire indexed column

Sharded Build Process

The indexing pipeline operates as follows:

Dataset Column
    |
    v
[Stream RecordBatches]
    |
    v
[Shard across N workers]  -- N = LANCE_FTS_NUM_SHARDS (default: num CPUs)
    |
    v
[Per-shard tokenization & local inverted index build]
    |
    v
[Write partitions to temporary storage]  -- partition size = LANCE_FTS_PARTITION_SIZE MiB (default: 256)
    |
    v
[Size-based merge]  -- target size = LANCE_FTS_TARGET_SIZE MiB (default: 4096)
    |
    v
[Write final index to object store]

Environment Variables

Variable	Default	Description
`LANCE_FTS_NUM_SHARDS`	Number of compute CPUs	Number of parallel indexing workers
`LANCE_FTS_PARTITION_SIZE`	256 MiB	Maximum uncompressed partition size before flushing
`LANCE_FTS_TARGET_SIZE`	4096 MiB	Target partition size after merging

These variables allow tuning the trade-off between indexing speed, memory consumption, and final index layout.

Posting List Compression

Posting lists are stored in blocks of 128 elements (the BLOCK_SIZE constant, matching BitPacker4x::BLOCK_LEN). Within each block, row IDs and frequencies are delta-encoded and bit-packed using the BitPacker4x algorithm, which exploits SIMD instructions for fast compression and decompression. This approach achieves high compression ratios for the typically skewed distributions of term frequencies.

Build Flow in the Codebase

The internal call chain from the public API to the builder is:

Dataset::create_index(&["col"], IndexType::Inverted, name, &params, replace)
    -> CreateIndexBuilder::build()
        -> build_scalar_index()
            -> ScalarIndexPlugin::train_index()
                -> InvertedIndexPlugin::train_inverted_index(data_stream, store, params)
                    -> InvertedIndexBuilder::new(params)
                    -> InvertedIndexBuilder::update(data_stream, store)

Related Pages

Implemented By

Implementation:Lance_format_Lance_Create_Inverted_Index

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment