
Principle:Lance format Lance Tokenizer Configuration

From Leeroopedia


Knowledge Sources
Domains Information_Retrieval, Full_Text_Search
Last Updated 2026-02-08 19:00 GMT

Overview

Tokenizer configuration defines how raw text is decomposed into indexable tokens through a pipeline of a base tokenizer followed by a chain of configurable token filters.

Description

Full-text search indexes operate on tokens -- discrete units extracted from source text. The quality and behavior of full-text search is fundamentally determined by the tokenization pipeline. Lance uses a pipeline architecture built on top of the Tantivy tokenizer framework, where a base tokenizer splits raw text into initial tokens, and a series of token filters transform those tokens before they are written to the inverted index.

The pipeline is fully configurable through the InvertedIndexParams struct. Every parameter has a sensible default so that InvertedIndexParams::default() produces a production-ready tokenizer for English text.

Usage

Configure tokenization whenever you need to:

  • Support a language other than English (change language, or use lindera/* / jieba/* base tokenizers for CJK)
  • Adjust precision vs. recall trade-offs (e.g., disable stemming for exact-match requirements)
  • Handle special text formats (use raw for pre-tokenized text, ngram for substring matching)
  • Optimize index size (disable position storage, adjust max token length)
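As a sketch of how these knobs are exposed from Python, the call below uses the `lancedb` `create_fts_index` API (`use_tantivy=False` selects Lance's native inverted index). Parameter names and defaults can differ between `lancedb` releases, so treat this as an assumption to verify against your installed version rather than a definitive signature.

```python
# Sketch only: parameter names follow recent lancedb Python releases;
# verify against your installed version.
import lancedb

db = lancedb.connect("/tmp/lance-demo")     # hypothetical local path
tbl = db.open_table("docs")                 # hypothetical table with a "text" column

tbl.create_fts_index(
    "text",
    use_tantivy=False,        # use Lance's native inverted index
    base_tokenizer="simple",  # simple | whitespace | raw | ngram | ...
    language="English",       # Snowball stemmer / stop-word language
    max_token_length=40,      # RemoveLongFilter threshold
    lower_case=True,          # LowerCaser
    stem=True,                # Stemmer
    remove_stop_words=True,   # StopWordFilter
    ascii_folding=True,       # AsciiFoldingFilter
)
```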

Theoretical Basis

Tokenization Pipeline Architecture

The Lance tokenization pipeline processes text through the following stages in order:

Raw Text
    |
    v
[Base Tokenizer]  -- splits text into initial tokens
    |
    v
[RemoveLongFilter] -- removes tokens exceeding max_token_length (default: 40)
    |
    v
[LowerCaser]       -- converts all tokens to lowercase (default: enabled)
    |
    v
[Stemmer]          -- reduces tokens to word stems using language-specific rules (default: enabled, English)
    |
    v
[StopWordFilter]   -- removes common words like "the", "a", "is" (default: enabled, English)
    |
    v
[AsciiFoldingFilter] -- normalizes Unicode characters to ASCII equivalents (default: enabled)
    |
    v
Indexed Tokens

Each filter in the pipeline is optional and controlled by a boolean or option parameter. Filters that are disabled are simply skipped.
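The staged design above can be modeled as a chain of functions, each consuming and producing a token list. This is an illustration of the architecture only (not Lance's code): the base tokenizer splits, then each enabled filter transforms the stream in order, and disabled filters are simply left out of the chain. The stemmer stage is omitted here for brevity.

```python
import re

# Minimal model of the pipeline: base tokenizer + ordered filter chain.
MAX_TOKEN_LENGTH = 40
STOP_WORDS = {"the", "a", "is", "at"}   # tiny illustrative list

def simple_tokenizer(text):
    # "simple" base tokenizer: split on whitespace and punctuation
    return re.findall(r"\w+", text)

def remove_long(tokens):
    # RemoveLongFilter: drop tokens over the length limit
    return [t for t in tokens if len(t) <= MAX_TOKEN_LENGTH]

def lower_caser(tokens):
    return [t.lower() for t in tokens]

def stop_word_filter(tokens):
    return [t for t in tokens if t not in STOP_WORDS]

def tokenize(text, filters=(remove_long, lower_caser, stop_word_filter)):
    tokens = simple_tokenizer(text)
    for f in filters:           # disabled filters are simply absent
        tokens = f(tokens)
    return tokens

print(tokenize("The tokenizer IS a pipeline!"))  # ['tokenizer', 'pipeline']
```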

Base Tokenizer Strategies

Name | Behavior | Best For
simple (default) | Splits on whitespace and punctuation | General-purpose English and European languages
whitespace | Splits only on whitespace | Preserving punctuation as part of tokens
raw | No splitting; entire input is one token | Pre-tokenized data or exact-match indexing
ngram | Generates character N-grams of configurable length | Substring search, autocomplete
lindera/* | Morphological analysis for Japanese/Korean/Chinese | CJK text segmentation
jieba/* | Chinese word segmentation | Chinese text segmentation

Stemming

Stemming reduces inflected words to a common base form. For example, "running" and "runs" are both reduced to "run"; irregular forms such as "ran" generally are not, because stemming applies suffix-stripping rules rather than dictionary lookup. Stemming increases recall (more documents match a query) at the cost of some precision.

Lance uses the Tantivy Stemmer which implements the Snowball stemming algorithm. The language parameter controls which Snowball rules are applied.
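To make the mechanism concrete, here is a deliberately crude suffix-stripping stemmer. This toy is not Snowball: Tantivy's Snowball implementation applies a full language-specific rule set (measure conditions, suffix precedence, recoding), while this sketch only strips a few common English suffixes.

```python
# Illustration only: crude suffix stripping to show the idea of stemming.
# Real Snowball rules are far more elaborate.
def naive_stem(token):
    for suffix in ("ing", "es", "s"):
        # keep at least a 3-character stem
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

print([naive_stem(t) for t in ["runs", "jumping", "run"]])  # ['run', 'jump', 'run']
```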

Stop Word Removal

Stop words are extremely common words that carry little semantic meaning (e.g., "the", "is", "at"). Removing them reduces index size and improves query performance. Lance supports both built-in stop word lists (per language) and custom stop word lists.
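The built-in-plus-custom behavior can be sketched as a set union over the token stream; the word list below is a tiny illustrative subset, not Lance's actual per-language list.

```python
# Illustration: stop-word removal with a built-in list plus custom additions.
ENGLISH_STOP_WORDS = {"the", "is", "at", "a", "an", "of"}  # abbreviated sample

def remove_stop_words(tokens, extra=frozenset()):
    stop = ENGLISH_STOP_WORDS | set(extra)
    return [t for t in tokens if t not in stop]

print(remove_stop_words(["the", "cat", "is", "at", "home"]))          # ['cat', 'home']
print(remove_stop_words(["acme", "widget", "manual"], extra={"acme"}))  # ['widget', 'manual']
```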

ASCII Folding

ASCII folding converts accented and special Unicode characters to their ASCII equivalents (e.g., "café" matches "cafe"). This is particularly useful for multilingual datasets or user-generated content with inconsistent diacritic usage.
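A common way to approximate this is Unicode NFKD decomposition followed by dropping combining marks, shown below. Note this is an approximation of AsciiFoldingFilter's behavior: the real filter also maps characters that NFKD leaves alone (for example ligatures like "ß").

```python
import unicodedata

# Fold accented characters to ASCII: decompose to NFKD, drop combining marks.
def ascii_fold(token):
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

print(ascii_fold("café"))   # cafe
print(ascii_fold("naïve"))  # naive
```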

N-Gram Tokenization

When the base tokenizer is set to ngram, the tokenizer generates all contiguous character substrings of length between min_ngram_length (default: 3) and max_ngram_length (default: 3). The prefix_only option restricts N-grams to prefixes of the original tokens, which is useful for autocomplete-style queries.
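The parameters described above map directly onto a small generator function; this sketch mirrors the min/max-length and prefix_only semantics, not Lance's internal implementation.

```python
# Generate character N-grams of each length in [min_len, max_len].
# With prefix_only, keep only the N-gram starting at position 0.
def ngrams(token, min_len=3, max_len=3, prefix_only=False):
    out = []
    for n in range(min_len, max_len + 1):
        if prefix_only:
            if len(token) >= n:
                out.append(token[:n])
        else:
            out.extend(token[i:i + n] for i in range(len(token) - n + 1))
    return out

print(ngrams("lance"))                    # ['lan', 'anc', 'nce']
print(ngrams("lance", prefix_only=True))  # ['lan']
```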
