Principle (Lance Format): Lance Tokenizer Configuration
| Knowledge Sources | |
|---|---|
| Domains | Information_Retrieval, Full_Text_Search |
| Last Updated | 2026-02-08 19:00 GMT |
Overview
Tokenizer configuration defines how raw text is decomposed into indexable tokens: a base tokenizer splits the text, and a chain of configurable token filters transforms the resulting tokens.
Description
Full-text search indexes operate on tokens -- discrete units extracted from source text. The quality and behavior of full-text search is fundamentally determined by the tokenization pipeline. Lance uses a pipeline architecture built on top of the Tantivy tokenizer framework, where a base tokenizer splits raw text into initial tokens, and a series of token filters transform those tokens before they are written to the inverted index.
The pipeline is fully configurable through the InvertedIndexParams struct. Every parameter has a sensible default so that InvertedIndexParams::default() produces a production-ready tokenizer for English text.
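The parameter surface can be sketched as a plain Python dataclass. The field names below are illustrative mirrors of the documented defaults, not the actual `InvertedIndexParams` field names, which belong to the Rust API.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TokenizerConfig:
    # Illustrative mirror of the documented defaults; NOT the real
    # InvertedIndexParams definition.
    base_tokenizer: str = "simple"        # simple | whitespace | raw | ngram | lindera/* | jieba/*
    language: str = "English"             # Snowball stemmer / stop-word language
    max_token_length: Optional[int] = 40  # RemoveLongFilter threshold
    lower_case: bool = True               # LowerCaser
    stem: bool = True                     # Stemmer (Snowball rules)
    remove_stop_words: bool = True        # StopWordFilter
    ascii_folding: bool = True            # AsciiFoldingFilter

# Defaults are production-ready for English text.
default_cfg = TokenizerConfig()
# Exact-match-oriented variant: keep tokens as written.
exact_match_cfg = TokenizerConfig(stem=False, remove_stop_words=False)
```

Disabling stemming and stop-word removal, as in the second config, trades recall for precision, which matches the trade-off described under Usage.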
Usage
Configure tokenization whenever you need to:
- Support a language other than English (change `language`, or use `lindera/*` / `jieba/*` base tokenizers for CJK)
- Adjust precision vs. recall trade-offs (e.g., disable stemming for exact-match requirements)
- Handle special text formats (use `raw` for pre-tokenized text, `ngram` for substring matching)
- Optimize index size (disable position storage, adjust max token length)
Theoretical Basis
Tokenization Pipeline Architecture
The Lance tokenization pipeline processes text through the following stages in order:
Raw Text
|
v
[Base Tokenizer] -- splits text into initial tokens
|
v
[RemoveLongFilter] -- removes tokens exceeding max_token_length (default: 40)
|
v
[LowerCaser] -- converts all tokens to lowercase (default: enabled)
|
v
[Stemmer] -- reduces tokens to word stems using language-specific rules (default: enabled, English)
|
v
[StopWordFilter] -- removes common words like "the", "a", "is" (default: enabled, English)
|
v
[AsciiFoldingFilter] -- normalizes Unicode characters to ASCII equivalents (default: enabled)
|
v
Indexed Tokens
Each filter in the pipeline is optional and controlled by a boolean or option parameter. Filters that are disabled are simply skipped.
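The stages above can be sketched as a chain of plain Python transformations. The stemming step is a crude stand-in (the real pipeline applies Snowball rules), and the ASCII folding step is omitted here for brevity; everything else follows the documented order.

```python
import re

STOP_WORDS = {"the", "a", "is", "at", "of", "and"}  # tiny illustrative list

def tokenize_pipeline(text, max_token_length=40):
    # [Base Tokenizer: simple] split on whitespace and punctuation
    tokens = re.findall(r"\w+", text)
    # [RemoveLongFilter] drop tokens exceeding max_token_length
    tokens = [t for t in tokens if len(t) <= max_token_length]
    # [LowerCaser] normalize case
    tokens = [t.lower() for t in tokens]
    # [Stemmer] crude stand-in for Snowball rules: strip a plural "s"
    tokens = [t[:-1] if t.endswith("s") and len(t) > 3 else t for t in tokens]
    # [StopWordFilter] drop common low-information words
    tokens = [t for t in tokens if t not in STOP_WORDS]
    return tokens
```

A disabled filter would simply be skipped, i.e. its list comprehension would be left out of the chain.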
Base Tokenizer Strategies
| Name | Behavior | Best For |
|---|---|---|
| `simple` (default) | Splits on whitespace and punctuation | General-purpose English and European languages |
| `whitespace` | Splits only on whitespace | Preserving punctuation as part of tokens |
| `raw` | No splitting; entire input is one token | Pre-tokenized data or exact-match indexing |
| `ngram` | Generates character N-grams of configurable length | Substring search, autocomplete |
| `lindera/*` | Morphological analysis for Japanese/Korean/Chinese | CJK text segmentation |
| `jieba/*` | Chinese word segmentation | Chinese text segmentation |
Stemming
Stemming reduces inflected words to a common base form. For example, "running", "runs", and "ran" might all be reduced to "run". This increases recall (more documents match a query) at the cost of some precision.
Lance uses the Tantivy Stemmer which implements the Snowball stemming algorithm. The language parameter controls which Snowball rules are applied.
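Snowball rules are far more sophisticated, but the effect can be illustrated with a toy suffix stripper. This is illustrative only; it handles the regular forms ("running", "runs") but not irregular ones like "ran", which real stemmers also typically leave unchanged.

```python
def toy_stem(token):
    # Toy illustration only; the real pipeline applies Snowball rules.
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 2:
            token = token[: -len(suffix)]
            break
    # Collapse a doubled final consonant left by suffix removal ("runn" -> "run").
    if len(token) >= 3 and token[-1] == token[-2] and token[-1] not in "aeiou":
        token = token[:-1]
    return token
```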
Stop Word Removal
Stop words are extremely common words that carry little semantic meaning (e.g., "the", "is", "at"). Removing them reduces index size and improves query performance. Lance supports both built-in stop word lists (per language) and custom stop word lists.
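The combination of a built-in per-language list and a custom list can be sketched as a simple set union (the word lists here are tiny illustrative samples, not the actual built-in lists):

```python
# Tiny illustrative sample; real built-in lists are much larger.
BUILTIN_STOP_WORDS = {"english": {"the", "a", "is", "at", "of"}}

def stop_word_filter(tokens, language="english", custom=()):
    # Merge the built-in list for the language with any custom words.
    stop = BUILTIN_STOP_WORDS.get(language, set()) | set(custom)
    return [t for t in tokens if t not in stop]
```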
ASCII Folding
ASCII folding converts accented and special Unicode characters to their ASCII equivalents (e.g., "café" matches "cafe"). This is particularly useful for multilingual datasets or user-generated content with inconsistent diacritic usage.
N-Gram Tokenization
When the base tokenizer is set to ngram, the tokenizer generates all contiguous character substrings with length between min_ngram_length (default: 3) and max_ngram_length (default: 3). The prefix_only option restricts N-grams to prefixes of the original tokens, which is useful for autocomplete-style queries.
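The generation rule can be sketched directly; the parameter names mirror the documented ones, though the function itself is illustrative:

```python
def char_ngrams(token, min_ngram_length=3, max_ngram_length=3, prefix_only=False):
    # All contiguous substrings with length in [min_ngram_length, max_ngram_length];
    # with prefix_only=True, only the substrings starting at position 0.
    grams = []
    for n in range(min_ngram_length, max_ngram_length + 1):
        starts = [0] if prefix_only else range(len(token) - n + 1)
        grams.extend(token[i : i + n] for i in starts)
    return grams
```

With the defaults (3, 3), `"search"` yields the trigrams `sea`, `ear`, `arc`, `rch`; with prefix_only it yields only `sea`, which is the behavior autocomplete queries rely on.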