
Principle:Lance format Lance Tokenizer Configuration

From Leeroopedia


Knowledge Sources
Domains Information_Retrieval, Full_Text_Search
Last Updated 2026-02-08 19:00 GMT

Overview

Tokenizer configuration defines how raw text is decomposed into indexable tokens through a pipeline of a base tokenizer followed by a chain of configurable token filters.

Description

Full-text search indexes operate on tokens -- discrete units extracted from source text. The quality and behavior of full-text search is fundamentally determined by the tokenization pipeline. Lance uses a pipeline architecture built on top of the Tantivy tokenizer framework, where a base tokenizer splits raw text into initial tokens, and a series of token filters transform those tokens before they are written to the inverted index.

The pipeline is fully configurable through the InvertedIndexParams struct. Every parameter has a sensible default so that InvertedIndexParams::default() produces a production-ready tokenizer for English text.

Usage

Configure tokenization whenever you need to:

  • Support a language other than English (change language, or use lindera/* / jieba/* base tokenizers for CJK)
  • Adjust precision vs. recall trade-offs (e.g., disable stemming for exact-match requirements)
  • Handle special text formats (use raw for pre-tokenized text, ngram for substring matching)
  • Optimize index size (disable position storage, adjust max token length)
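As a sketch of how these knobs are exposed from Python, the call below uses the `lancedb` `create_fts_index` API (`use_tantivy=False` selects Lance's native inverted index). Parameter names and defaults can differ between `lancedb` releases, so treat this as an assumption to verify against your installed version rather than a definitive signature.

```python
# Sketch only: parameter names follow recent lancedb Python releases;
# verify against your installed version.
import lancedb

db = lancedb.connect("/tmp/lance-demo")     # hypothetical local path
tbl = db.open_table("docs")                 # hypothetical table with a "text" column

tbl.create_fts_index(
    "text",
    use_tantivy=False,        # use Lance's native inverted index
    base_tokenizer="simple",  # simple | whitespace | raw | ngram | ...
    language="English",       # Snowball stemmer / stop-word language
    max_token_length=40,      # RemoveLongFilter threshold
    lower_case=True,          # LowerCaser
    stem=True,                # Stemmer
    remove_stop_words=True,   # StopWordFilter
    ascii_folding=True,       # AsciiFoldingFilter
)
```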

Theoretical Basis

Tokenization Pipeline Architecture

The Lance tokenization pipeline processes text through the following stages in order:

Raw Text
    |
    v
[Base Tokenizer]  -- splits text into initial tokens
    |
    v
[RemoveLongFilter] -- removes tokens exceeding max_token_length (default: 40)
    |
    v
[LowerCaser]       -- converts all tokens to lowercase (default: enabled)
    |
    v
[Stemmer]          -- reduces tokens to word stems using language-specific rules (default: enabled, English)
    |
    v
[StopWordFilter]   -- removes common words like "the", "a", "is" (default: enabled, English)
    |
    v
[AsciiFoldingFilter] -- normalizes Unicode characters to ASCII equivalents (default: enabled)
    |
    v
Indexed Tokens

Each filter in the pipeline is optional and controlled by a boolean or option parameter. Filters that are disabled are simply skipped.
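The staged design above can be modeled as a chain of functions, each consuming and producing a token list. This is an illustration of the architecture only (not Lance's code): the base tokenizer splits, then each enabled filter transforms the stream in order, and disabled filters are simply left out of the chain. The stemmer stage is omitted here for brevity.

```python
import re

# Minimal model of the pipeline: base tokenizer + ordered filter chain.
MAX_TOKEN_LENGTH = 40
STOP_WORDS = {"the", "a", "is", "at"}   # tiny illustrative list

def simple_tokenizer(text):
    # "simple" base tokenizer: split on whitespace and punctuation
    return re.findall(r"\w+", text)

def remove_long(tokens):
    # RemoveLongFilter: drop tokens over the length limit
    return [t for t in tokens if len(t) <= MAX_TOKEN_LENGTH]

def lower_caser(tokens):
    return [t.lower() for t in tokens]

def stop_word_filter(tokens):
    return [t for t in tokens if t not in STOP_WORDS]

def tokenize(text, filters=(remove_long, lower_caser, stop_word_filter)):
    tokens = simple_tokenizer(text)
    for f in filters:           # disabled filters are simply absent
        tokens = f(tokens)
    return tokens

print(tokenize("The tokenizer IS a pipeline!"))  # ['tokenizer', 'pipeline']
```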

Base Tokenizer Strategies

Name | Behavior | Best For
simple (default) | Splits on whitespace and punctuation | General-purpose English and European languages
whitespace | Splits only on whitespace | Preserving punctuation as part of tokens
raw | No splitting; entire input is one token | Pre-tokenized data or exact-match indexing
ngram | Generates character N-grams of configurable length | Substring search, autocomplete
lindera/* | Morphological analysis for Japanese/Korean/Chinese | CJK text segmentation
jieba/* | Chinese word segmentation | Chinese text segmentation

Stemming

Stemming reduces inflected words to a common base form. For example, "running" and "runs" are both reduced to "run"; irregular forms such as "ran" generally are not, because stemming applies suffix-stripping rules rather than dictionary lookup. Stemming increases recall (more documents match a query) at the cost of some precision.

Lance uses the Tantivy Stemmer which implements the Snowball stemming algorithm. The language parameter controls which Snowball rules are applied.
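To make the mechanism concrete, here is a deliberately crude suffix-stripping stemmer. This toy is not Snowball: Tantivy's Snowball implementation applies a full language-specific rule set (measure conditions, suffix precedence, recoding), while this sketch only strips a few common English suffixes.

```python
# Illustration only: crude suffix stripping to show the idea of stemming.
# Real Snowball rules are far more elaborate.
def naive_stem(token):
    for suffix in ("ing", "es", "s"):
        # keep at least a 3-character stem
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

print([naive_stem(t) for t in ["runs", "jumping", "run"]])  # ['run', 'jump', 'run']
```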

Stop Word Removal

Stop words are extremely common words that carry little semantic meaning (e.g., "the", "is", "at"). Removing them reduces index size and improves query performance. Lance supports both built-in stop word lists (per language) and custom stop word lists.
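The built-in-plus-custom behavior can be sketched as a set union over the token stream; the word list below is a tiny illustrative subset, not Lance's actual per-language list.

```python
# Illustration: stop-word removal with a built-in list plus custom additions.
ENGLISH_STOP_WORDS = {"the", "is", "at", "a", "an", "of"}  # abbreviated sample

def remove_stop_words(tokens, extra=frozenset()):
    stop = ENGLISH_STOP_WORDS | set(extra)
    return [t for t in tokens if t not in stop]

print(remove_stop_words(["the", "cat", "is", "at", "home"]))          # ['cat', 'home']
print(remove_stop_words(["acme", "widget", "manual"], extra={"acme"}))  # ['widget', 'manual']
```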

ASCII Folding

ASCII folding converts accented and special Unicode characters to their ASCII equivalents (e.g., "café" matches "cafe"). This is particularly useful for multilingual datasets or user-generated content with inconsistent diacritic usage.
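A common way to approximate this is Unicode NFKD decomposition followed by dropping combining marks, shown below. Note this is an approximation of AsciiFoldingFilter's behavior: the real filter also maps characters that NFKD leaves alone (for example ligatures like "ß").

```python
import unicodedata

# Fold accented characters to ASCII: decompose to NFKD, drop combining marks.
def ascii_fold(token):
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

print(ascii_fold("café"))   # cafe
print(ascii_fold("naïve"))  # naive
```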

N-Gram Tokenization

When the base tokenizer is set to ngram, the tokenizer generates all contiguous character substrings of length between min_ngram_length (default: 3) and max_ngram_length (default: 3). The prefix_only option restricts N-grams to prefixes of the original tokens, which is useful for autocomplete-style queries.
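The parameters described above map directly onto a small generator function; this sketch mirrors the min/max-length and prefix_only semantics, not Lance's internal implementation.

```python
# Generate character N-grams of each length in [min_len, max_len].
# With prefix_only, keep only the N-gram starting at position 0.
def ngrams(token, min_len=3, max_len=3, prefix_only=False):
    out = []
    for n in range(min_len, max_len + 1):
        if prefix_only:
            if len(token) >= n:
                out.append(token[:n])
        else:
            out.extend(token[i:i + n] for i in range(len(token) - n + 1))
    return out

print(ngrams("lance"))                    # ['lan', 'anc', 'nce']
print(ngrams("lance", prefix_only=True))  # ['lan']
```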
