Implementation:Lance format Lance InvertedIndexParams

From Leeroopedia


Knowledge Sources
Domains Information_Retrieval, Full_Text_Search
Last Updated 2026-02-08 19:00 GMT

Overview

InvertedIndexParams is the concrete configuration type for the tokenization pipeline of a Lance inverted index. It is provided by the lance-index crate.

Description

InvertedIndexParams is a serializable configuration struct that controls how text is tokenized for full-text search indexing: the base tokenizer, the language, position storage, the token-length limit, and the filter chain. It uses a builder-style API in which each setter consumes self and returns it, so calls can be chained fluently. The struct implements Default, which yields a configuration for English text with the "simple" base tokenizer and lowercasing, stemming, stop-word removal, and ASCII folding all enabled; position storage is off by default.
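The consuming-setter pattern described above can be sketched with the standard library alone. SketchParams and whitespace_params below are hypothetical stand-ins written for illustration; they are not the lance-index types, and only four of the fields are modelled, with defaults copied from the inputs table on this page.

```rust
// Hypothetical stand-in for illustration only; this is NOT the lance-index struct.
#[derive(Debug, Clone, PartialEq)]
pub struct SketchParams {
    pub base_tokenizer: String,
    pub stem: bool,
    pub with_position: bool,
    pub max_token_length: Option<usize>,
}

impl Default for SketchParams {
    fn default() -> Self {
        // Defaults taken from the inputs table for these four fields.
        Self {
            base_tokenizer: "simple".to_string(),
            stem: true,
            with_position: false,
            max_token_length: Some(40),
        }
    }
}

impl SketchParams {
    // Each setter takes `self` by value, mutates it, and hands it back,
    // which is what makes the fluent chain possible.
    pub fn base_tokenizer(mut self, name: String) -> Self {
        self.base_tokenizer = name;
        self
    }

    pub fn stem(mut self, stem: bool) -> Self {
        self.stem = stem;
        self
    }

    pub fn with_position(mut self, with_position: bool) -> Self {
        self.with_position = with_position;
        self
    }

    pub fn max_token_length(mut self, limit: Option<usize>) -> Self {
        self.max_token_length = limit;
        self
    }
}

// Start from the defaults and override only what differs.
pub fn whitespace_params() -> SketchParams {
    SketchParams::default()
        .base_tokenizer("whitespace".to_string())
        .stem(false)
        .with_position(true)
        .max_token_length(Some(100))
}
```

Because every setter consumes self, a chain cannot leave behind a partially configured value; the result must be bound to a variable or passed on.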

The build() method assembles the configured Tantivy tokenizer pipeline and, on success, returns a Box<dyn LanceTokenizer> ready for use by the inverted index builder. Its return type is a Result, so callers need to handle the error case.

Usage

Use InvertedIndexParams whenever creating a full-text search index on a Lance dataset. Pass the configured params to Dataset::create_index.

Code Reference

Source Location

  • Repository: Lance
  • File: rust/lance-index/src/scalar/inverted/tokenizer.rs
  • Lines: 28-98 (struct definition), 166-197 (constructor and defaults), 290-337 (pipeline build)

Signature

pub struct InvertedIndexParams {
    pub(crate) lance_tokenizer: Option<String>,
    pub(crate) base_tokenizer: String,
    pub(crate) language: tantivy::tokenizer::Language,
    pub(crate) with_position: bool,
    pub(crate) max_token_length: Option<usize>,
    pub(crate) lower_case: bool,
    pub(crate) stem: bool,
    pub(crate) remove_stop_words: bool,
    pub(crate) custom_stop_words: Option<Vec<String>>,
    pub(crate) ascii_folding: bool,
    pub(crate) min_ngram_length: u32,
    pub(crate) max_ngram_length: u32,
    pub(crate) prefix_only: bool,
    pub(crate) skip_merge: bool,
}

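impl InvertedIndexParams {
    pub fn new(base_tokenizer: String, language: tantivy::tokenizer::Language) -> Self;
    pub fn build(&self) -> Result<Box<dyn LanceTokenizer>>;
}

// default() comes from the Default trait implementation, not an inherent method
impl Default for InvertedIndexParams {
    fn default() -> Self; // equivalent to new("simple".to_string(), Language::English)
}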

Import

use lance_index::scalar::InvertedIndexParams;
// or via re-export:
use lance_index::scalar::inverted::InvertedIndexParams;

I/O Contract

Inputs

| Name | Type | Required | Description |
| --- | --- | --- | --- |
| base_tokenizer | String | No (default: "simple") | Base tokenizer name: "simple", "whitespace", "raw", "ngram", "lindera/*", "jieba/*" |
| language | tantivy::tokenizer::Language | No (default: English) | Language for stemming and stop word removal |
| with_position | bool | No (default: false) | Store token positions in the index (increases size, enables phrase queries) |
| max_token_length | Option<usize> | No (default: Some(40)) | Maximum token length; tokens longer than this are removed. None for no limit |
| lower_case | bool | No (default: true) | Convert tokens to lowercase |
| stem | bool | No (default: true) | Apply Snowball stemming using the configured language |
| remove_stop_words | bool | No (default: true) | Remove stop words using built-in or custom lists |
| custom_stop_words | Option<Vec<String>> | No (default: None) | Custom stop word list; overrides built-in list when set |
| ascii_folding | bool | No (default: true) | Fold Unicode characters to ASCII equivalents |
| min_ngram_length | u32 | No (default: 3) | Minimum N-gram length (only for "ngram" base tokenizer) |
| max_ngram_length | u32 | No (default: 3) | Maximum N-gram length (only for "ngram" base tokenizer) |
| prefix_only | bool | No (default: false) | Generate only prefix N-grams (only for "ngram" base tokenizer) |
| skip_merge | bool | No (default: false) | Skip partition merge after indexing (for distributed indexing) |

Outputs

| Name | Type | Description |
| --- | --- | --- |
| tokenizer | Box<dyn LanceTokenizer> | Assembled tokenization pipeline ready for use by InvertedIndexBuilder |

Usage Examples

Default Configuration (English)

use lance_index::scalar::InvertedIndexParams;

// Default: simple tokenizer, English, all filters enabled
let params = InvertedIndexParams::default();

Custom Configuration

use lance_index::scalar::InvertedIndexParams;

// Whitespace tokenizer, no stemming, position storage on, token length limit raised to 100
let params = InvertedIndexParams::default()
    .base_tokenizer("whitespace".to_string())
    .stem(false)
    .with_position(true)
    .max_token_length(Some(100));

N-Gram Configuration for Substring Search

use lance_index::scalar::InvertedIndexParams;
use tantivy::tokenizer::Language;

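// N-grams of length 2 to 4; stemming and stop word removal are turned off
// so that the raw character sequences are indexed unchanged
let params = InvertedIndexParams::new("ngram".to_string(), Language::English)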
    .ngram_min_length(2)
    .ngram_max_length(4)
    .stem(false)
    .remove_stop_words(false);
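To make the three N-gram parameters concrete, here is a minimal standard-library sketch of character N-gram generation. The function char_ngrams is hypothetical and is not the tokenizer that Lance or Tantivy ships; the real "ngram" base tokenizer may differ in details such as how it treats word boundaries. The sketch only shows how a minimum length, a maximum length, and a prefix-only switch interact.

```rust
/// Character n-grams of `text` with lengths in `min..=max`.
/// With `prefix_only`, only grams that start at the first character are produced.
/// Illustrative helper, not part of lance-index or tantivy.
pub fn char_ngrams(text: &str, min: usize, max: usize, prefix_only: bool) -> Vec<String> {
    let chars: Vec<char> = text.chars().collect();
    let mut grams: Vec<String> = Vec::new();
    if min == 0 || min > max {
        return grams;
    }
    // In prefix-only mode the only allowed start position is 0.
    let last_start = if prefix_only { 0 } else { chars.len().saturating_sub(1) };
    for start in 0..=last_start {
        for len in min..=max {
            let end = start + len;
            if end > chars.len() {
                break; // longer grams from this start would run past the text
            }
            grams.push(chars[start..end].iter().collect::<String>());
        }
    }
    grams
}
```

With the lengths used in the example above, char_ngrams("hello", 2, 4, false) yields he, hel, hell, el, ell, ello, ll, llo, lo, while prefix_only = true yields only he, hel, hell. Indexing every gram is what makes substring matching possible, at the cost of a larger index.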

Pipeline Build Order

The build() method assembles filters in the following fixed order:

// Pseudocode of the pipeline assembly in build(); filters run in the order they are added
let mut builder = build_base_tokenizer(); // "simple", "whitespace", "raw", "ngram", etc.
if let Some(limit) = max_token_length { builder = builder.filter(RemoveLongFilter::limit(limit)); }
if lower_case               { builder = builder.filter(LowerCaser); }
if stem                     { builder = builder.filter(Stemmer::new(language)); }
if remove_stop_words        { builder = builder.filter(StopWordFilter(language | custom)); } // custom list replaces the built-in one
if ascii_folding            { builder = builder.filter(AsciiFoldingFilter); }
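The effect of this ordering can be tried out with a toy pipeline built only on the standard library. Everything in the sketch is a hypothetical stand-in: SketchFlags, sketch_pipeline and fold_char are not Lance or Tantivy APIs, stemming is left out, the folding table covers only a few vowels, and the exact length boundary of the real long-token filter is an assumption.

```rust
// Illustrative stand-in for the flags that drive the pipeline; NOT the Lance struct.
pub struct SketchFlags {
    pub max_token_length: Option<usize>,
    pub lower_case: bool,
    pub remove_stop_words: bool,
    pub ascii_folding: bool,
}

// Toy folding table covering a handful of accented vowels only.
fn fold_char(c: char) -> char {
    match c {
        'à' | 'á' | 'â' | 'ä' => 'a',
        'è' | 'é' | 'ê' | 'ë' => 'e',
        'ì' | 'í' | 'î' | 'ï' => 'i',
        'ò' | 'ó' | 'ô' | 'ö' => 'o',
        'ù' | 'ú' | 'û' | 'ü' => 'u',
        other => other,
    }
}

pub fn sketch_pipeline(text: &str, flags: &SketchFlags, stop_words: &[&str]) -> Vec<String> {
    // Base tokenizer: split on anything that is not alphanumeric.
    let mut tokens: Vec<String> = text
        .split(|c: char| !c.is_alphanumeric())
        .filter(|t| !t.is_empty())
        .map(str::to_string)
        .collect();

    // 1. Drop over-long tokens (byte length above the limit; boundary is an assumption).
    if let Some(limit) = flags.max_token_length {
        tokens.retain(|t| t.len() <= limit);
    }
    // 2. Lowercase.
    if flags.lower_case {
        tokens = tokens.iter().map(|t| t.to_lowercase()).collect();
    }
    // 3. Stemming would run here; omitted because Snowball is out of scope for a sketch.
    // 4. Stop word removal sees whatever casing step 2 left behind.
    if flags.remove_stop_words {
        tokens.retain(|t| !stop_words.contains(&t.as_str()));
    }
    // 5. ASCII folding runs last.
    if flags.ascii_folding {
        tokens = tokens
            .iter()
            .map(|t| t.chars().map(fold_char).collect::<String>())
            .collect();
    }
    tokens
}
```

With all flags on, a limit of 10 and the single stop word "the", the input "The Café serves supercalifragilistic pastries" comes out as cafe, serves, pastries. Turning lower_case off leaves "The" in the output, because in this sketch the lowercase stop word list no longer matches the capitalised token; that is the kind of interaction a fixed filter order makes predictable.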
