Implementation:Lance format Lance InvertedIndexParams

Knowledge Sources	Lance
Domains	Information_Retrieval, Full_Text_Search
Last Updated	2026-02-08 19:00 GMT

Overview

Concrete tool for configuring the tokenization pipeline of a Lance inverted index, provided by the lance-index crate.

Description

InvertedIndexParams is a serializable configuration struct that controls how text is tokenized for full-text search indexing: the base tokenizer, the language, and the token filters applied after it. It also carries the index-level options with_position and skip_merge. It uses a builder-style API in which each setter method consumes and returns self, so calls can be chained. The struct implements Default, which yields a configuration for English text with the "simple" base tokenizer and the length, lowercase, stemming, stop-word, and ASCII-folding filters enabled.

The build() method assembles the configured Tantivy tokenizer pipeline and returns a Result<Box<dyn LanceTokenizer>>; on success, the boxed tokenizer is ready for use by the inverted index builder.
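
The consuming-setter style described above can be illustrated with a minimal, std-only sketch. SketchParams is a hypothetical stand-in with a handful of fields, not the lance-index struct; its defaults are copied from the input table on this page.

```rust
// Hypothetical stand-in for illustration only; not the lance-index type.
#[derive(Debug, Clone, PartialEq)]
struct SketchParams {
    base_tokenizer: String,
    stem: bool,
    with_position: bool,
    max_token_length: Option<usize>,
}

impl Default for SketchParams {
    fn default() -> Self {
        // Values mirror the defaults listed in this page's input table.
        Self {
            base_tokenizer: "simple".to_owned(),
            stem: true,
            with_position: false,
            max_token_length: Some(40),
        }
    }
}

impl SketchParams {
    // Each setter takes `self` by value and returns it, which is what
    // makes chained calls possible.
    fn base_tokenizer(mut self, name: String) -> Self {
        self.base_tokenizer = name;
        self
    }

    fn stem(mut self, stem: bool) -> Self {
        self.stem = stem;
        self
    }

    fn with_position(mut self, with_position: bool) -> Self {
        self.with_position = with_position;
        self
    }

    fn max_token_length(mut self, limit: Option<usize>) -> Self {
        self.max_token_length = limit;
        self
    }
}

fn main() {
    // Start from the defaults and override only what differs.
    let params = SketchParams::default()
        .base_tokenizer("whitespace".to_owned())
        .stem(false)
        .with_position(true)
        .max_token_length(Some(100));
    println!("{:?}", params);
}
```

Because every setter moves the value, a partially configured params object cannot be reused by accident after a chain; clone it first if two variants are needed.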

Usage

Use InvertedIndexParams whenever creating a full-text search index on a Lance dataset. Pass the configured params to Dataset::create_index.

Code Reference

Source Location

Repository: Lance
File: rust/lance-index/src/scalar/inverted/tokenizer.rs
Lines: 28-98 (struct definition), 166-197 (constructor and defaults), 290-337 (pipeline build)

Signature

pub struct InvertedIndexParams {
    pub(crate) lance_tokenizer: Option<String>,
    pub(crate) base_tokenizer: String,
    pub(crate) language: tantivy::tokenizer::Language,
    pub(crate) with_position: bool,
    pub(crate) max_token_length: Option<usize>,
    pub(crate) lower_case: bool,
    pub(crate) stem: bool,
    pub(crate) remove_stop_words: bool,
    pub(crate) custom_stop_words: Option<Vec<String>>,
    pub(crate) ascii_folding: bool,
    pub(crate) min_ngram_length: u32,
    pub(crate) max_ngram_length: u32,
    pub(crate) prefix_only: bool,
    pub(crate) skip_merge: bool,
}

impl InvertedIndexParams {
    pub fn new(base_tokenizer: String, language: tantivy::tokenizer::Language) -> Self;
    pub fn default() -> Self; // provided through the Default impl; equivalent to new("simple".to_string(), Language::English)
    pub fn build(&self) -> Result<Box<dyn LanceTokenizer>>;
}

Import

use lance_index::scalar::InvertedIndexParams;
// or via re-export:
use lance_index::scalar::inverted::InvertedIndexParams;

I/O Contract

Inputs

Name	Type	Required	Description
base_tokenizer	`String`	No (default: `"simple"`)	Base tokenizer name: `"simple"`, `"whitespace"`, `"raw"`, `"ngram"`, `"lindera/"`, `"jieba/"`
language	`tantivy::tokenizer::Language`	No (default: `English`)	Language for stemming and stop word removal
with_position	`bool`	No (default: `false`)	Store token positions in the index (increases size, enables phrase queries)
max_token_length	`Option<usize>`	No (default: `Some(40)`)	Maximum token length; tokens longer than this are removed. `None` for no limit
lower_case	`bool`	No (default: `true`)	Convert tokens to lowercase
stem	`bool`	No (default: `true`)	Apply Snowball stemming using the configured language
remove_stop_words	`bool`	No (default: `true`)	Remove stop words using built-in or custom lists
custom_stop_words	`Option<Vec<String>>`	No (default: `None`)	Custom stop word list; overrides built-in list when set
ascii_folding	`bool`	No (default: `true`)	Fold Unicode characters to ASCII equivalents
min_ngram_length	`u32`	No (default: `3`)	Minimum N-gram length (only for `"ngram"` base tokenizer)
max_ngram_length	`u32`	No (default: `3`)	Maximum N-gram length (only for `"ngram"` base tokenizer)
prefix_only	`bool`	No (default: `false`)	Generate only prefix N-grams (only for `"ngram"` base tokenizer)
skip_merge	`bool`	No (default: `false`)	Skip partition merge after indexing (for distributed indexing)

Outputs

Name	Type	Description
tokenizer	`Result<Box<dyn LanceTokenizer>>`	On success, the assembled tokenization pipeline ready for use by `InvertedIndexBuilder`

Usage Examples

Default Configuration (English)

use lance_index::scalar::InvertedIndexParams;

// Default: simple tokenizer, English, all filters enabled
let params = InvertedIndexParams::default();

Custom Configuration

use lance_index::scalar::InvertedIndexParams;

// Whitespace tokenizer, no stemming, token positions stored, token length limit raised to 100
let params = InvertedIndexParams::default()
    .base_tokenizer("whitespace".to_string())
    .stem(false)
    .with_position(true)
    .max_token_length(Some(100));

N-Gram Configuration for Substring Search

use lance_index::scalar::InvertedIndexParams;
use tantivy::tokenizer::Language;

let params = InvertedIndexParams::new("ngram".to_string(), Language::English)
    .ngram_min_length(2)
    .ngram_max_length(4)
    .stem(false)
    .remove_stop_words(false);
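
To make the effect of the minimum and maximum n-gram lengths and of prefix_only concrete, here is a minimal std-only sketch of character n-gram generation for a single word. char_ngrams is a hypothetical helper, not part of lance-index, and the real "ngram" base tokenizer may differ in token boundaries, ordering, and Unicode handling.

```rust
/// Character n-grams of `word` with lengths in `min..=max`.
/// With `prefix_only`, only n-grams starting at the first character are kept.
/// Hypothetical helper for illustration; not the lance-index implementation.
fn char_ngrams(word: &str, min: usize, max: usize, prefix_only: bool) -> Vec<String> {
    let chars: Vec<char> = word.chars().collect();
    let mut grams = Vec::new();
    if min == 0 || min > max {
        return grams;
    }
    for start in 0..chars.len() {
        if prefix_only && start > 0 {
            break;
        }
        for len in min..=max {
            let end = start + len;
            if end > chars.len() {
                break;
            }
            grams.push(chars[start..end].iter().collect::<String>());
        }
    }
    grams
}

fn main() {
    // Lengths 2..=4, as in the configuration example above.
    for gram in char_ngrams("lance", 2, 4, false) {
        println!("{gram}");
    }
}
```

Indexing every short character window is what lets a query for a fragment such as "anc" match "lance"; the cost is a larger index, which is why prefix_only can be useful when only prefix matching is needed.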

Pipeline Build Order

The build() method assembles filters in the following fixed order:

// Pseudocode of the pipeline assembly in build(); filters run in the order they are added
let mut builder = build_base_tokenizer(); // "simple", "whitespace", "raw", "ngram", etc.
if let Some(limit) = max_token_length { builder = builder.filter(RemoveLongFilter::limit(limit)); }
if lower_case        { builder = builder.filter(LowerCaser); }
if stem              { builder = builder.filter(Stemmer::new(language)); }
if remove_stop_words { builder = builder.filter(StopWordFilter(language | custom)); } // custom list wins when set
if ascii_folding     { builder = builder.filter(AsciiFoldingFilter); }
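
The following std-only sketch mimics that order on a list of already-split tokens, to show why the order matters (for example, lowercasing before stop-word removal lets "The" match the stop word "the"). FilterFlags, fold_char, and run_filters are hypothetical names, stemming is left out, the folding table covers only two characters, and the length check follows this page's wording ("longer than") rather than the exact boundary used by the real filter.

```rust
/// Hypothetical switches mirroring a subset of the fields on this page.
struct FilterFlags {
    max_token_length: Option<usize>,
    lower_case: bool,
    remove_stop_words: bool,
    stop_words: Vec<&'static str>,
    ascii_folding: bool,
}

/// Tiny folding table for illustration; real ASCII folding covers far more.
fn fold_char(c: char) -> char {
    match c {
        '\u{e9}' => 'e', // e with acute accent
        '\u{fc}' => 'u', // u with diaeresis
        other => other,
    }
}

/// Applies the filters in the order listed above (stemming omitted).
fn run_filters(tokens: &[&str], flags: &FilterFlags) -> Vec<String> {
    let mut out = Vec::new();
    for &raw in tokens {
        // 1. Drop over-long tokens (byte length, assumed boundary).
        if let Some(limit) = flags.max_token_length {
            if raw.len() > limit {
                continue;
            }
        }
        // 2. Lowercase.
        let mut tok = if flags.lower_case {
            raw.to_lowercase()
        } else {
            raw.to_owned()
        };
        // 3. Stemming would run here; omitted in this sketch.
        // 4. Stop-word removal sees the lowercased token.
        if flags.remove_stop_words && flags.stop_words.iter().any(|w| *w == tok.as_str()) {
            continue;
        }
        // 5. ASCII folding runs last.
        if flags.ascii_folding {
            tok = tok.chars().map(fold_char).collect();
        }
        out.push(tok);
    }
    out
}

fn main() {
    let flags = FilterFlags {
        max_token_length: Some(40),
        lower_case: true,
        remove_stop_words: true,
        stop_words: vec!["the", "is"],
        ascii_folding: true,
    };
    let tokens = ["The", "Caf\u{e9}", "is", "Open"];
    println!("{:?}", run_filters(&tokens, &flags));
}
```

With lower_case switched off in this sketch, "The" would survive stop-word removal because the comparison is case-sensitive.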
