Implementation:Lance format Lance InvertedIndexParams
| Knowledge Sources | |
|---|---|
| Domains | Information_Retrieval, Full_Text_Search |
| Last Updated | 2026-02-08 19:00 GMT |
Overview
Concrete tool for configuring the tokenization pipeline of a Lance inverted index, provided by the lance-index crate.
Description
InvertedIndexParams is a serializable configuration struct that controls every aspect of how text is tokenized for full-text search indexing. It uses a builder-style API where each setter method consumes and returns self, enabling fluent configuration chains. The struct implements Default, producing a production-ready configuration for English text with the simple base tokenizer and all standard filters enabled.
The build() method assembles the configured Tantivy tokenizer pipeline and returns a Box<dyn LanceTokenizer> ready for use by the inverted index builder.
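The consuming-setter style described above can be sketched with a small stand-in type. `SketchParams` and its four fields are hypothetical and exist only for illustration; this is not the lance-index struct, and only the default values are taken from the inputs table on this page.

```rust
// Hypothetical stand-in for illustration only; NOT the lance-index type.
#[derive(Debug, Clone, PartialEq)]
pub struct SketchParams {
    pub base_tokenizer: String,
    pub stem: bool,
    pub with_position: bool,
    pub max_token_length: Option<usize>,
}

impl Default for SketchParams {
    fn default() -> Self {
        // Values copied from the defaults documented on this page.
        Self {
            base_tokenizer: "simple".to_string(),
            stem: true,
            with_position: false,
            max_token_length: Some(40),
        }
    }
}

impl SketchParams {
    // Each setter takes `self` by value, mutates it, and hands it back,
    // which is what allows the calls to be chained.
    pub fn base_tokenizer(mut self, name: String) -> Self {
        self.base_tokenizer = name;
        self
    }

    pub fn stem(mut self, stem: bool) -> Self {
        self.stem = stem;
        self
    }

    pub fn with_position(mut self, with_position: bool) -> Self {
        self.with_position = with_position;
        self
    }

    pub fn max_token_length(mut self, limit: Option<usize>) -> Self {
        self.max_token_length = limit;
        self
    }
}
```

Because every setter moves the value, a chain such as `SketchParams::default().stem(false).with_position(true)` needs no intermediate `mut` binding; fields that are not touched keep their defaults.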
Usage
Use InvertedIndexParams whenever creating a full-text search index on a Lance dataset. Pass the configured params to Dataset::create_index.
Code Reference
Source Location
- Repository: Lance
- File: rust/lance-index/src/scalar/inverted/tokenizer.rs
- Lines: 28-98 (struct definition), 166-197 (constructor and defaults), 290-337 (pipeline build)
Signature
pub struct InvertedIndexParams {
pub(crate) lance_tokenizer: Option<String>,
pub(crate) base_tokenizer: String,
pub(crate) language: tantivy::tokenizer::Language,
pub(crate) with_position: bool,
pub(crate) max_token_length: Option<usize>,
pub(crate) lower_case: bool,
pub(crate) stem: bool,
pub(crate) remove_stop_words: bool,
pub(crate) custom_stop_words: Option<Vec<String>>,
pub(crate) ascii_folding: bool,
pub(crate) min_ngram_length: u32,
pub(crate) max_ngram_length: u32,
pub(crate) prefix_only: bool,
pub(crate) skip_merge: bool,
}
impl InvertedIndexParams {
pub fn new(base_tokenizer: String, language: tantivy::tokenizer::Language) -> Self;
pub fn default() -> Self; // via the Default impl; equivalent to new("simple".to_string(), Language::English)
pub fn build(&self) -> Result<Box<dyn LanceTokenizer>>;
}
Import
use lance_index::scalar::InvertedIndexParams;
// or, equivalently, via the inverted module:
use lance_index::scalar::inverted::InvertedIndexParams;
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| base_tokenizer | String | No (default: "simple") | Base tokenizer name: "simple", "whitespace", "raw", "ngram", "lindera/*", "jieba/*" |
| language | tantivy::tokenizer::Language | No (default: English) | Language for stemming and stop word removal |
| with_position | bool | No (default: false) | Store token positions in the index (increases size, enables phrase queries) |
| max_token_length | Option<usize> | No (default: Some(40)) | Maximum token length; tokens longer than this are removed. None for no limit |
| lower_case | bool | No (default: true) | Convert tokens to lowercase |
| stem | bool | No (default: true) | Apply Snowball stemming using the configured language |
| remove_stop_words | bool | No (default: true) | Remove stop words using built-in or custom lists |
| custom_stop_words | Option<Vec<String>> | No (default: None) | Custom stop word list; overrides built-in list when set |
| ascii_folding | bool | No (default: true) | Fold Unicode characters to ASCII equivalents |
| min_ngram_length | u32 | No (default: 3) | Minimum N-gram length (only for "ngram" base tokenizer) |
| max_ngram_length | u32 | No (default: 3) | Maximum N-gram length (only for "ngram" base tokenizer) |
| prefix_only | bool | No (default: false) | Generate only prefix N-grams (only for "ngram" base tokenizer) |
| skip_merge | bool | No (default: false) | Skip partition merge after indexing (for distributed indexing) |
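The interaction between remove_stop_words and custom_stop_words in the table above can be written out as a small decision function. This is a sketch of the documented rule only (a custom list overrides the built-in one when set); `effective_stop_words` and its `builtin_for_language` argument are hypothetical names, and how the crate obtains its built-in list is not shown here.

```rust
/// Returns the stop-word list a filter would use, or `None` when stop-word
/// removal is disabled. Hypothetical helper, not part of lance-index.
pub fn effective_stop_words(
    remove_stop_words: bool,
    custom_stop_words: Option<&[String]>,
    builtin_for_language: &[&str], // assumed to be supplied by the caller
) -> Option<Vec<String>> {
    if !remove_stop_words {
        // The filter is skipped entirely, whatever lists are available.
        return None;
    }
    match custom_stop_words {
        // A custom list replaces the built-in list; the two are not merged.
        Some(custom) => Some(custom.to_vec()),
        None => Some(builtin_for_language.iter().map(|w| w.to_string()).collect()),
    }
}
```

Under this reading, supplying a custom list while leaving remove_stop_words set to false has no effect on tokenization.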
Outputs
| Name | Type | Description |
|---|---|---|
| tokenizer | Box<dyn LanceTokenizer> | Assembled tokenization pipeline ready for use by InvertedIndexBuilder |
Usage Examples
Default Configuration (English)
use lance_index::scalar::InvertedIndexParams;
// Default: simple tokenizer, English, all filters enabled
let params = InvertedIndexParams::default();
Custom Configuration
use lance_index::scalar::InvertedIndexParams;
// Whitespace tokenizer, no stemming, with position storage
let params = InvertedIndexParams::default()
.base_tokenizer("whitespace".to_string())
.stem(false)
.with_position(true)
.max_token_length(Some(100));
N-Gram Configuration for Substring Search
use lance_index::scalar::InvertedIndexParams;
use tantivy::tokenizer::Language;
let params = InvertedIndexParams::new("ngram".to_string(), Language::English)
.ngram_min_length(2)
.ngram_max_length(4)
.stem(false)
.remove_stop_words(false);
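To make the effect of the minimum length, maximum length, and prefix-only settings concrete, here is a minimal std-only sketch of character N-gram generation. `char_ngrams` is a hypothetical helper written for this page; the emission order and edge-case handling of the tokenizer actually used by Lance may differ.

```rust
/// Character n-grams of `text` with lengths in `min..=max`.
/// With `prefix_only`, only n-grams that start at the first character are kept.
/// Illustrative only; not the tokenizer used by lance-index.
pub fn char_ngrams(text: &str, min: usize, max: usize, prefix_only: bool) -> Vec<String> {
    // Work on chars rather than bytes so multi-byte characters stay intact.
    let chars: Vec<char> = text.chars().collect();
    let mut out = Vec::new();
    if min == 0 || min > max {
        return out;
    }
    // Prefix mode only ever starts at position 0.
    let last_start = if prefix_only { 0 } else { chars.len().saturating_sub(1) };
    for start in 0..=last_start {
        for len in min..=max {
            let end = start + len;
            if end > chars.len() {
                break; // longer n-grams from this start would run past the text
            }
            out.push(chars[start..end].iter().collect());
        }
    }
    out
}
```

With lengths 2 to 3, `char_ngrams("lance", 2, 3, false)` yields `la, lan, an, anc, nc, nce, ce`, while the prefix-only variant yields just `la, lan`. A wider length range lets shorter substrings match, at the cost of more tokens per document.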
Pipeline Build Order
The build() method assembles filters in the following fixed order:
// Pseudocode of the pipeline assembly in build()
let mut builder = build_base_tokenizer(); // "simple", "whitespace", "raw", "ngram", etc.
if max_token_length.is_some() { builder = builder.filter(RemoveLongFilter); }
if lower_case { builder = builder.filter(LowerCaser); }
if stem { builder = builder.filter(Stemmer(language)); }
if remove_stop_words { builder = builder.filter(StopWordFilter(language | custom)); }
if ascii_folding { builder = builder.filter(AsciiFoldingFilter); }
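The order above matters because each filter sees the output of the previous one. The following std-only sketch mimics that ordering with toy filters to show one consequence: lowercasing runs before stop-word removal, so "The" is matched against a lowercase stop-word list. `SketchPipeline` and `fold_char` are hypothetical, the stemming stage is omitted because it needs a Snowball stemmer, and none of this is the Tantivy implementation.

```rust
/// Toy stand-ins for the filter stages; NOT the Tantivy filters.
pub struct SketchPipeline {
    pub max_token_length: Option<usize>,
    pub lower_case: bool,
    pub stop_words: Option<Vec<String>>, // None = stop-word removal disabled
    pub ascii_folding: bool,
}

// Toy folding table covering a handful of accented Latin letters only.
fn fold_char(c: char) -> char {
    match c {
        'á' | 'à' | 'â' | 'ä' => 'a',
        'é' | 'è' | 'ê' | 'ë' => 'e',
        'í' | 'ì' | 'î' | 'ï' => 'i',
        'ó' | 'ò' | 'ô' | 'ö' => 'o',
        'ú' | 'ù' | 'û' | 'ü' => 'u',
        other => other,
    }
}

impl SketchPipeline {
    pub fn run(&self, text: &str) -> Vec<String> {
        // Base tokenizer stand-in: split on whitespace.
        let mut tokens: Vec<String> = text.split_whitespace().map(str::to_string).collect();
        // 1. Drop over-long tokens (byte length; the exact boundary used by
        //    the real filter is an assumption here).
        if let Some(limit) = self.max_token_length {
            tokens.retain(|t| t.len() <= limit);
        }
        // 2. Lowercase.
        if self.lower_case {
            tokens = tokens.iter().map(|t| t.to_lowercase()).collect();
        }
        // 3. Stemming would run here; omitted in this sketch.
        // 4. Stop-word removal sees already-lowercased tokens.
        if let Some(stop) = &self.stop_words {
            tokens.retain(|t| !stop.contains(t));
        }
        // 5. ASCII folding runs last.
        if self.ascii_folding {
            tokens = tokens.iter().map(|t| t.chars().map(fold_char).collect()).collect();
        }
        tokens
    }
}
```

With a limit of 10, lowercasing on, the stop word "the", and folding on, the input "The Café supercalifragilistic" reduces to the single token `cafe`: the long word is dropped first, "The" becomes "the" and is then removed, and "café" is folded at the end.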