Implementation:Lance format Lance Text Column Schema

From Leeroopedia


Knowledge Sources
Domains Information_Retrieval, Full_Text_Search
Last Updated 2026-02-08 19:00 GMT

Overview

Concrete pattern for defining text column schemas compatible with Lance full-text search indexing.

Description

This pattern describes the constraints an Arrow schema must satisfy for its columns to be indexed for full-text search. The inverted index builder validates each column's DataType at training time, in new_training_request, and rejects unsupported types with an InvalidInput error. Users must therefore ensure that every field to be indexed uses one of the seven accepted data types before writing data to the dataset.

Usage

Apply this pattern whenever creating a new Lance dataset that will contain text data intended for full-text search. The schema definition happens before any data is written and before create_index is called.

Interface Specification

Source Location

  • Repository: Lance
  • File: rust/lance-index/src/scalar/inverted.rs
  • Lines: 112-134 (type validation in new_training_request)

Accepted Type Pattern

The column field passed to the inverted index must match one of the following Arrow DataType variants:

DataType | Description | Typical Use Case
DataType::Utf8 | Standard variable-length UTF-8 string | Short text fields, titles, labels
DataType::LargeUtf8 | UTF-8 string with 64-bit offsets | Full documents, articles, long-form text
DataType::LargeBinary | Raw binary interpreted as text | Pre-encoded or mixed-encoding content
DataType::List(Utf8) | List of UTF-8 strings per row | Tags, multi-paragraph fields
DataType::List(LargeUtf8) | List of large UTF-8 strings per row | Multiple large text segments per row
DataType::LargeList(Utf8) | Large list (64-bit offsets) of UTF-8 strings | Very large multi-valued text fields
DataType::LargeList(LargeUtf8) | Large list of large UTF-8 strings | Maximum-flexibility multi-valued text
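
The offset widths in the table suggest a rule of thumb for choosing between Utf8 and LargeUtf8. The sketch below is illustrative only: StringWidth, pick_string_width, and total_utf8_bytes are hypothetical names that exist in neither Lance nor arrow-rs. It assumes the sole constraint is that the 32-bit signed offsets of a Utf8 array cannot address more than i32::MAX bytes of string data in a single Arrow array.

```rust
/// Hypothetical label for the two string layouts, judged only by offset width.
#[derive(Debug, PartialEq)]
enum StringWidth {
    Utf8,      // 32-bit offsets
    LargeUtf8, // 64-bit offsets
}

/// Picks the narrowest layout whose offsets can address `total_bytes`
/// of string data held in one Arrow array (assumption: no other constraint).
fn pick_string_width(total_bytes: u64) -> StringWidth {
    if total_bytes <= i32::MAX as u64 {
        StringWidth::Utf8
    } else {
        StringWidth::LargeUtf8
    }
}

/// Sums the UTF-8 byte length (not the character count) of every value
/// destined for a single array.
fn total_utf8_bytes(values: &[&str]) -> u64 {
    values.iter().map(|v| v.len() as u64).sum()
}
```

In practice data is usually written in batches well below this limit, so the choice often comes down to the size of the largest batch and to whichever type the upstream data already uses.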

Validation Logic

// From rust/lance-index/src/scalar/inverted.rs, lines 117-130
match field.data_type() {
    DataType::Utf8 | DataType::LargeUtf8 | DataType::LargeBinary => (),
    DataType::List(f) if matches!(f.data_type(), DataType::Utf8 | DataType::LargeUtf8) => (),
    DataType::LargeList(f) if matches!(f.data_type(), DataType::Utf8 | DataType::LargeUtf8) => (),
    _ => return Err(Error::InvalidInput {
        source: format!(
            "A inverted index can only be created on a Utf8 or LargeUtf8 field/list \
             or LargeBinary field. Column has type {:?}",
            field.data_type()
        ).into(),
        location: location!(),
    })
}
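
To catch an unsupported column before index creation, the same rule can be mirrored in a small pre-flight check. This is a minimal sketch, not Lance's code: ColType is a hypothetical stand-in for the relevant arrow_schema::DataType variants, used so the example compiles with the standard library alone, and check_fts_column is an invented helper.

```rust
/// Hypothetical stand-in for the Arrow data types discussed on this page.
/// Real code would match on arrow_schema::DataType instead.
#[derive(Debug, Clone, PartialEq)]
enum ColType {
    Utf8,
    LargeUtf8,
    LargeBinary,
    Binary,
    UInt64,
    List(Box<ColType>),
    LargeList(Box<ColType>),
}

/// True only for the two string element types allowed inside a list.
fn is_string(ty: &ColType) -> bool {
    matches!(ty, ColType::Utf8 | ColType::LargeUtf8)
}

/// Returns Ok(()) for the seven accepted shapes and an error message otherwise,
/// following the structure of the match shown above.
fn check_fts_column(name: &str, ty: &ColType) -> Result<(), String> {
    let accepted = match ty {
        ColType::Utf8 | ColType::LargeUtf8 | ColType::LargeBinary => true,
        // Lists qualify only when their items are Utf8 or LargeUtf8.
        ColType::List(item) | ColType::LargeList(item) => is_string(item),
        _ => false,
    };
    if accepted {
        Ok(())
    } else {
        Err(format!(
            "column `{}` has type {:?}, which the inverted index rejects",
            name, ty
        ))
    }
}
```

For example, check_fts_column("id", &ColType::UInt64) returns an Err, while a List whose items are Utf8 passes. Running such a check on the schema before writing data surfaces a type mismatch earlier than the index builder would.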

Usage Examples

Single Document Column (LargeUtf8)

use arrow_schema::{DataType, Field, Schema};
use std::sync::Arc;

// Define a schema with a single text column for full-text search
let schema = Arc::new(Schema::new(vec![
    Field::new("doc", DataType::LargeUtf8, false),
    Field::new("id", DataType::UInt64, false),
]));

Multi-Valued Field (List of Strings)

use arrow_schema::{DataType, Field, Schema};
use std::sync::Arc;

// Each row contains a list of text segments (e.g., paragraphs or tags)
let schema = Arc::new(Schema::new(vec![
    Field::new(
        "paragraphs",
        DataType::List(Arc::new(Field::new("item", DataType::Utf8, true))),
        false,
    ),
    Field::new("id", DataType::UInt64, false),
]));

Complete Example from the Lance Repository

// From rust/examples/src/full_text_search.rs, line 47
use arrow_schema::{DataType, Field, Schema};
use std::sync::Arc;

let schema = Arc::new(Schema::new(vec![
    Field::new("doc", DataType::LargeUtf8, false),
    Field::new("__example_doc_id", DataType::UInt64, false),
]));

Related Pages

Implements Principle
