Implementation:Lance format Lance Text Column Schema
| Knowledge Sources | |
|---|---|
| Domains | Information_Retrieval, Full_Text_Search |
| Last Updated | 2026-02-08 19:00 GMT |
Overview
Concrete pattern for defining text column schemas compatible with Lance full-text search indexing.
Description
This pattern describes the interface that users must follow when constructing an Arrow schema whose columns will be indexed for full-text search. The inverted index builder validates each column's DataType at training time and rejects unsupported types. Users must ensure their schema fields use one of the seven accepted data types before writing data to the dataset.
Usage
Apply this pattern whenever creating a new Lance dataset that will contain text data intended for full-text search. The schema definition happens before any data is written and before create_index is called.
Interface Specification
Source Location
- Repository: Lance
- File: rust/lance-index/src/scalar/inverted.rs
- Lines: 112-134 (type validation in new_training_request)
Accepted Type Pattern
The column field passed to the inverted index must match one of the following Arrow DataType variants:
| DataType | Description | Typical Use Case |
|---|---|---|
| DataType::Utf8 | Standard variable-length UTF-8 string | Short text fields, titles, labels |
| DataType::LargeUtf8 | UTF-8 string with 64-bit offsets | Full documents, articles, long-form text |
| DataType::LargeBinary | Raw binary interpreted as text | Pre-encoded or mixed-encoding content |
| DataType::List(Utf8) | List of UTF-8 strings per row | Tags, multi-paragraph fields |
| DataType::List(LargeUtf8) | List of large UTF-8 strings per row | Multiple large text segments per row |
| DataType::LargeList(Utf8) | Large list (64-bit offsets) of UTF-8 strings | Very large multi-valued text fields |
| DataType::LargeList(LargeUtf8) | Large list of large UTF-8 strings | Maximum-flexibility multi-valued text |
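The practical difference between Utf8 and LargeUtf8 in the table is offset width: an Arrow Utf8 array addresses its string bytes with signed 32-bit offsets, so one array can hold at most i32::MAX bytes of text, while LargeUtf8 uses 64-bit offsets. The sketch below shows one way to reason about that choice. The helper `suggest_string_type` and the constant `MAX_UTF8_ARRAY_BYTES` are hypothetical names introduced for illustration, not part of Lance or Arrow, and the check assumes all values end up in a single array (real writers usually split data into smaller batches).

```rust
/// Largest total payload, in bytes, that a single Utf8 array can address
/// with signed 32-bit offsets. (Hypothetical constant for this sketch.)
const MAX_UTF8_ARRAY_BYTES: u64 = i32::MAX as u64;

/// Suggests "LargeUtf8" when the combined byte length of the values would
/// overflow 32-bit offsets if stored in one array, and "Utf8" otherwise.
/// (Hypothetical helper; assumes a single, unchunked array.)
fn suggest_string_type(value_lengths: &[usize]) -> &'static str {
    // Sum in u64 so the total itself cannot overflow on 32-bit targets.
    let total: u64 = value_lengths.iter().map(|&n| n as u64).sum();
    if total > MAX_UTF8_ARRAY_BYTES {
        "LargeUtf8"
    } else {
        "Utf8"
    }
}
```

Both types are accepted by the inverted index, so this choice affects storage limits only, not whether the column can be indexed.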
Validation Logic
// From rust/lance-index/src/scalar/inverted.rs, lines 117-130
match field.data_type() {
    DataType::Utf8 | DataType::LargeUtf8 | DataType::LargeBinary => (),
    DataType::List(f) if matches!(f.data_type(), DataType::Utf8 | DataType::LargeUtf8) => (),
    DataType::LargeList(f) if matches!(f.data_type(), DataType::Utf8 | DataType::LargeUtf8) => (),
    _ => return Err(Error::InvalidInput {
        source: format!(
            "A inverted index can only be created on a Utf8 or LargeUtf8 field/list \
             or LargeBinary field. Column has type {:?}",
            field.data_type()
        ).into(),
        location: location!(),
    })
}
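Because the check above only runs when the index is trained, it can be useful to mirror it in a pre-flight test before any data is written. The following is a minimal sketch of that idea, not Lance's API. To stay free of external crates it models the relevant Arrow types with a small stand-in enum, `ColType` (hypothetical); in real code the same match would be written against arrow_schema::DataType. The helpers `is_fts_compatible` and `rejected_columns` are likewise assumptions introduced for illustration.

```rust
// Stand-in for the subset of arrow_schema::DataType that matters here.
// Hypothetical: exists only so the sketch compiles with the standard library.
#[derive(Debug, Clone, PartialEq)]
enum ColType {
    Utf8,
    LargeUtf8,
    LargeBinary,
    Binary,
    UInt64,
    List(Box<ColType>),
    LargeList(Box<ColType>),
}

/// Returns true when the type matches one of the seven accepted patterns,
/// following the same arms as the validation match shown above.
fn is_fts_compatible(t: &ColType) -> bool {
    match t {
        ColType::Utf8 | ColType::LargeUtf8 | ColType::LargeBinary => true,
        // Lists are accepted only when their item type is a string type.
        ColType::List(item) | ColType::LargeList(item) => {
            matches!(**item, ColType::Utf8 | ColType::LargeUtf8)
        }
        _ => false,
    }
}

/// Collects the names of columns that would be rejected if indexed.
fn rejected_columns<'a>(schema: &'a [(&'a str, ColType)]) -> Vec<&'a str> {
    schema
        .iter()
        .filter(|(_, t)| !is_fts_compatible(t))
        .map(|(name, _)| *name)
        .collect()
}
```

For a schema of ("doc", LargeUtf8) and ("id", UInt64), `rejected_columns` returns only "id". That is fine as long as "id" is never passed to the inverted index. Note that plain Binary falls through to the rejecting arm, since the source match lists only LargeBinary.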
Usage Examples
Single Document Column (LargeUtf8)
use arrow_schema::{DataType, Field, Schema};
use std::sync::Arc;
// Define a schema with a single text column for full-text search
let schema = Arc::new(Schema::new(vec![
    Field::new("doc", DataType::LargeUtf8, false),
    Field::new("id", DataType::UInt64, false),
]));
Multi-Valued Field (List of Strings)
use arrow_schema::{DataType, Field, Schema};
use std::sync::Arc;
// Each row contains a list of text segments (e.g., paragraphs or tags)
let schema = Arc::new(Schema::new(vec![
    Field::new(
        "paragraphs",
        DataType::List(Arc::new(Field::new("item", DataType::Utf8, true))),
        false,
    ),
    Field::new("id", DataType::UInt64, false),
]));
Complete Example from the Lance Repository
// From rust/examples/src/full_text_search.rs, line 47
use arrow_schema::{DataType, Field, Schema};
use std::sync::Arc;
let schema = Arc::new(Schema::new(vec![
    Field::new("doc", DataType::LargeUtf8, false),
    Field::new("__example_doc_id", DataType::UInt64, false),
]));