Principle:Lance format Lance Text Data Preparation
| Knowledge Sources | |
|---|---|
| Domains | Information_Retrieval, Full_Text_Search |
| Last Updated | 2026-02-08 19:00 GMT |
Overview
Text data preparation is the process of defining and validating the columnar schema so that text columns are stored in a Lance dataset using data types compatible with full-text search indexing.
Description
Before any full-text search index can be built, the underlying data must reside in a column whose Arrow data type is recognized by the inverted index builder. Lance validates the column type at index creation time and rejects unsupported types with a clear error message.
The supported Arrow data types for full-text search columns are:
- Utf8 -- standard variable-length UTF-8 string
- LargeUtf8 -- variable-length UTF-8 string with 64-bit offsets, suitable when the string data in a single array may exceed 2 GiB (the limit imposed by the 32-bit offsets of Utf8)
- LargeBinary -- raw binary data that is interpreted as text during tokenization
- List(Utf8) -- a list of UTF-8 strings per row (multi-valued field)
- List(LargeUtf8) -- a list of large UTF-8 strings per row
- LargeList(Utf8) -- a large list of UTF-8 strings per row (64-bit offsets for the list)
- LargeList(LargeUtf8) -- a large list of large UTF-8 strings per row
Choosing the correct type affects memory layout, maximum document size, and whether a single row can contain multiple independent text values (list types).
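The 2 GiB figure for Utf8 follows from Arrow's offset width: Utf8 stores signed 32-bit offsets, so one array can address at most 2^31 - 1 bytes of string data, while LargeUtf8 uses 64-bit offsets. The sketch below is plain Python that models only this arithmetic. `total_utf8_bytes` and `fits_in_utf8` are hypothetical helpers for illustration, not part of Lance or Arrow, and the limit is assumed to apply per array rather than to the whole column.

```python
# Illustrative model of Arrow string offset limits; not Lance or Arrow API.
MAX_UTF8_ARRAY_BYTES = 2**31 - 1  # largest value of a signed 32-bit offset


def total_utf8_bytes(values):
    """Total encoded size of the non-null strings in one array."""
    return sum(len(v.encode("utf-8")) for v in values if v is not None)


def fits_in_utf8(values):
    """True if the strings fit in a single Utf8 array (32-bit offsets)."""
    return total_utf8_bytes(values) <= MAX_UTF8_ARRAY_BYTES


# Non-ASCII characters take more than one byte, so count bytes, not characters.
docs = ["hello world", "na\u00efve caf\u00e9", None]
print(total_utf8_bytes(docs))
print(fits_in_utf8(docs))
```

If the check fails for realistic batch sizes, LargeUtf8 (or a list type with LargeUtf8 elements) is the safer choice.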
Usage
Use this principle whenever you are designing a Lance schema that will later be indexed for full-text search. The schema must be established before writing data and before calling create_index. Selecting an unsupported data type causes index creation to fail.
Common guidelines:
- For single-document columns with moderate size, use LargeUtf8 (the most common choice, as seen in the official examples).
- For columns where each row contains multiple text values (e.g., tags, paragraphs stored separately), use List(Utf8) or List(LargeUtf8).
- For raw byte content that represents text, use LargeBinary.
Theoretical Basis
Arrow Type System and Columnar Storage
Apache Arrow defines a strict type system for columnar data. Each column in a record batch carries a DataType that governs how the raw bytes are interpreted. Lance inherits this type system and adds validation at the index layer.
The type validation logic is implemented as a pattern match on the column's DataType:
    match field.data_type() {
        Utf8 | LargeUtf8 | LargeBinary          => OK
        List(f)      if f is Utf8 or LargeUtf8  => OK
        LargeList(f) if f is Utf8 or LargeUtf8  => OK
        _                                       => Error
    }
Because every type that is not explicitly listed falls through to the error arm, only columns containing textual or text-list data enter the tokenization pipeline.
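The type check described above can be modeled in plain Python to show which shapes pass and which are rejected. This is a minimal sketch, not Lance's implementation: `ArrowType` is a toy stand-in for an Arrow DataType, and `is_fts_compatible` is a hypothetical name.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ArrowType:
    """Toy stand-in for an Arrow DataType: a name plus an optional element type."""
    name: str
    item: Optional["ArrowType"] = None


TEXT_SCALARS = {"Utf8", "LargeUtf8"}          # allowed as list elements
FTS_SCALARS = TEXT_SCALARS | {"LargeBinary"}  # allowed as top-level scalars
LIST_KINDS = {"List", "LargeList"}


def is_fts_compatible(dtype):
    """Mirror the allowlist: text scalars, or one level of list over Utf8/LargeUtf8."""
    if dtype.item is None:
        return dtype.name in FTS_SCALARS
    if dtype.name in LIST_KINDS:
        element = dtype.item
        return element.item is None and element.name in TEXT_SCALARS
    return False  # everything else falls through to the error case


utf8 = ArrowType("Utf8")
print(is_fts_compatible(ArrowType("List", utf8)))   # accepted
print(is_fts_compatible(ArrowType("Int64")))        # rejected
```

Note that, following the list in this document, LargeBinary is accepted only as a top-level type and not as a list element.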
Multi-Valued Fields
When a column uses a list type (List(Utf8), etc.), the inverted index builder iterates over the elements of the list and tokenizes each one independently. All tokens from all list elements contribute to the same document's term frequencies. This enables scenarios such as tagging, multi-paragraph storage, or pre-split sentence columns.
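The pooling behavior can be illustrated with a small plain-Python sketch. The whitespace tokenizer here is a deliberately simple assumption and does not reflect Lance's actual tokenizer; `tokenize` and `term_frequencies` are hypothetical names.

```python
from collections import Counter


def tokenize(text):
    """Hypothetical tokenizer: lowercase, then split on whitespace."""
    return text.lower().split()


def term_frequencies(row):
    """Tokenize each list element on its own, then pool counts for the row."""
    counts = Counter()
    for element in row:
        if element is None:  # skip null list elements
            continue
        counts.update(tokenize(element))
    return counts


# One row of a List(Utf8) column holding two separate text values.
row = ["Lance columnar format", "columnar storage"]
tf = term_frequencies(row)
print(tf["columnar"])  # 2: one occurrence from each element
```

Each element is tokenized separately, but the resulting counts belong to a single document, so a term that appears in two elements has a term frequency of two for that row.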
Schema Immutability
The data type of a column is effectively fixed once data has been written with it. If the wrong data type is chosen, the dataset must be rewritten or a new column with a supported type must be added. Therefore, careful upfront planning of the text column type is essential.