
Principle:Lance format Lance Text Data Preparation

From Leeroopedia
Revision as of 18:07, 16 February 2026 by Admin (talk | contribs) (Auto-imported from principles/Lance_format_Lance_Text_Data_Preparation.md)


Knowledge Sources
Domains: Information_Retrieval, Full_Text_Search
Last Updated: 2026-02-08 19:00 GMT

Overview

Text data preparation is the process of defining and validating the columnar schema so that text columns are stored in a Lance dataset using data types compatible with full-text search indexing.

Description

Before any full-text search index can be built, the underlying data must reside in a column whose Arrow data type is recognized by the inverted index builder. Lance validates the column type at index creation time and rejects unsupported types with a clear error message.

The supported Arrow data types for full-text search columns are:

  • Utf8 -- standard variable-length UTF-8 string
  • LargeUtf8 -- variable-length UTF-8 string with 64-bit offsets, suitable for columns whose combined string data in a single array may exceed 2 GiB (the limit imposed by the 32-bit offsets of Utf8)
  • LargeBinary -- raw binary data that is interpreted as text during tokenization
  • List(Utf8) -- a list of UTF-8 strings per row (multi-valued field)
  • List(LargeUtf8) -- a list of large UTF-8 strings per row
  • LargeList(Utf8) -- a large list of UTF-8 strings per row (64-bit offsets for the list)
  • LargeList(LargeUtf8) -- a large list of large UTF-8 strings per row

Choosing the correct type affects memory layout, maximum document size, and whether a single row can contain multiple independent text values (list types).

Usage

Use this principle whenever you are designing a Lance schema that will later be indexed for full-text search. The schema must be established before writing data and before calling create_index. Selecting an unsupported data type causes index creation to fail.

Common guidelines:

  • For columns that hold one document of moderate size per row, use LargeUtf8 (the most common choice, as seen in the official examples).
  • For columns where each row contains multiple text values (e.g., tags, paragraphs stored separately), use List(Utf8) or List(LargeUtf8).
  • For raw byte content that represents text, use LargeBinary.

Theoretical Basis

Arrow Type System and Columnar Storage

Apache Arrow defines a strict type system for columnar data. Each column in a record batch carries a DataType that governs how the raw bytes are interpreted. Lance inherits this type system and adds validation at the index layer.

The type validation logic is implemented as a pattern match on the column's DataType, summarized here in Rust-style pseudocode:

match field.data_type() {
    Utf8 | LargeUtf8 | LargeBinary              => OK
    List(f)      if f is Utf8 or LargeUtf8       => OK
    LargeList(f) if f is Utf8 or LargeUtf8       => OK
    _                                             => Error
}

Because the final wildcard arm rejects every type not listed above it, only columns containing textual or text-list data enter the tokenization pipeline.

Multi-Valued Fields

When a column uses a list type (List(Utf8), etc.), the inverted index builder iterates over the elements of the list and tokenizes each one independently. The tokens from all list elements contribute to the term frequencies of the same document. This enables scenarios such as tagging, multi-paragraph storage, or pre-split sentence columns.
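The aggregation described above can be sketched in a few lines of plain Python. The tokenizer here (lowercase, split on whitespace) is an assumed stand-in, not Lance's actual tokenizer, and the function names are hypothetical; the point is only that every list element is tokenized separately while the counts accumulate into one per-row total.

```python
from collections import Counter


def tokenize(text):
    # Assumed stand-in tokenizer: lowercase and split on whitespace.
    return text.lower().split()


def term_frequencies(values):
    """Tokenize each list element on its own and merge the counts for one row."""
    counts = Counter()
    for value in values:
        counts.update(tokenize(value))
    return counts


# One row of a hypothetical List(Utf8) column.
row = ["Lance stores columns", "columns hold text"]
tf = term_frequencies(row)
```

Here the term "columns" appears once in each element, so its frequency for the row is 2, exactly as if both elements belonged to a single document.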

Schema Immutability

A column's data type is effectively fixed once the dataset is written. If the wrong data type is chosen, the column must be rewritten with the correct type or a new column must be added. Therefore, careful upfront planning of the text column type is essential.

Related Pages

Implemented By
