Principle:Lance format Lance Schema Definition And Dataset Creation

From Leeroopedia


Knowledge Sources
Domains: Data_Engineering, Columnar_Storage
Last Updated: 2026-02-08 19:00 GMT

Overview

Schema definition and dataset creation is the foundational step in the Lance dataset lifecycle, where an Arrow schema is bound to a storage destination and the initial data is persisted as a versioned Lance dataset.

Description

In the Lance columnar format, every dataset begins with a schema derived from Apache Arrow. The schema defines the column names, data types, and nullability constraints that govern all subsequent reads and writes. When a dataset is created, Lance writes the incoming data as one or more fragments (data files), each containing row groups organized by the schema. A manifest is then committed to track the schema, fragment metadata, and version history.

The creation process accepts a stream of Arrow RecordBatch objects, a destination (either a URI string or an existing dataset reference via WriteDestination), and optional WriteParams that control physical layout and write behavior. Key parameters include:

  • max_rows_per_file (default 1,048,576): Controls how many rows are written per data file.
  • max_rows_per_group (default 1,024): Controls the row group granularity within each file.
  • mode (WriteMode::Create): Creates a new dataset and fails if one already exists at the destination.
  • enable_v2_manifest_paths: Uses the V2 manifest path naming scheme, which lets the latest manifest be located in constant time on object stores.
  • enable_stable_row_ids: Assigns row IDs that survive compaction.
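The effect of the two layout parameters can be illustrated with a little arithmetic. The following is a minimal sketch in plain Python, not Lance code: LayoutParams and plan_layout are hypothetical names introduced here, and the sketch assumes rows are split strictly by count, whereas a real writer may also be influenced by batch boundaries and file format version.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LayoutParams:
    """Hypothetical stand-in for the two layout fields of WriteParams."""
    max_rows_per_file: int = 1_048_576  # default quoted in the list above
    max_rows_per_group: int = 1_024     # default quoted in the list above


def plan_layout(total_rows: int, params: LayoutParams = LayoutParams()) -> list:
    """Return one list per data file, holding the size of each row group."""
    if total_rows < 0:
        raise ValueError("total_rows must be non-negative")
    files = []
    remaining = total_rows
    while remaining > 0:
        # Fill the current file up to max_rows_per_file.
        rows_in_file = min(remaining, params.max_rows_per_file)
        groups = []
        left = rows_in_file
        while left > 0:
            # Fill the current row group up to max_rows_per_group.
            group_size = min(left, params.max_rows_per_group)
            groups.append(group_size)
            left -= group_size
        files.append(groups)
        remaining -= rows_in_file
    return files
```

Under these assumptions, plan_layout(2_500_000) yields three files: two full files of 1,024 row groups each, and a final file holding the remaining 402,848 rows.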

Lance supports local filesystems and cloud object stores (S3, GCS, Azure Blob) through a unified storage abstraction.

Usage

Use this approach when:

  • Creating a brand-new Lance dataset from Arrow data produced by ETL pipelines, ML feature stores, or data generators.
  • Initializing a dataset with a specific schema that will later receive appended data.
  • Setting up a versioned dataset on cloud object storage for collaborative ML workflows.

Theoretical Basis

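Dataset creation in Lance follows a write-then-commit pattern: data files are written first and become visible to readers only once a manifest that references them has been committed.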

  1. Fragment writing: Incoming RecordBatch rows are partitioned into fragments according to max_rows_per_file. Within each fragment, rows are further grouped into row groups of size max_rows_per_group. Each fragment is written as an independent Lance file using the configured encoding (e.g., dictionary, bit-packing, FSST for strings).
  2. Manifest commit: After all fragments are successfully written, a manifest is atomically committed. The manifest records the full schema, a list of fragment references, and metadata such as version number and timestamp. On storage backends that support an atomic rename or put-if-not-exists operation (e.g., local filesystems, GCS), that operation is used for the commit. On S3, which historically lacked such a primitive, an external commit handler such as a DynamoDB-based lock has been required for safe concurrent writes.
  3. Versioning: Each successful write creates a new version. The previous version remains accessible for time-travel queries until explicitly cleaned up.
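The three steps above can be sketched on a local filesystem using only the Python standard library. This is an illustrative model, not Lance's implementation: the function names, the JSON file contents, and the data/ and _versions/ directory names are assumptions made for the example. os.link stands in for a put-if-not-exists commit because it refuses to overwrite an existing target.

```python
import json
import os
import tempfile
import uuid
from pathlib import Path


def write_fragment(root: Path, rows: list) -> str:
    """Step 1: write a data file. No reader sees it until a manifest lists it."""
    data_dir = root / "data"
    data_dir.mkdir(parents=True, exist_ok=True)
    name = f"{uuid.uuid4().hex}.json"  # JSON stands in for a Lance data file
    (data_dir / name).write_text(json.dumps(rows))
    return name


def latest_version(root: Path) -> int:
    """Return the highest committed version number, or 0 if none exists."""
    versions_dir = root / "_versions"
    if not versions_dir.exists():
        return 0
    versions = [int(p.stem) for p in versions_dir.glob("*.manifest")]
    return max(versions, default=0)


def commit(root: Path, schema: dict, fragments: list) -> int:
    """Step 2: publish a manifest with put-if-not-exists semantics."""
    versions_dir = root / "_versions"
    versions_dir.mkdir(parents=True, exist_ok=True)
    version = latest_version(root) + 1
    manifest = {"version": version, "schema": schema, "fragments": fragments}
    tmp = versions_dir / f"tmp-{uuid.uuid4().hex}"
    tmp.write_text(json.dumps(manifest))
    try:
        # os.link raises FileExistsError if another writer already claimed
        # this version, so a committed manifest is never overwritten.
        os.link(tmp, versions_dir / f"{version}.manifest")
    finally:
        tmp.unlink()
    return version


def read_version(root: Path, version: int) -> list:
    """Step 3: read the rows visible at a given version (time travel)."""
    path = root / "_versions" / f"{version}.manifest"
    manifest = json.loads(path.read_text())
    rows = []
    for name in manifest["fragments"]:
        rows.extend(json.loads((root / "data" / name).read_text()))
    return rows


def demo() -> tuple:
    """Create version 1, add a fragment in version 2, then read both."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        root = Path(tmp_dir) / "demo.lance"
        schema = {"id": "int64"}
        frag_a = write_fragment(root, [{"id": 1}, {"id": 2}])
        v1 = commit(root, schema, [frag_a])
        frag_b = write_fragment(root, [{"id": 3}])
        v2 = commit(root, schema, [frag_a, frag_b])
        return v1, v2, read_version(root, v1), read_version(root, v2)
```

In this model a crash between write_fragment and commit leaves only an orphaned data file, never a half-visible dataset, and version 1 stays readable after version 2 is committed because neither its manifest nor its fragment is modified.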

The schema itself is stored in a Lance-native format that extends the Arrow schema with additional metadata (field IDs, encoding hints, storage class markers). Because data files can refer to columns by field ID rather than by name, Lance can evolve schemas across versions without rewriting existing data files.
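Why field IDs help with schema evolution can be shown with a small model. The sketch below is a simplification under stated assumptions, not Lance's on-disk structure: Field, DATA_FILE, read, rename, and drop are hypothetical names, and the data file is represented as a dictionary keyed by field ID.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    field_id: int  # stable identifier, never reused
    name: str      # user-facing name, free to change between versions
    dtype: str


# A data file written once; columns are keyed by field ID, not by name.
DATA_FILE = {0: [1, 2, 3], 1: ["a", "b", "c"]}


def read(schema: list, data_file: dict) -> dict:
    """Resolve each schema field to its stored column via the field ID."""
    return {f.name: data_file[f.field_id] for f in schema}


def rename(schema: list, old: str, new: str) -> list:
    """Return a new schema version with one field renamed; data is untouched."""
    if not any(f.name == old for f in schema):
        raise KeyError(old)
    return [Field(f.field_id, new, f.dtype) if f.name == old else f
            for f in schema]


def drop(schema: list, name: str) -> list:
    """Return a new schema version without the field; its bytes stay on disk."""
    return [f for f in schema if f.name != name]
```

Renaming label to category or dropping id only produces a new schema; DATA_FILE is never modified, and the older schema version still reads the same file correctly.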
