Principle:Lance format Lance Schema Definition And Dataset Creation
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, Columnar_Storage |
| Last Updated | 2026-02-08 19:00 GMT |
Overview
Schema definition and dataset creation is the foundational step in the Lance dataset lifecycle, where an Arrow schema is bound to a storage destination and the initial data is persisted as a versioned Lance dataset.
Description
In the Lance columnar format, every dataset begins with a schema derived from Apache Arrow. The schema defines the column names, data types, and nullability constraints that govern all subsequent reads and writes. When a dataset is created, Lance writes the incoming data as one or more fragments (data files), each containing row groups organized by the schema. A manifest is then committed to track the schema, fragment metadata, and version history.
The creation process accepts a stream of Arrow RecordBatch objects, a destination (either a URI string or an existing dataset reference via WriteDestination), and optional WriteParams that control physical layout and write behavior. Key parameters include:
- max_rows_per_file (default 1,048,576): Controls how many rows are written per data file.
- max_rows_per_group (default 1,024): Controls the row group granularity within each file.
- mode (WriteMode::Create): Ensures the dataset does not already exist.
- enable_v2_manifest_paths: Enables constant-time manifest lookups on object stores.
- enable_stable_row_ids: Assigns row IDs that survive compaction.
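The effect of the two layout parameters above can be sketched with plain arithmetic. The following is a minimal illustration only, not Lance's writer: the function name `plan_layout` is hypothetical, and a real writer may apply additional limits that this sketch ignores. It uses the defaults listed above.

```python
from math import ceil


def plan_layout(total_rows, max_rows_per_file=1_048_576, max_rows_per_group=1_024):
    """Return one (rows_in_file, row_groups_in_file) pair per data file.

    Hypothetical helper: it only models the row-count limits described
    in the text, nothing else about the physical layout.
    """
    if total_rows < 0:
        raise ValueError("total_rows must be non-negative")
    if max_rows_per_file <= 0 or max_rows_per_group <= 0:
        raise ValueError("row limits must be positive")

    layout = []
    remaining = total_rows
    while remaining > 0:
        # Each file takes at most max_rows_per_file rows.
        rows = min(remaining, max_rows_per_file)
        # Within a file, rows are split into groups of max_rows_per_group;
        # the last group may be partially filled.
        layout.append((rows, ceil(rows / max_rows_per_group)))
        remaining -= rows
    return layout


if __name__ == "__main__":
    for rows, groups in plan_layout(2_500_000):
        print(f"file with {rows} rows in {groups} row groups")
```

Under these assumptions, 2,500,000 input rows would fill two complete files and leave a third, partially filled file.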
Lance supports local filesystems and cloud object stores (S3, GCS, Azure Blob) through a unified storage abstraction.
Usage
Use this approach when:
- Creating a brand-new Lance dataset from Arrow data produced by ETL pipelines, ML feature stores, or data generators.
- Initializing a dataset with a specific schema that will later receive appended data.
- Setting up a versioned dataset on cloud object storage for collaborative ML workflows.
Theoretical Basis
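Dataset creation in Lance follows a write-ahead, commit-after pattern, in which data files are written first and only become visible once a manifest referencing them is committed: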
- Fragment writing: Incoming RecordBatch rows are partitioned into fragments according to max_rows_per_file. Within each fragment, rows are further grouped into row groups of size max_rows_per_group. Each fragment is written as an independent Lance file using the configured encoding (e.g., dictionary, bit-packing, FSST for strings).
- Manifest commit: After all fragments are successfully written, a manifest is atomically committed. The manifest records the full schema, a list of fragment references, and metadata such as version number and timestamp. On storage backends that support atomic rename (e.g., local FS, GCS), a rename-based commit is used. On S3, an external commit handler or DynamoDB-based lock is required.
- Versioning: Each successful write creates a new version. The previous version remains accessible for time-travel queries until explicitly cleaned up.
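The commit and versioning steps above can be illustrated with a small standard-library sketch. It is not Lance's commit protocol: the `versions` directory, the JSON manifest layout, and the function names are all hypothetical, and Python's exclusive-create file mode stands in for the atomic rename or external lock described above. The point it shows is that a version becomes visible only when its manifest is published, and that a second writer targeting the same version fails instead of overwriting.

```python
import json
import os
import tempfile


def commit_version(dataset_dir, version, schema, fragments):
    """Publish the manifest for `version`; raise if that version already exists."""
    versions_dir = os.path.join(dataset_dir, "versions")  # hypothetical layout
    os.makedirs(versions_dir, exist_ok=True)
    manifest = {"version": version, "schema": schema, "fragments": fragments}
    path = os.path.join(versions_dir, f"{version}.manifest.json")
    # Mode "x" is exclusive create: it raises FileExistsError when another
    # writer has already published this version number.
    with open(path, "x") as f:
        json.dump(manifest, f)
    return path


def latest_version(dataset_dir):
    """Return the highest committed version number."""
    names = os.listdir(os.path.join(dataset_dir, "versions"))
    return max(int(name.split(".")[0]) for name in names)


def read_version(dataset_dir, version):
    """Load the manifest of any committed version (time travel)."""
    path = os.path.join(dataset_dir, "versions", f"{version}.manifest.json")
    with open(path) as f:
        return json.load(f)


if __name__ == "__main__":
    schema = [{"name": "id", "type": "int64", "nullable": False}]
    with tempfile.TemporaryDirectory() as root:
        commit_version(root, 1, schema, ["data/0.lance"])
        commit_version(root, 2, schema, ["data/0.lance", "data/1.lance"])
        print("latest:", latest_version(root))
        print("v1 fragments:", read_version(root, 1)["fragments"])
        try:
            commit_version(root, 2, schema, [])  # a conflicting writer
        except FileExistsError:
            print("version 2 already committed; writer must retry")
```

Because older manifests are never rewritten in this sketch, reading version 1 after version 2 has been committed still returns the original fragment list.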
The schema itself is stored in a Lance-native format that extends the Arrow schema with additional metadata fields (field IDs, encoding hints, storage class markers). This allows Lance to evolve schemas across versions without rewriting existing data files.
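To see why field IDs make this possible, consider the following conceptual sketch. It is an assumption-laden toy model, not Lance's on-disk representation: the `Field` class, the dictionary standing in for a data file, and `read_rows` are all hypothetical. Columns in the data file are keyed by field ID, so renaming a column or adding a new one changes only the schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    field_id: int  # stable identifier, never reused
    name: str      # may change between schema versions
    dtype: str
    nullable: bool = True


def read_rows(schema, data_file):
    """Project a data file through a schema.

    `data_file` maps field_id -> list of values. Fields that the file
    does not contain (added after it was written) are read as None.
    """
    num_rows = max((len(col) for col in data_file.values()), default=0)
    rows = []
    for i in range(num_rows):
        row = {}
        for field in schema:
            column = data_file.get(field.field_id)
            row[field.name] = column[i] if column is not None else None
        rows.append(row)
    return rows


# A data file written under schema v1; it is never modified below.
data_file = {0: [1, 2, 3], 1: ["a", "b", "c"]}

schema_v1 = [Field(0, "id", "int64", nullable=False), Field(1, "label", "string")]
# Schema v2 renames field 1 and adds field 2, without touching the file.
schema_v2 = [
    Field(0, "id", "int64", nullable=False),
    Field(1, "category", "string"),
    Field(2, "score", "float64"),
]

if __name__ == "__main__":
    print(read_rows(schema_v1, data_file)[0])
    print(read_rows(schema_v2, data_file)[0])
```

In this model the same file serves both schema versions: the renamed column resolves through its unchanged ID, and the newly added column reads as null for rows written before it existed.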