
Principle: Lance Data Ingestion

From Leeroopedia


Knowledge Sources
Domains Data_Engineering, Columnar_Storage
Last Updated 2026-02-08 19:00 GMT

Overview

Data ingestion is the process of adding new records to an existing Lance dataset, supporting both append and overwrite modes to accommodate incremental and bulk loading workflows.

Description

After a Lance dataset has been created, new data is ingested through a builder-based API that separates the concerns of where to write (the destination), how to write (the parameters), and what to write (the data stream). The InsertBuilder provides this separation, enabling fine-grained control over each stage.
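The separation of destination, parameters, and data can be sketched in a few lines. This is a conceptual illustration only, not the real API: Lance's actual `InsertBuilder` is a Rust type, and the class, method, and parameter names below are invented to mirror its shape.

```python
from dataclasses import dataclass, field

@dataclass
class WriteParams:
    """How to write (hypothetical parameter names for illustration)."""
    mode: str = "append"
    max_rows_per_file: int = 1_000_000

@dataclass
class InsertBuilder:
    """Where to write is fixed first; data arrives last, at execute time."""
    destination: str
    params: WriteParams = field(default_factory=WriteParams)

    def with_params(self, params: WriteParams) -> "InsertBuilder":
        self.params = params
        return self

    def execute(self, batches: list) -> dict:
        # What to write is supplied only here, keeping the three
        # concerns (destination, parameters, data) independent.
        return {
            "uri": self.destination,
            "mode": self.params.mode,
            "num_batches": len(batches),
        }

result = (
    InsertBuilder("s3://bucket/my_dataset.lance")
    .with_params(WriteParams(mode="append"))
    .execute([{"id": 1}, {"id": 2}])
)
```

Because the data stream is the last input, the same configured builder shape works for both in-memory batches and streaming readers.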

Lance supports three fundamental write modes through the WriteMode enum:

  • Create: Write to a new dataset. Fails if the dataset already exists.
  • Append: Add new data as additional fragments to an existing dataset. The schema of the new data must be compatible with the existing schema.
  • Overwrite: Replace the entire dataset with new data, creating a new version. If the dataset does not exist, it is created.
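The semantics of the three modes can be modeled in a few lines of plain Python. The in-memory `catalog` and the `write` function below are invented for illustration; only the per-mode behavior (Create fails on an existing dataset, Append extends one, Overwrite replaces or creates) comes from the description above.

```python
catalog: dict[str, list] = {}  # dataset URI -> list of fragments

def write(uri: str, fragments: list, mode: str) -> None:
    exists = uri in catalog
    if mode == "create":
        if exists:
            raise FileExistsError(uri)    # Create: fails if dataset exists
        catalog[uri] = list(fragments)
    elif mode == "append":
        if not exists:
            raise FileNotFoundError(uri)  # Append: needs an existing dataset
        catalog[uri].extend(fragments)    # new fragments join the old ones
    elif mode == "overwrite":
        catalog[uri] = list(fragments)    # replaces contents; creates if missing
    else:
        raise ValueError(mode)
```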

Each ingestion operation is atomic at the version level: either all fragments are committed as a new version, or the operation fails without side effects on the dataset metadata. This guarantees that concurrent readers always see a consistent snapshot.
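Version-level atomicity can be sketched as follows: a commit installs a complete new fragment list in a single step, so a reader that takes a snapshot only ever observes a whole version, never a partially written one. The `Dataset` class here is a minimal stdlib model of that guarantee, not Lance's implementation.

```python
import threading

class Dataset:
    def __init__(self):
        self._versions = [[]]        # version 0: empty fragment list
        self._lock = threading.Lock()

    def commit(self, new_fragments: list) -> None:
        with self._lock:
            latest = self._versions[-1]
            # Either the whole new version appears, or nothing does:
            # the new list is built first, then appended in one step.
            self._versions.append(latest + list(new_fragments))

    def snapshot(self) -> list:
        # Readers always see a complete, committed version.
        return list(self._versions[-1])
```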

Usage

Use the ingestion pattern when:

  • Continuously appending new data batches to a growing dataset (streaming or batch ETL).
  • Replacing dataset contents with a refreshed snapshot (periodic full loads).
  • Building a two-phase write pipeline where fragment writing and commit are separated for distributed execution.

Theoretical Basis

The ingestion pipeline follows a two-phase commit architecture:

  1. Phase 1 -- Fragment Writing: The InsertBuilder consumes the input data (either as Vec<RecordBatch> or a streaming RecordBatchReader) and writes one or more fragment files to storage. Each fragment is self-contained, holding its own column data and metadata. This phase can be distributed across multiple workers using the execute_uncommitted variant.
  2. Phase 2 -- Transaction Commit: Once all fragments are written, a Transaction is constructed that references the new fragments and their metadata. The CommitBuilder then atomically applies this transaction to the dataset, creating a new version. On conflict (another writer committed a version concurrently), the commit can be retried.
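The two phases above can be sketched as plain functions: phase 1 produces self-contained fragments (possibly on many workers), and phase 2 installs them all in one transaction. The function and field names are invented for illustration.

```python
def write_fragment(worker_id: int, rows: list) -> dict:
    # Phase 1: each worker writes a self-contained, uncommitted fragment
    # holding its own data and metadata.
    return {"id": f"frag-{worker_id}", "num_rows": len(rows)}

def commit(manifest: list, fragments: list, mode: str) -> list:
    # Phase 2: a single transaction produces the next version's manifest.
    if mode == "append":
        return manifest + fragments      # existing fragments untouched
    elif mode == "overwrite":
        return list(fragments)           # entire fragment list replaced
    raise ValueError(mode)

manifest: list = []
frags = [write_fragment(i, rows) for i, rows in enumerate([[1, 2], [3]])]
manifest = commit(manifest, frags, mode="append")
```

Note that `commit` never touches fragment data, which is why the commit step alone can be retried or deferred.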

For append mode, the transaction adds the new fragments to the existing manifest without modifying any existing fragments. For overwrite mode, the transaction replaces the entire fragment list. In both cases, previous versions remain accessible until cleanup is performed.

The two-phase design enables advanced patterns such as:

  • Distributed writes: Multiple workers each produce uncommitted fragments, which are then combined into a single commit.
  • Exactly-once semantics: By separating write from commit, the commit can be retried without rewriting data.
  • Conflict resolution: If a concurrent writer commits between phases 1 and 2, the operation can re-read the latest version and retry the commit.
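The retry pattern from the last two bullets can be sketched as follows: fragments written in phase 1 are reused across attempts, so a conflicted commit re-reads the latest version and tries again without rewriting any data. All names here are illustrative, not Lance's API.

```python
class Conflict(Exception):
    """Raised when another writer committed between read and commit."""

class Store:
    def __init__(self):
        self.version = 0
        self.fragments: list = []

    def try_commit(self, read_version: int, new_fragments: list) -> None:
        if read_version != self.version:   # a concurrent commit won the race
            raise Conflict
        self.fragments += new_fragments
        self.version += 1

def commit_with_retry(store: Store, new_fragments: list, max_retries: int = 5) -> int:
    for _ in range(max_retries):
        read_version = store.version       # re-read the latest version
        try:
            store.try_commit(read_version, new_fragments)
            return store.version
        except Conflict:
            continue                       # fragments already on disk; just retry
    raise RuntimeError("too many concurrent commits")
```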

Related Pages

Implemented By
