Principle:Lance format Lance Data Ingestion
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, Columnar_Storage |
| Last Updated | 2026-02-08 19:00 GMT |
Overview
Data ingestion is the process of adding new records to an existing Lance dataset, supporting both append and overwrite modes to accommodate incremental and bulk loading workflows.
Description
After a Lance dataset has been created, new data is ingested through a builder-based API that separates the concerns of where to write (the destination), how to write (the parameters), and what to write (the data stream). The InsertBuilder provides this separation, enabling fine-grained control over each stage.
Lance supports three fundamental write modes through the WriteMode enum:
- Create: Write to a new dataset. Fails if the dataset already exists.
- Append: Add new data as additional fragments to an existing dataset. The schema of the new data must be compatible with the existing schema.
- Overwrite: Replace the entire dataset with new data, creating a new version. If the dataset does not exist, it is created.
Each ingestion operation is atomic at the version level: either all fragments are committed as a new version, or the operation fails without side effects on the dataset metadata. This guarantees that concurrent readers always see a consistent snapshot.
Usage
Use the ingestion pattern when:
- Continuously appending new data batches to a growing dataset (streaming or batch ETL).
- Replacing dataset contents with a refreshed snapshot (periodic full loads).
- Building a two-phase write pipeline where fragment writing and commit are separated for distributed execution.
Theoretical Basis
The ingestion pipeline follows a two-phase commit architecture:
- Phase 1 -- Fragment Writing: The
InsertBuilderconsumes the input data (either asVec<RecordBatch>or a streamingRecordBatchReader) and writes one or more fragment files to storage. Each fragment is self-contained, holding its own column data and metadata. This phase can be distributed across multiple workers using theexecute_uncommittedvariant.
- Phase 2 -- Transaction Commit: Once all fragments are written, a
Transactionis constructed that references the new fragments and their metadata. TheCommitBuilderthen atomically applies this transaction to the dataset, creating a new version. On conflict (another writer committed a version concurrently), the commit can be retried.
For append mode, the transaction adds the new fragments to the existing manifest without modifying any existing fragments. For overwrite mode, the transaction replaces the entire fragment list. In both cases, previous versions remain accessible until cleanup is performed.
The two-phase design enables advanced patterns such as:
- Distributed writes: Multiple workers each produce uncommitted fragments, which are then combined into a single commit.
- Exactly-once semantics: By separating write from commit, the commit can be retried without rewriting data.
- Conflict resolution: If a concurrent writer commits between phases 1 and 2, the operation can re-read the latest version and retry the commit.