Workflow:Lance format Lance Dataset Lifecycle

Knowledge Sources	Lance Lance Docs Read and Write Guide
Domains	Data_Engineering, ML_Ops, Columnar_Storage
Last Updated	2026-02-08 19:00 GMT

Overview

End-to-end process for creating, writing, reading, updating, and deleting data in a Lance dataset using the columnar format optimized for ML workflows.

Description

This workflow covers the complete lifecycle of a Lance dataset from initial creation to ongoing mutation. Lance datasets store data in a columnar format with automatic versioning, meaning every write operation creates a new immutable version. The process spans schema definition; data ingestion from various sources (Arrow tables, Pandas DataFrames, record batch iterators); scanning with predicate pushdown and column projection; row-level updates and deletes using SQL expressions; merge-insert (upsert) operations for bulk data synchronization; and schema evolution.

Usage

Execute this workflow when you need to create a new dataset for ML training or inference, ingest data from external sources (Parquet, CSV, HuggingFace datasets), perform CRUD operations on existing datasets, or build data pipelines that require efficient random access and filtered scans over large columnar data.

Execution Steps

Step 1: Schema Definition and Dataset Creation

Define the Arrow schema describing the dataset's columns and data types. This includes specifying vector columns for embeddings, nested types for structured data, and metadata fields. Create the dataset by writing an initial batch of data to a URI (local filesystem, S3, GCS, or Azure Blob Storage). The write operation produces a manifest file that tracks the schema, fragments, and version history.

Key considerations:

Choose appropriate data types for each column (e.g., FixedSizeList for embeddings, LargeBinary for blobs)
Select the storage URI based on deployment target (local for development, object store for production)
The first write establishes version 1 of the dataset

Step 2: Data Ingestion

Write data into the dataset using one of several modes: create (new dataset), append (add rows to existing), or overwrite (replace all data). Data sources include Arrow Tables, Pandas DataFrames, PyArrow RecordBatch iterators for streaming large datasets, and Parquet files. For large-scale ingestion, use streaming iterators to avoid loading all data into memory simultaneously.

Key considerations:

Use iterator-based writes for datasets larger than available RAM
Append mode preserves existing data and creates a new version
Overwrite mode replaces all fragments but preserves version history
Each write operation is atomic and creates a new dataset version

Step 3: Data Scanning and Reading

Read data from the dataset using scans with optional column projection, row filtering, and batch size control. Lance pushes filters down to the storage layer to minimize I/O. For random access patterns, use the take operation to retrieve specific rows by index. Scans return data as Arrow RecordBatches for zero-copy integration with downstream processing.

Key considerations:

Always project only the columns you need to reduce I/O
Use SQL-style filter predicates to push filtering into the scan
Configure batch size based on available memory and processing requirements
Use batch iterators for processing datasets larger than memory

Step 4: Row-Level Updates

Modify existing rows using SQL expressions or literal values. Updates can target specific rows via a WHERE predicate or apply to the entire dataset. Lance supports expression-based updates (e.g., incrementing a counter column) and merge-insert (upsert) operations that conditionally insert, update, or delete rows based on a join key.

Key considerations:

Updates create new dataset versions without modifying existing data files
Use merge-insert for bulk synchronization with external data sources
Expression-based updates are evaluated inside Lance, so affected rows do not have to be loaded into application code and written back
Rows superseded by an update are marked in deletion vectors rather than physically removed

Step 5: Row Deletion

Remove rows matching a SQL predicate. Deletions are recorded as deletion vectors attached to fragments rather than by physically removing data. This makes deletes fast, because no data files are rewritten, but deleted rows keep occupying storage until the compaction step in the Table Optimization workflow reclaims it.

Key considerations:

Deletions are logical, not physical; data files remain unchanged
Deleted rows are excluded from subsequent scans automatically
High deletion ratios reduce scan performance until compaction runs
Deletion predicates support the same SQL syntax as scan filters

Step 6: Schema Evolution

Evolve the dataset schema by adding new columns, dropping columns, renaming columns, or casting data types. Adding columns can include backfilling with SQL expressions or Python UDFs. Dropping and renaming columns are metadata-only operations that do not rewrite data files.

Key considerations:

Adding columns with expressions avoids full dataset rewrites
Use Python UDFs with checkpointing for expensive backfill operations (e.g., computing embeddings)
Column drops are metadata-only and take effect immediately
Type casting rewrites affected column data in new fragments
