Workflow:Lance format Lance Dataset Lifecycle
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, ML_Ops, Columnar_Storage |
| Last Updated | 2026-02-08 19:00 GMT |
Overview
End-to-end process for creating a Lance dataset and then writing, reading, updating, and deleting its data. Lance is a columnar format optimized for ML workflows.
Description
This workflow covers the complete lifecycle of a Lance dataset from initial creation to ongoing mutation. Lance datasets store data in a columnar format with automatic versioning, meaning every write operation creates a new immutable version. The process spans schema definition, data ingestion from various sources (Arrow tables, Pandas DataFrames, record batch iterators), scanning with predicate pushdown and column projection, row-level updates and deletes using SQL expressions, merge-insert (upsert) operations for bulk data synchronization, and schema evolution.
Usage
Execute this workflow when you need to create a new dataset for ML training or inference, ingest data from external sources (Parquet, CSV, HuggingFace datasets), perform CRUD operations on existing datasets, or build data pipelines that require efficient random access and filtered scans over large columnar data.
Execution Steps
Step 1: Schema Definition and Dataset Creation
Define the Arrow schema describing the dataset's columns and data types. This includes specifying vector columns for embeddings, nested types for structured data, and metadata fields. Create the dataset by writing an initial batch of data to a URI (local filesystem, S3, GCS, or Azure Blob Storage). The write operation produces a manifest file that tracks the schema, fragments, and version history.
Key considerations:
- Choose appropriate data types for each column (e.g., FixedSizeList for embeddings, LargeBinary for blobs)
- Select the storage URI based on deployment target (local for development, object store for production)
- The first write establishes version 1 of the dataset
Step 2: Data Ingestion
Write data into the dataset using one of several modes: create (new dataset), append (add rows to existing), or overwrite (replace all data). Data sources include Arrow Tables, Pandas DataFrames, PyArrow RecordBatch iterators for streaming large datasets, and Parquet files. For large-scale ingestion, use streaming iterators to avoid loading all data into memory simultaneously.
Key considerations:
- Use iterator-based writes for datasets larger than available RAM
- Append mode preserves existing data and creates a new version
- Overwrite mode replaces all fragments but preserves version history
- Each write operation is atomic and creates a new dataset version
Step 3: Data Scanning and Reading
Read data from the dataset using scans with optional column projection, row filtering, and batch size control. Lance pushes filters down to the storage layer to minimize I/O. For random access patterns, use the take operation to retrieve specific rows by index. Scans return data as Arrow RecordBatches for zero-copy integration with downstream processing.
Key considerations:
- Always project only the columns you need to reduce I/O
- Use SQL-style filter predicates to push filtering into the scan
- Configure batch size based on available memory and processing requirements
- Use batch iterators for processing datasets larger than memory
Step 4: Row-Level Updates
Modify existing rows using SQL expressions or literal values. Updates can target specific rows via a WHERE predicate or apply to the entire dataset. Lance supports expression-based updates (e.g., incrementing a counter column) and merge-insert (upsert) operations that conditionally insert, update, or delete rows based on a join key.
Key considerations:
- Updates create new dataset versions without modifying existing data files
- Use merge-insert for bulk synchronization with external data sources
- Expression-based updates are evaluated by Lance itself, so client code does not need to read rows, modify them, and write them back
- Rows superseded by an update are marked in deletion vectors rather than physically removed
Step 5: Row Deletion
Remove rows matching a SQL predicate. Deletions are recorded as deletion vectors attached to fragments rather than physically removing data. This makes deletes fast, because no data files are rewritten, but the deleted rows keep occupying storage until they are compacted. That cleanup is handled by the compaction step in the Table Optimization workflow.
Key considerations:
- Deletions are logical, not physical; data files remain unchanged
- Deleted rows are excluded from subsequent scans automatically
- High deletion ratios reduce scan performance until compaction runs
- Deletion predicates support the same SQL syntax as scan filters
Step 6: Schema Evolution
Evolve the dataset schema by adding new columns, dropping columns, renaming columns, or casting data types. Adding columns can include backfilling with SQL expressions or Python UDFs. Dropping and renaming columns are metadata-only operations that do not rewrite data files.
Key considerations:
- Adding columns with expressions avoids full dataset rewrites
- Use Python UDFs with checkpointing for expensive backfill operations (e.g., computing embeddings)
- Column drops are metadata-only and take effect immediately
- Type casting rewrites the data of the affected column; other columns are left untouched