Principle:Lance format Lance Schema Evolution
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, Columnar_Storage |
| Last Updated | 2026-02-08 19:00 GMT |
Overview
Schema evolution is the process of modifying a Lance dataset's schema by adding, altering, or dropping columns without requiring a full rewrite of existing data.
Description
As ML datasets evolve over time, their schemas must adapt to new features, renamed fields, and changed data types. Lance supports three schema evolution operations:
Adding columns (add_columns): New columns are appended to the schema using one of several transform strategies:
- BatchUDF: A user-defined function that receives existing rows and produces new column values.
- SqlExpressions: SQL expressions that compute new column values from existing columns.
- Stream: A pre-computed stream of RecordBatches containing the new column data.
- Reader: A RecordBatchReader providing the new column data.
- AllNulls: Adds columns whose values are all null. This is a schema-only change: no column data is written when the columns are added, and values can be filled in later.
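The two most commonly used strategies can be sketched as follows. This is a minimal illustration that assumes the Python `lance` package (`pylance`) and `pyarrow` are installed. The column names (`value`, `value_doubled`, `value_squared`) and the helper functions are hypothetical. The optional dependencies are imported inside the functions that need them so that the helper definitions load without them.

```python
def add_sql_column(ds):
    """SqlExpressions strategy: map each new column name to a SQL expression."""
    ds.add_columns({"value_doubled": "value * 2"})


def add_udf_column(ds):
    """BatchUDF strategy: compute a new column from each batch of existing rows."""
    import lance
    import pyarrow as pa
    import pyarrow.compute as pc

    @lance.batch_udf()
    def square(batch):
        # The UDF receives a RecordBatch and returns only the new column(s).
        values = batch.column("value")
        return pa.record_batch([pc.multiply(values, values)], names=["value_squared"])

    # read_columns limits the input to the columns the UDF actually needs.
    ds.add_columns(square, read_columns=["value"])


def demo(uri):
    """Create a small dataset at `uri` (a hypothetical path) and add two columns."""
    import lance
    import pyarrow as pa

    table = pa.table({"id": [1, 2, 3], "value": [1.0, 2.0, 3.0]})
    ds = lance.write_dataset(table, uri)
    add_sql_column(ds)
    add_udf_column(ds)
    return ds.schema.names
```

Calling `demo` with a fresh directory path would be expected to produce a schema containing the two original columns plus the two derived ones.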
Altering columns (alter_columns): Existing columns can be renamed, have their nullability changed, or be cast to a different data type. Renames and nullability changes are zero-copy, metadata-only operations that preserve existing indices. Data type changes require rewriting the data of the affected column.
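The three kinds of alteration might look like this in the Python API, where each alteration is a dictionary keyed by the column `path`. This is a sketch rather than a complete reference: the helper names are hypothetical, `ds` is assumed to be an open `lance` dataset, and `arrow_type` is assumed to be a `pyarrow` data type such as `pa.float64()`.

```python
def rename_column(ds, old_name, new_name):
    """Metadata-only: change the field name in the schema."""
    ds.alter_columns({"path": old_name, "name": new_name})


def make_nullable(ds, column):
    """Metadata-only: mark the field as nullable in the schema."""
    ds.alter_columns({"path": column, "nullable": True})


def cast_column(ds, column, arrow_type):
    """Rewrite: the column's data is re-encoded with the new type."""
    ds.alter_columns({"path": column, "data_type": arrow_type})
```

With a real dataset, a widening cast could be requested as `cast_column(ds, "value", pa.float64())`. Several alteration dictionaries can also be passed to a single `alter_columns` call.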
Dropping columns (drop_columns): Columns are removed from the schema. This is a metadata-only operation; the physical column data remains in storage until compaction and cleanup are performed.
All schema evolution operations create a new dataset version. Because they modify the schema, they may conflict with most other concurrent write operations, so they are best scheduled during periods of low write activity.
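A small sketch of dropping columns while observing the version change is shown below. It assumes `ds` is an open `lance` dataset exposing `drop_columns` and a `version` attribute. The helper name is hypothetical.

```python
def drop_and_report_versions(ds, columns):
    """Drop columns and return (version_before, version_after)."""
    before = ds.version
    # Metadata-only: the column files stay on disk until compaction and cleanup.
    ds.drop_columns(list(columns))
    return before, ds.version
```

If the drop succeeds, the second value should be greater than the first, reflecting the new dataset version created by the operation.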
Usage
Use schema evolution when:
- Adding computed feature columns to an ML training dataset (e.g., embeddings, normalized values).
- Renaming columns to follow updated naming conventions.
- Changing column types (e.g., widening Int32 to Int64, or Float32 to Float64).
- Removing deprecated or sensitive columns from the schema.
- Adding all-null placeholder columns that will be backfilled incrementally.
Theoretical Basis
Zero-Copy vs. Rewrite Operations
Lance classifies schema changes into two categories:
Zero-copy operations modify only the manifest metadata, without touching data files:
- Column renames (update field name in schema)
- Nullability changes (update field flag in schema)
- Column drops (remove field reference from schema; data files untouched)
- Adding all-null columns (add field to schema; no data written)
Rewrite operations require reading and rewriting data:
- Data type casts (each affected fragment is rewritten with the new encoding)
- Adding columns with UDFs or expressions (each fragment is read, the transform is applied, and the new column data is written as additional files)
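The classification above can be captured in a small lookup table, which is handy when estimating whether a planned migration will touch data files. This is an illustrative model only: the operation labels and function names are hypothetical and are not part of the Lance API.

```python
from enum import Enum


class Cost(Enum):
    ZERO_COPY = "zero-copy"  # manifest metadata only
    REWRITE = "rewrite"      # data must be read and written


# Hypothetical operation labels mapped to the categories listed above.
_COSTS = {
    "rename": Cost.ZERO_COPY,
    "set_nullable": Cost.ZERO_COPY,
    "drop": Cost.ZERO_COPY,
    "add_all_nulls": Cost.ZERO_COPY,
    "cast": Cost.REWRITE,
    "add_udf": Cost.REWRITE,
    "add_sql": Cost.REWRITE,
}


def classify(op):
    """Return the cost category of a single schema change."""
    return _COSTS[op]


def plan_cost(ops):
    """A plan is zero-copy only if every step in it is zero-copy."""
    if any(classify(op) is Cost.REWRITE for op in ops):
        return Cost.REWRITE
    return Cost.ZERO_COPY
```

For example, `plan_cost(["rename", "drop"])` evaluates to `Cost.ZERO_COPY`, while adding a single `"cast"` step makes the whole plan a rewrite.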
Fragment-Level Column Files
Lance stores columns within fragments as independent column files. This enables:
- Additive column writes: New columns are written as new files alongside existing fragment files, rather than rewriting the entire fragment.
- Lazy column removal: Dropped columns are simply excluded from the schema; their files remain until compaction removes them.
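The toy model below illustrates both behaviors. It is a deliberate simplification and does not reflect the actual on-disk layout: it assumes one file per column, and the class, method, and file names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ToyFragment:
    schema: list                               # column names visible to readers
    files: dict = field(default_factory=dict)  # column name -> file name

    def add_column(self, name):
        # Additive write: a new file appears; existing files are untouched.
        self.files[name] = f"{name}.lance"
        self.schema.append(name)

    def drop_column(self, name):
        # Lazy removal: only the schema changes; the file remains.
        self.schema.remove(name)

    def compact(self):
        # Compaction discards files no longer referenced by the schema.
        self.files = {c: f for c, f in self.files.items() if c in self.schema}
```

In this model, dropping a column leaves its entry in `files` until `compact` is called, mirroring the gap between a metadata-only drop and the later reclamation of storage.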
Index Preservation
When columns are renamed or have their nullability changed, existing indices on those columns are preserved. When a column's data type is changed or the column is dropped, any indices referencing that column are automatically removed.
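The rule can be expressed as a small function over a mapping of index names to indexed columns. This is a conceptual sketch, not Lance's implementation: the operation labels and the function name are hypothetical.

```python
def surviving_indices(indices, op, column, new_name=None):
    """Return the index -> column mapping after applying one schema change.

    `indices` maps an index name to the column it covers.
    """
    if op in ("cast", "drop"):
        # Indices on the affected column are removed.
        return {name: col for name, col in indices.items() if col != column}
    if op == "rename":
        # Indices are kept and follow the column to its new name.
        return {name: (new_name if col == column else col)
                for name, col in indices.items()}
    if op == "set_nullable":
        # Indices are kept unchanged.
        return dict(indices)
    raise ValueError(f"unknown operation: {op}")
```

Given `{"vec_idx": "embedding", "id_idx": "id"}`, a rename of `embedding` keeps both indices, whereas a cast or drop of `embedding` leaves only `id_idx`.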