Principle:Lance format Lance Schema Evolution

From Leeroopedia


Knowledge Sources
Domains Data_Engineering, Columnar_Storage
Last Updated 2026-02-08 19:00 GMT

Overview

Schema evolution is the process of modifying a Lance dataset's schema by adding, altering, or dropping columns without requiring a full rewrite of existing data.

Description

As ML datasets evolve over time, their schemas must adapt to new features, renamed fields, and changed data types. Lance supports three schema evolution operations:

Adding columns (add_columns): New columns are appended to the schema using one of several transform strategies:

  • BatchUDF: A user-defined function that receives existing rows and produces new column values.
  • SqlExpressions: SQL expressions that compute new column values from existing columns.
  • Stream: A pre-computed stream of RecordBatches containing the new column data.
  • Reader: A RecordBatchReader providing the new column data.
  • AllNulls: Adds columns whose values are all null. This is a schema-only change: no column data is written until the columns are populated later.

Altering columns (alter_columns): Existing columns can be renamed, have their nullability changed, or be cast to a different data type. Renaming and nullability changes are zero-copy metadata-only operations that preserve indices. Data type changes require rewriting the affected column data.

Dropping columns (drop_columns): Columns are removed from the schema. This is a metadata-only operation; the physical column data remains in storage until compaction and cleanup are performed.

All schema evolution operations create a new dataset version. Because they modify the schema, they can conflict with most other concurrent write operations, so they are best scheduled during periods of low write activity.
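The three operations can be sketched with the Lance Python package as follows. This is a minimal illustration rather than a canonical recipe: the function name evolve_demo, the column names, and the sample values are made up for the example, and it assumes the lance and pyarrow packages are installed. The imports sit inside the function only so that the sketch can be loaded without Lance present.

```python
def evolve_demo(uri):
    """Run add, alter, and drop operations on a small throwaway dataset at `uri`.

    `evolve_demo` and the column names are hypothetical, chosen for illustration.
    """
    # Deferred imports: the sketch can be loaded even if Lance is not installed.
    import lance
    import pyarrow as pa

    table = pa.table({
        "id": pa.array([1, 2, 3], type=pa.int32()),
        "x": pa.array([0.5, 1.5, 2.5], type=pa.float32()),
    })
    ds = lance.write_dataset(table, uri)

    # add_columns: compute a new column from existing ones with a SQL expression.
    ds.add_columns({"x_doubled": "x * 2"})

    # alter_columns (rename): a metadata-only change.
    ds.alter_columns({"path": "x", "name": "value"})

    # alter_columns (cast): widening Int32 to Int64 rewrites that column's data.
    ds.alter_columns({"path": "id", "data_type": pa.int64()})

    # drop_columns: removes the field from the schema; the data files stay
    # on disk until compaction and cleanup run.
    ds.drop_columns(["x_doubled"])

    # Each operation above committed a new dataset version.
    return ds.schema.names, str(ds.schema.field("id").type), ds.version
```

Calling evolve_demo with a path inside a temporary directory returns the final column names, the type of the id column, and the latest version number, which makes it easy to confirm that each step committed its own version.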

Usage

Use schema evolution when:

  • Adding computed feature columns to an ML training dataset (e.g., embeddings, normalized values).
  • Renaming columns to follow updated naming conventions.
  • Changing column types (e.g., widening Int32 to Int64, or Float32 to Float64).
  • Removing deprecated or sensitive columns from the schema.
  • Adding all-null placeholder columns that will be backfilled incrementally.

Theoretical Basis

Zero-Copy vs. Rewrite Operations

Lance classifies schema changes into two categories:

Zero-copy operations modify only the manifest metadata, without touching data files:

  • Column renames (update field name in schema)
  • Nullability changes (update field flag in schema)
  • Column drops (remove field reference from schema; data files untouched)
  • Adding all-null columns (add field to schema; no data written)

Rewrite operations require reading and rewriting data:

  • Data type casts (each affected fragment is rewritten with the new encoding)
  • Adding columns with UDFs or expressions (each fragment is read, the transform is applied, and the new column data is written as additional files)
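The two categories can be summarized as a small decision function. The sketch below is a toy model in plain Python, not Lance code: SchemaChange, its fields, and needs_rewrite are hypothetical names that only encode the classification listed above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SchemaChange:
    """Hypothetical description of one schema change (not a Lance type)."""
    kind: str                         # "add", "alter", or "drop"
    transform: Optional[str] = None   # for "add": "all_nulls", "sql", "udf", ...
    new_type: Optional[str] = None    # for "alter": set only when casting


def needs_rewrite(change: SchemaChange) -> bool:
    """Return True if the change must read or write column data."""
    if change.kind == "drop":
        return False                  # field reference removed; files untouched
    if change.kind == "add":
        # Only all-null columns avoid writing data.
        return change.transform != "all_nulls"
    if change.kind == "alter":
        # Renames and nullability changes carry no new_type.
        return change.new_type is not None
    raise ValueError(f"unknown change kind: {change.kind!r}")


rename = SchemaChange("alter")
cast = SchemaChange("alter", new_type="int64")
null_column = SchemaChange("add", transform="all_nulls")
sql_column = SchemaChange("add", transform="sql")
drop = SchemaChange("drop")

zero_copy = [c for c in (rename, cast, null_column, sql_column, drop)
             if not needs_rewrite(c)]
```

Here zero_copy ends up holding the rename, the all-null add, and the drop, matching the first list above.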

Fragment-Level Column Files

Lance stores columns within fragments as independent column files. This enables:

  • Additive column writes: New columns are written as new files alongside existing fragment files, rather than rewriting the entire fragment.
  • Lazy column removal: Dropped columns are simply excluded from the schema; their files remain until compaction removes them.
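A simplified model makes the file-level behavior concrete. The classes below are hypothetical and deliberately ignore Lance's real manifest and file formats: each data file is represented only by the set of column names it stores, and compaction is reduced to discarding columns that are no longer in the schema.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Fragment:
    """Toy fragment: each data file is modeled as the set of columns it holds."""
    files: List[Set[str]] = field(default_factory=list)


@dataclass
class ToyDataset:
    schema: List[str]
    fragments: List[Fragment]

    def add_column(self, name: str) -> None:
        # Additive write: one new file per fragment, existing files untouched.
        for frag in self.fragments:
            frag.files.append({name})
        self.schema.append(name)

    def drop_column(self, name: str) -> None:
        # Lazy removal: only the schema changes; files stay where they are.
        self.schema.remove(name)

    def compact(self) -> None:
        # Simplified compaction: discard columns no longer in the schema,
        # then discard files that became empty.
        live = set(self.schema)
        for frag in self.fragments:
            kept = [f & live for f in frag.files]
            frag.files = [f for f in kept if f]


def file_counts(ds: ToyDataset) -> List[int]:
    return [len(frag.files) for frag in ds.fragments]


toy = ToyDataset(
    schema=["id", "text"],
    fragments=[Fragment([{"id", "text"}]), Fragment([{"id", "text"}])],
)
toy.add_column("embedding")
counts_after_add = file_counts(toy)       # one extra file per fragment
toy.drop_column("embedding")
counts_after_drop = file_counts(toy)      # unchanged: the drop is lazy
toy.compact()
counts_after_compact = file_counts(toy)   # orphaned files are gone
```

The file count per fragment grows on the add, stays the same on the drop, and shrinks only after compaction, which is the behavior the two bullets describe.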

Index Preservation

When columns are renamed or have their nullability changed, existing indices on those columns are preserved. When a column's data type is changed or the column is dropped, any indices referencing that column are automatically removed.
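These rules can be expressed as a pair of small functions over a mapping from index name to indexed column. This is an illustrative model only: indices_after_alter and indices_after_drop are hypothetical helpers, and the keyword arguments merely mirror the kinds of alteration described above.

```python
from typing import Dict, Optional


def indices_after_alter(
    indices: Dict[str, str],
    path: str,
    name: Optional[str] = None,
    nullable: Optional[bool] = None,
    data_type: Optional[str] = None,
) -> Dict[str, str]:
    """Return the indices that survive altering column `path`.

    `indices` maps index name -> indexed column name.
    """
    result = {}
    for index_name, column in indices.items():
        if column != path:
            result[index_name] = column          # unrelated column: untouched
        elif data_type is not None:
            continue                             # type change: index removed
        else:
            # Rename and/or nullability change: index kept, following the rename.
            result[index_name] = name if name is not None else column
    return result


def indices_after_drop(indices: Dict[str, str], column: str) -> Dict[str, str]:
    """Return the indices that survive dropping `column`."""
    return {n: c for n, c in indices.items() if c != column}


indices = {"vec_idx": "embedding", "id_idx": "id"}
after_rename = indices_after_alter(indices, "embedding", name="vector")
after_nullable = indices_after_alter(indices, "id", nullable=True)
after_cast = indices_after_alter(indices, "id", data_type="int64")
after_drop = indices_after_drop(indices, "embedding")
```

In this model the rename keeps vec_idx and points it at the new column name, while the cast and the drop each remove the index on the affected column.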
