Principle:Lance format Lance Automatic Versioning
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, Version_Control |
| Last Updated | 2026-02-08 19:00 GMT |
Overview
Automatic versioning is the mechanism by which every write operation on a Lance dataset produces a new, immutable version without requiring explicit user intervention.
Description
Lance implements a zero-copy, automatic versioning system where each mutation to a dataset (append, overwrite, delete, merge, index creation, or schema change) is recorded as a discrete version. Every version is identified by a monotonically increasing 64-bit integer and is backed by its own manifest file stored alongside the data. This approach ensures full reproducibility of any historical state of the dataset.
The versioning system is built on two key abstractions:
- Transactions encapsulate the intent of a mutation. A Transaction object records the read_version (the version the writer observed before making changes), the Operation (the kind of mutation), and optional blob metadata. The transaction is serialized to a file so that concurrent writers can discover and reason about each other's changes.
- Commit handlers enforce mutual exclusion at the version-assignment layer. When a writer is ready to persist its changes, it calls CommitHandler::commit() with a manifest targeting dataset.manifest.version + 1. If another writer has already claimed that version slot, the handler returns a CommitConflict error, triggering a retry loop that rebases the transaction onto the new head and tries the next version number.
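The mutual exclusion described above can be illustrated with an exclusive file create on a local filesystem. This is a minimal sketch, not Lance's CommitHandler: the function name commit_manifest, the `<version>.manifest` file layout, and the demo helper are assumptions made for illustration only. It shows that when two writers target the same version slot, exactly one of them succeeds.

```python
import os
import tempfile


class CommitConflict(Exception):
    """Raised when the requested version slot is already taken."""


def commit_manifest(versions_dir: str, version: int, payload: bytes) -> str:
    """Claim `version` by creating its manifest file exclusively (hypothetical layout)."""
    path = os.path.join(versions_dir, f"{version}.manifest")
    try:
        # O_EXCL makes creation fail if the file already exists, so only
        # one writer can ever claim a given version number.
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL)
    except FileExistsError:
        raise CommitConflict(f"version {version} already committed") from None
    with os.fdopen(fd, "wb") as f:
        f.write(payload)
    return path


def demo() -> list:
    """Two writers race for version 2; only the first one wins."""
    outcomes = []
    with tempfile.TemporaryDirectory() as versions_dir:
        for writer in ("writer-a", "writer-b"):
            try:
                commit_manifest(versions_dir, 2, writer.encode())
                outcomes.append((writer, "committed"))
            except CommitConflict:
                outcomes.append((writer, "conflict"))
    return outcomes
```

On object stores without an exclusive-create primitive, the same guarantee would have to come from another mechanism (for example a conditional put or an external lock); the sketch only covers the local-filesystem case.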
Usage
Automatic versioning activates implicitly whenever a dataset is mutated. Users benefit from it in several scenarios:
- Reproducible ML pipelines -- pin a training job to a specific version so that re-runs always see identical data.
- Safe concurrent ingestion -- multiple writers can append to the same dataset; the commit protocol serializes their changes without data loss.
- Rollback after bad writes -- if an ingestion job introduces corrupt data, the dataset can be restored to a prior version with Dataset::restore().
Theoretical Basis
Optimistic Concurrency Control
Lance uses an optimistic concurrency control (OCC) strategy. Writers proceed without acquiring locks, and conflicts are detected only at commit time. The protocol follows these steps:
- The writer reads the current dataset at version V and performs local computation.
- The writer builds a Transaction with read_version = V.
- The writer attempts to commit a manifest at version V + 1.
- If the slot V + 1 is already taken, the writer:
  - Loads all transactions committed since V.
  - Checks each against its own transaction using TransactionRebase to determine compatibility.
  - If compatible, rebases its transaction onto the new head and retries at the new latest + 1.
  - If incompatible, reports a conflict.
Slot Backoff
To reduce contention under high concurrency, retries use a SlotBackoff algorithm. The first attempt records its wall-clock duration, and subsequent attempts sleep for a randomized multiple of that duration, scaled exponentially by the attempt number. This spreads retrying writers across time slots.
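The idea behind slot-based backoff can be sketched as follows. This is an illustrative toy, not Lance's SlotBackoff implementation: the class name SlotBackoffSketch, the base_slots parameter, and the doubling schedule are assumptions. The duration of the first attempt becomes the slot width, and each retry picks a random slot from a range that doubles with every attempt.

```python
import random
from typing import Optional


class SlotBackoffSketch:
    """Toy slot-based backoff (hypothetical; parameters are assumptions)."""

    def __init__(self, base_slots: int = 4, rng: Optional[random.Random] = None):
        self.base_slots = base_slots
        self.rng = rng or random.Random()
        self.unit: Optional[float] = None  # slot width in seconds
        self.attempt = 0

    def record_first_attempt(self, duration_s: float) -> None:
        """Use the measured duration of the first attempt as the slot width."""
        self.unit = duration_s

    def next_delay(self) -> float:
        """Return the sleep time before the next retry, in seconds."""
        if self.unit is None:
            raise RuntimeError("record_first_attempt() must be called first")
        self.attempt += 1
        # The number of candidate slots doubles on every retry.
        num_slots = self.base_slots * 2 ** (self.attempt - 1)
        # Pick one slot uniformly at random; the delay is a whole multiple
        # of the slot width, so retrying writers land in distinct slots.
        slot = self.rng.randrange(num_slots)
        return slot * self.unit
```

A caller would time its first commit attempt, pass that duration to record_first_attempt(), and then sleep for next_delay() seconds before each retry.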
Pseudocode
function commit_transaction(dataset, transaction):
    target_version = dataset.version + 1
    for attempt in 1..max_retries:
        other_txns = load_transactions_since(dataset.version)
        rebase = TransactionRebase(transaction)
        for (v, txn) in other_txns:
            rebase.check(txn, v)
        transaction = rebase.finish(dataset)
        manifest = transaction.build_manifest(dataset.manifest)
        manifest.version = target_version
        result = commit_handler.commit(manifest)
        if result == OK:
            return manifest
        else if result == CommitConflict:
            backoff.wait()
            target_version = dataset.latest_version + 1
    raise CommitConflict
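For readers who want something executable, the following is a small in-memory Python sketch of the same retry loop. It follows the protocol steps listed under Optimistic Concurrency Control (try the slot first, then check and rebase on conflict) and omits backoff. Everything here is hypothetical: ToyDataset, try_commit, and especially the compatible() rule, which simply treats two appends as compatible and everything else as a conflict. Lance's real compatibility rules are not reproduced.

```python
from dataclasses import dataclass, field


class CommitConflict(Exception):
    """Raised on an incompatible concurrent change or exhausted retries."""


@dataclass
class Transaction:
    read_version: int
    operation: str  # e.g. "append" or "overwrite"


@dataclass
class ToyDataset:
    """In-memory stand-in mapping version number -> committed transaction."""
    committed: dict = field(default_factory=lambda: {1: Transaction(0, "create")})

    @property
    def latest_version(self) -> int:
        return max(self.committed)

    def try_commit(self, version: int, txn: Transaction) -> bool:
        """Put-if-absent: succeeds only if the version slot is free."""
        if version in self.committed:
            return False
        self.committed[version] = txn
        return True


def compatible(mine: Transaction, other: Transaction) -> bool:
    # Assumed rule for illustration only: appends never conflict with appends.
    return mine.operation == "append" and other.operation == "append"


def commit_transaction(ds: ToyDataset, txn: Transaction, max_retries: int = 5) -> int:
    """Commit `txn`, rebasing onto the new head when the slot is taken."""
    target = txn.read_version + 1
    checked_up_to = txn.read_version
    for _ in range(max_retries):
        if ds.try_commit(target, txn):
            return target
        # Slot taken: inspect every transaction committed since our last check.
        head = ds.latest_version
        for v in range(checked_up_to + 1, head + 1):
            if not compatible(txn, ds.committed[v]):
                raise CommitConflict(f"incompatible with version {v}")
        checked_up_to = head
        # Rebase onto the new head and aim for the next free slot.
        txn = Transaction(read_version=head, operation=txn.operation)
        target = head + 1
    raise CommitConflict("retries exhausted")
```

In this toy, two writers that both read version 1 and append end up at versions 2 and 3, while a stale overwrite that read version 1 is rejected once an append has landed.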