

Principle:Lance format Lance Automatic Versioning

From Leeroopedia


Knowledge Sources
Domains Data_Engineering, Version_Control
Last Updated 2026-02-08 19:00 GMT

Overview

Automatic versioning is the mechanism by which every write operation on a Lance dataset produces a new, immutable version without requiring explicit user intervention.

Description

Lance implements a zero-copy, automatic versioning system in which each mutation to a dataset (append, overwrite, delete, merge, index creation, or schema change) is recorded as a discrete version. Every version is identified by a monotonically increasing 64-bit integer and is backed by its own manifest file stored alongside the data. Because a new version adds a manifest rather than rewriting existing data files, any historical state of the dataset can be reproduced by reading the manifest for that version.
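
The idea of one immutable manifest per version can be illustrated with a small in-memory model. This is only a sketch: the names Manifest, ToyDataset, append, and checkout are hypothetical and are not Lance's API, and the choice of version 1 for the empty dataset is an assumption made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Manifest:
    """Immutable description of one dataset version (toy model)."""
    version: int
    fragments: tuple  # names of the data files visible at this version


class ToyDataset:
    """In-memory stand-in for a versioned dataset; not Lance's API."""

    def __init__(self):
        # Assumption: version 1 is the empty dataset in this sketch.
        self._manifests = {1: Manifest(version=1, fragments=())}

    @property
    def latest(self) -> Manifest:
        return self._manifests[max(self._manifests)]

    def append(self, fragment: str) -> Manifest:
        # A write never edits an old manifest; it adds a new one that
        # references the existing fragments, so no data is copied.
        head = self.latest
        new = Manifest(head.version + 1, head.fragments + (fragment,))
        self._manifests[new.version] = new
        return new

    def checkout(self, version: int) -> Manifest:
        return self._manifests[version]


ds = ToyDataset()
ds.append("frag-0")
ds.append("frag-1")
print(ds.latest.version)          # 3
print(ds.checkout(2).fragments)   # ('frag-0',)
```

Older manifests remain readable after later writes, which is what makes historical states reproducible in this model.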

The versioning system is built on two key abstractions:

Transactions encapsulate the intent of a mutation. A Transaction object records the read_version (the version the writer observed before making changes), the Operation (the kind of mutation), and optional blob metadata. The transaction is serialized to a file so that concurrent writers can discover and reason about each other's changes.

Commit handlers enforce mutual exclusion at the version-assignment layer. When a writer is ready to persist its changes, it calls CommitHandler::commit() with a manifest targeting dataset.manifest.version + 1. If another writer has already claimed that version slot, the handler returns a CommitConflict error, triggering a retry loop that rebases the transaction onto the new head and tries the next version number.
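
The slot-claiming behavior described above can be sketched with a put-if-absent check on the version number. ToyCommitHandler and latest_version are hypothetical names for illustration; only the CommitConflict name comes from the text, and the real handler's signature may differ.

```python
class CommitConflict(Exception):
    """Raised when the requested version slot is already taken."""


class ToyCommitHandler:
    """Hypothetical in-memory handler modelling put-if-absent on a slot."""

    def __init__(self):
        self._slots = {}  # version number -> manifest payload

    def commit(self, version: int, manifest: dict) -> None:
        if version in self._slots:
            raise CommitConflict(f"version {version} already committed")
        self._slots[version] = manifest

    def latest_version(self) -> int:
        return max(self._slots, default=0)


handler = ToyCommitHandler()
handler.commit(1, {"writer": "A"})

# Writers B and C both observed version 1 and both target version 2.
handler.commit(2, {"writer": "B"})
try:
    handler.commit(2, {"writer": "C"})
    outcome = "committed"
except CommitConflict:
    # C lost the race; it retries at the next free slot.
    handler.commit(handler.latest_version() + 1, {"writer": "C"})
    outcome = "retried"

print(outcome, handler.latest_version())  # retried 3
```

A dictionary lookup is not atomic across processes, so a real handler would need the storage layer to provide the put-if-absent guarantee; the sketch only shows the control flow.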

Usage

Automatic versioning activates implicitly whenever a dataset is mutated. Users benefit from it in several scenarios:

  • Reproducible ML pipelines -- pin a training job to a specific version so that re-runs always see identical data.
  • Safe concurrent ingestion -- multiple writers can append to the same dataset; the commit protocol serializes their changes without data loss.
  • Rollback after bad writes -- if an ingestion job introduces corrupt data, the dataset can be restored to a prior version with Dataset::restore().
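
Pinning and rollback can be sketched with a toy history of immutable row tuples. VersionedRows and its methods are hypothetical, not Lance's API, and the sketch assumes that a restore is itself recorded as a new version rather than deleting the bad one.

```python
class VersionedRows:
    """Toy versioned dataset; method names are illustrative only."""

    def __init__(self):
        self._versions = {1: ()}  # version -> immutable tuple of rows

    def latest_version(self) -> int:
        return max(self._versions)

    def rows(self, version=None) -> tuple:
        if version is None:
            version = self.latest_version()
        return self._versions[version]

    def append(self, *rows) -> int:
        new_version = self.latest_version() + 1
        self._versions[new_version] = self.rows() + rows
        return new_version

    def restore(self, version: int) -> int:
        # Assumption: restoring commits a NEW version with the old contents,
        # so the bad version stays in history instead of being erased.
        new_version = self.latest_version() + 1
        self._versions[new_version] = self._versions[version]
        return new_version


ds = VersionedRows()
good = ds.append("a", "b")        # version 2
bad = ds.append("<corrupt>")      # version 3
pinned = ds.rows(version=good)    # a training job pinned to version 2
restored = ds.restore(good)       # version 4 mirrors version 2

print(pinned)                     # ('a', 'b')
print(restored, ds.rows())        # 4 ('a', 'b')
```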

Theoretical Basis

Optimistic Concurrency Control

Lance uses an optimistic concurrency control (OCC) strategy. Writers proceed without acquiring locks, and conflicts are detected only at commit time. The protocol follows these steps:

  1. The writer reads the current dataset at version V and performs local computation.
  2. The writer builds a Transaction with read_version = V.
  3. The writer attempts to commit a manifest at version V + 1.
  4. If the slot V + 1 is already taken, the writer:
    1. Loads all transactions committed since V.
    2. Checks each against its own transaction using TransactionRebase to determine compatibility.
    3. If compatible, rebases its transaction onto the new head and retries at the new latest + 1.
    4. If incompatible, reports a conflict.
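
The steps above can be sketched as a single-process simulation. The compatibility rule used here (two transactions conflict when they modify the same fragment id) and the touched field are assumptions made for illustration; they do not reproduce the rules that TransactionRebase applies.

```python
from dataclasses import dataclass, replace


class CommitConflict(Exception):
    pass


@dataclass(frozen=True)
class Transaction:
    read_version: int
    touched: frozenset = frozenset()  # hypothetical: ids of modified fragments


def compatible(mine: Transaction, other: Transaction) -> bool:
    # Assumed rule for illustration only: disjoint modifications commute.
    return mine.touched.isdisjoint(other.touched)


def commit(log: dict, txn: Transaction, max_retries: int = 5) -> int:
    """Commit txn into log (version -> Transaction); return its version."""
    for _ in range(max_retries):
        latest = max(log, default=0)
        # Steps 4.1 and 4.2: inspect everything committed since read_version.
        for v in range(txn.read_version + 1, latest + 1):
            if not compatible(txn, log[v]):
                raise CommitConflict(f"incompatible with version {v}")
        # Step 4.3: rebase onto the new head and target the next slot.
        txn = replace(txn, read_version=latest)
        target = latest + 1
        if target not in log:  # stands in for an atomic put-if-absent
            log[target] = txn
            return target
    raise CommitConflict("retries exhausted")


log = {1: Transaction(read_version=0)}
a = Transaction(read_version=1)                         # pure append
b = Transaction(read_version=1, touched=frozenset({7}))
c = Transaction(read_version=1, touched=frozenset({7}))

v_a = commit(log, a)
v_b = commit(log, b)   # rebased past a
try:
    commit(log, c)
    error = None
except CommitConflict as exc:
    error = str(exc)

print(v_a, v_b)   # 2 3
print(error)      # incompatible with version 3
```

In a single process the target slot is always free, so the retry branch is never taken here; with truly concurrent writers the put-if-absent check is where a lost race would send the writer back to the top of the loop.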

Slot Backoff

To reduce contention under high concurrency, retries use a SlotBackoff algorithm. The first attempt records its wall-clock duration, and subsequent attempts sleep for a randomized multiple of that duration, scaled exponentially by the attempt number. This spreads retrying writers across time slots.
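
One way to read this description is sketched below. SlotBackoffSketch is a hypothetical class, and the exact scaling (choosing uniformly among 2**attempt slots, each one first-attempt duration long) is an assumption; the text only states that the delay is a randomized multiple of the first attempt's duration that grows exponentially with the attempt number.

```python
import random


class SlotBackoffSketch:
    """Hypothetical backoff: the slot size is learned from the first attempt."""

    def __init__(self, rng=None):
        self.rng = rng if rng is not None else random.Random()
        self.unit = None   # seconds taken by the first attempt
        self.attempt = 0

    def record_first_attempt(self, seconds: float) -> None:
        self.unit = seconds

    def next_delay(self) -> float:
        # Assumed scaling: pick one of 2**attempt slots uniformly at random,
        # so the range of possible delays doubles with every retry.
        self.attempt += 1
        slots = 2 ** self.attempt
        return self.unit * self.rng.randrange(slots)


backoff = SlotBackoffSketch(random.Random(0))
backoff.record_first_attempt(0.05)   # the first attempt took 50 ms
delays = [backoff.next_delay() for _ in range(4)]
print(delays)
```

Each delay is a whole number of slots, so two writers that pick different slots do not collide again on the same version number, and the growing slot count lowers the chance of picking the same slot as contention persists.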

Pseudocode

function commit_transaction(dataset, transaction):
    target_version = dataset.version + 1
    for attempt in 1..max_retries:
        # Compare against everything committed after the version we read.
        other_txns = load_transactions_since(transaction.read_version)
        rebase = TransactionRebase(transaction)
        for (v, txn) in other_txns:
            rebase.check(txn, v)      # fails on an incompatible change
        transaction = rebase.finish(dataset)
        manifest = transaction.build_manifest(dataset.manifest)
        manifest.version = target_version
        result = commit_handler.commit(manifest)
        if result == OK:
            return manifest
        else if result == CommitConflict:
            backoff.wait()
            # Refresh the head so the next manifest builds on current state.
            dataset = dataset.checkout_latest()
            target_version = dataset.version + 1
    raise CommitConflict              # retries exhausted
