Workflow:Lance format Lance Version Management
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, Version_Control, ML_Ops |
| Last Updated | 2026-02-08 19:00 GMT |
Overview
End-to-end process for managing dataset versions, tags, and time-travel queries in Lance using its built-in automatic versioning and transaction system.
Description
This workflow covers Lance's version control system, which automatically creates a new immutable version for every write operation. Each version is tracked by a manifest file that records the schema, fragment list, and transaction metadata. Users can navigate between versions using numeric version IDs or human-readable tags, enabling reproducible ML experiments, dataset rollback, and audit trails. The transaction system provides ACID guarantees with optimistic concurrency control and conflict resolution.
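The optimistic concurrency idea can be illustrated with a small in-memory model. This is a conceptual sketch in plain Python, not the Lance API: `ToyStore`, `ConflictError`, and the conflict rule (only appends can be rebased over other appends) are assumptions made for illustration, and Lance's actual conflict rules are more detailed.

```python
class ConflictError(Exception):
    """Raised when a commit cannot be rebased onto concurrent commits."""


class ToyStore:
    """In-memory stand-in for a dataset's manifest chain (hypothetical)."""

    def __init__(self, fragments):
        self.manifests = {1: tuple(fragments)}  # version -> full fragment list
        self.operations = {1: "overwrite"}      # version -> operation type

    @property
    def latest(self):
        return max(self.manifests)

    def commit(self, read_version, operation, fragments):
        # Operations committed by other writers after this writer's read.
        concurrent = [self.operations[v]
                      for v in range(read_version + 1, self.latest + 1)]
        # Toy rule: an append can be rebased over other appends;
        # every other combination is reported as a conflict.
        if concurrent and (operation != "append"
                           or any(op != "append" for op in concurrent)):
            raise ConflictError(f"{operation} based on v{read_version} conflicts")
        base = self.manifests[self.latest] if operation == "append" else ()
        version = self.latest + 1
        self.manifests[version] = base + tuple(fragments)
        self.operations[version] = operation
        return version


store = ToyStore(["frag-0"])
v2 = store.commit(1, "append", ["frag-1"])    # first writer commits on top of v1
v3 = store.commit(1, "append", ["frag-2"])    # second writer is rebased onto v2
try:
    store.commit(1, "overwrite", ["frag-9"])  # stale overwrite is rejected
    rejected = False
except ConflictError:
    rejected = True
```

Both writers read version 1, yet neither blocks the other: the check happens only at commit time, which is what makes the scheme optimistic.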
Usage
Execute this workflow when you need to track dataset changes over time for reproducibility, tag specific versions for ML model training checkpoints, roll back to a previous dataset state after erroneous writes, or implement branching strategies for parallel experimentation on the same base dataset.
Execution Steps
Step 1: Understanding Automatic Versioning
Every write operation (append, overwrite, update, delete, schema change) automatically creates a new dataset version. Each version is assigned a monotonically increasing integer ID and records its timestamp, the operation type, and the transaction metadata describing what changed relative to the previous version. Versions are immutable once committed; they can be removed only by the cleanup process, after the retention period expires.
Key considerations:
- Version numbers start at 1 and increment with each write
- Each version stores a complete manifest (not a diff), enabling fast access
- Versions are cheap to create since they share unchanged data fragments
- The latest version is the default when opening a dataset
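The points above (complete manifests, shared fragments, incrementing version numbers) can be modeled in a few lines. This is a plain-Python sketch, not the Lance API; `ToyDataset` and the fragment names are hypothetical.

```python
class ToyDataset:
    """Toy model: one complete manifest (fragment list) per version."""

    def __init__(self):
        self.manifests = {}  # version -> tuple of fragment names

    def write(self, fragments, mode="append"):
        latest = max(self.manifests, default=0)
        # An append starts from the previous fragment list; an overwrite starts empty.
        base = self.manifests[latest] if (mode == "append" and latest) else ()
        # The new manifest is a full list, but it only references existing
        # fragments by name; no fragment data is copied.
        self.manifests[latest + 1] = base + tuple(fragments)
        return latest + 1


ds = ToyDataset()
v1 = ds.write(["frag-0", "frag-1"])
v2 = ds.write(["frag-2"])
v3 = ds.write(["frag-3"], mode="overwrite")

# Fragments referenced by both version 1 and version 2.
shared = set(ds.manifests[v1]) & set(ds.manifests[v2])
```

Because every manifest lists all of its fragments, opening any version needs only that one manifest, while unchanged fragments are stored once and referenced many times.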
Step 2: Version Listing and Inspection
List all available versions to understand the dataset's history. Each version entry includes its numeric ID, creation timestamp, and metadata describing the operation that created it. This provides an audit trail of all mutations applied to the dataset.
Key considerations:
- Version listing reads only manifest metadata, not data files
- Older versions may be unavailable if cleanup has removed their data files
- Version metadata includes the operation type (append, overwrite, delete, etc.)
- Use version timestamps to correlate dataset changes with external events
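As a sketch of how version timestamps can be correlated with an external event, the snippet below looks up the version that was current at a given moment. It uses plain Python, not the Lance API; `VersionEntry`, `version_at`, and the sample history are hypothetical.

```python
from bisect import bisect_right
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class VersionEntry:
    version: int
    timestamp: datetime
    operation: str


# Hypothetical history, ordered by commit time.
history = [
    VersionEntry(1, datetime(2026, 1, 5, 9, 0, tzinfo=timezone.utc), "overwrite"),
    VersionEntry(2, datetime(2026, 1, 6, 14, 30, tzinfo=timezone.utc), "append"),
    VersionEntry(3, datetime(2026, 1, 9, 8, 15, tzinfo=timezone.utc), "delete"),
]


def version_at(entries, when):
    """Return the newest entry committed at or before `when`, or None."""
    stamps = [entry.timestamp for entry in entries]
    index = bisect_right(stamps, when)
    return entries[index - 1] if index else None


# Which version was current when a (hypothetical) training job started?
job_start = datetime(2026, 1, 7, 0, 0, tzinfo=timezone.utc)
current = version_at(history, job_start)
```

Only the version entries are consulted here, mirroring the point that listing reads manifest metadata rather than data files.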
Step 3: Time-Travel Queries
Open a specific historical version of the dataset by providing a version number or tag name. All subsequent read operations on this handle reflect the dataset state at that version. This enables reproducible reads for ML training, debugging data issues, and comparing dataset states across versions.
Key considerations:
- Time-travel queries are read-only; you cannot write to a historical version
- Cleanup may have garbage-collected the data files of an old version, in which case opening that version fails
- Historical reads go through the same manifest-and-fragment read path as the latest version, so their performance should be comparable
- Multiple readers can access different versions concurrently
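Resolving a version number or tag name to a read-only snapshot can be sketched as follows. This is plain Python rather than the Lance API; `checkout`, the manifest dictionary, and the tag names are hypothetical.

```python
def checkout(manifests, tags, ref):
    """Resolve `ref` (int version or str tag) to (version, fragment tuple)."""
    # A string is treated as a tag name; an unknown tag raises KeyError.
    version = tags[ref] if isinstance(ref, str) else ref
    if version not in manifests:
        # Models a version whose manifest or data files were cleaned up.
        raise LookupError(f"version {version} is not available")
    # The fragment list is a tuple, so the returned snapshot cannot be mutated.
    return version, manifests[version]


# Version 1 is absent, as if cleanup had already removed it.
manifests = {2: ("frag-0", "frag-1"), 3: ("frag-0", "frag-1", "frag-2")}
tags = {"training_v2": 2}

by_tag = checkout(manifests, tags, "training_v2")
by_number = checkout(manifests, tags, 3)
try:
    checkout(manifests, tags, 1)
    missing = False
except LookupError:
    missing = True
```

Each call returns an independent snapshot, so readers pinned to different versions do not interfere with one another.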
Step 4: Tag Creation and Management
Create human-readable tags that point to specific version numbers. Tags provide stable references like "production", "training_v2", or "pre_cleanup" that persist even as new versions are created. Tags can be listed, created, and deleted to manage important dataset milestones.
Key considerations:
- Tag names must be unique within a dataset
- Tags survive new writes; a tag keeps pointing to the version it was created on, regardless of later versions
- Deleting a tag does not affect the underlying version
- Use tags to mark versions used for training specific ML models
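The tag rules above (unique names, stable targets, deletion that leaves versions intact) can be modeled with a small registry. This is a plain-Python sketch, not the Lance API; `TagRegistry` and its methods are hypothetical.

```python
class TagRegistry:
    """Toy tag store mapping unique names to version numbers."""

    def __init__(self, known_versions):
        self._known = set(known_versions)
        self._tags = {}

    def create(self, name, version):
        if name in self._tags:
            raise ValueError(f"tag {name!r} already exists")
        if version not in self._known:
            raise ValueError(f"version {version} does not exist")
        self._tags[name] = version

    def delete(self, name):
        # Removes only the name; the version it pointed to is untouched.
        del self._tags[name]

    def list(self):
        return dict(sorted(self._tags.items()))


versions = {1, 2, 3}
tags = TagRegistry(versions)
tags.create("training_v2", 2)
tags.create("production", 3)
try:
    tags.create("production", 1)  # duplicate names are rejected
    duplicate_allowed = True
except ValueError:
    duplicate_allowed = False

listing = tags.list()
tags.delete("production")
```

After the delete, version 3 is still present in `versions`; only the name `production` is gone.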
Step 5: Version Restoration
Restore the dataset to a previous version's state by creating a new version that copies the old version's manifest. This effectively "undoes" all changes made after the target version while preserving the full version history. Restoration is a metadata-only operation that does not copy data files.
Key considerations:
- Restore creates a new version (it does not delete intermediate versions)
- The restored version shares data fragments with the original
- Ensure the target version's data files have not been garbage collected
- Combine restoration with tagging: tag both the version being restored and the latest version before the rollback, so either state can be reopened by name
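Restoration as a metadata-only operation can be sketched like this. The snippet is plain Python, not the Lance API; `restore` and the sample manifests are hypothetical.

```python
def restore(manifests, target):
    """Append a new version whose manifest is a copy of `target`'s manifest."""
    if target not in manifests:
        # Models a target whose data files or manifest were garbage collected.
        raise LookupError(f"version {target} is not available")
    new_version = max(manifests) + 1
    # Only the fragment list is reused; no fragment data is copied.
    manifests[new_version] = manifests[target]
    return new_version


manifests = {
    1: ("frag-0",),
    2: ("frag-0", "frag-1"),
    3: ("frag-0", "frag-1", "frag-bad"),  # erroneous write
}

restored = restore(manifests, 2)
```

Versions 1 through 3 remain in the history, so the erroneous write can still be inspected after the rollback.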