Principle:Treeverse LakeFS Commit
| Knowledge Sources | |
|---|---|
| Domains | Data_Version_Control, Data_Engineering |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
An atomic commit in data version control creates an immutable snapshot of the complete data state on a branch, enabling point-in-time recovery and full auditability.
Description
A commit in lakeFS captures the entire state of a branch at a specific moment in time. It is analogous to a Git commit: it records all staged changes as a permanent, immutable snapshot. Once created, a commit cannot be altered or deleted, providing a reliable audit trail and enabling precise rollback to any previous data state.
Each commit in lakeFS contains:
- Unique identifier: A content-addressable hash that uniquely identifies the commit.
- Parent reference(s): Zero or more parent commit IDs (the root commit has none), forming a directed acyclic graph (DAG) of the version history.
- Message: A human-readable description of the changes included in the commit.
- Metadata: Optional key-value pairs for storing additional context (e.g., pipeline run ID, data source, author information).
- Meta-range ID: A reference to the internal data structure that records the complete set of objects in the committed state.
- Committer: The identity of the user or system that created the commit.
- Creation date: The timestamp when the commit was created.
- Generation: An integer representing the commit's position in the DAG, used for efficient traversal.
Key properties of commits:
- Atomicity: A commit either succeeds completely or fails entirely; there is no partial commit state.
- Immutability: Once created, the contents of a commit cannot be changed.
- Completeness: A commit captures the full state of all objects on the branch, not just the changes.
- Reproducibility: Any commit can be checked out or referenced to recreate the exact data state at that point in time.
Usage
Commits are the fundamental building blocks of data version control. Use commits when:
- Finalizing pipeline outputs: After a data pipeline completes its processing, commit the results to create a permanent record.
- Creating checkpoints: Before risky operations (schema changes, large-scale data transformations), commit the current state as a safety net.
- Enabling auditability: Each commit provides a traceable record of who changed what data, when, and why.
- Supporting reproducibility: Commit IDs can be recorded alongside model training runs or analysis results to ensure exact data provenance.
- Facilitating review workflows: Committed changes can be diffed, reviewed, and discussed before being merged into production branches.
Theoretical Basis
Commits in lakeFS follow the content-addressable storage model common to modern version control systems:
Content-addressable identification:
Each commit is identified by a hash derived from its contents (parent references, meta-range, metadata, message, and timestamp). Two commits share an identifier only if all of these fields match, and any change to a commit's contents yields a different identifier, so tampering is detectable.
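The idea can be illustrated with a minimal sketch. It is not lakeFS's actual hashing scheme: the `CommitRecord` type, the JSON canonicalization, and the choice of SHA-256 are assumptions made for illustration only.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class CommitRecord:
    """Hypothetical commit record holding the fields that feed the hash."""
    parents: Tuple[str, ...]
    meta_range_id: str
    message: str
    creation_date: int  # Unix timestamp in seconds
    metadata: Tuple[Tuple[str, str], ...] = ()


def commit_id(record: CommitRecord) -> str:
    """Derive an identifier from the record's contents.

    The record is serialized to a canonical form (sorted keys, sorted
    metadata pairs) so that equal contents always hash to the same value.
    """
    canonical = json.dumps(
        {
            "parents": list(record.parents),
            "meta_range_id": record.meta_range_id,
            "message": record.message,
            "creation_date": record.creation_date,
            "metadata": sorted(record.metadata),
        },
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Changing any single field, including the message or the timestamp, produces a different identifier, which is what makes later modification of a commit detectable.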
Directed Acyclic Graph (DAG):
Commits form a DAG where each commit points to zero or more parent commits:
- Linear history: A standard commit has exactly one parent, extending the branch history linearly.
- Merge commit: A merge commit has two parents, representing the integration of two branches.
- Root commit: The initial commit in a repository has no parent.
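The three cases above can be sketched with a small in-memory graph. The `Node` type and both functions are hypothetical, and the generation rule used here (one more than the largest parent generation, with the root at 0) is an assumption about how such a number could be assigned, not a statement of lakeFS internals.

```python
import heapq
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Node:
    id: str
    parents: Tuple[str, ...]
    generation: int


def add_commit(graph: Dict[str, Node], commit_id: str,
               parents: Tuple[str, ...] = ()) -> Node:
    """Add a commit whose parents must already exist, keeping the graph acyclic."""
    if commit_id in graph:
        raise ValueError(f"commit {commit_id!r} already exists")
    missing = [p for p in parents if p not in graph]
    if missing:
        raise KeyError(f"unknown parent(s): {missing}")
    # Assumed rule: root gets 0, every other commit is one past its deepest parent.
    generation = 1 + max((graph[p].generation for p in parents), default=-1)
    node = Node(commit_id, tuple(parents), generation)
    graph[commit_id] = node
    return node


def ancestors(graph: Dict[str, Node], start: str) -> List[str]:
    """Return start and all its ancestors, highest generation first."""
    seen = {start}
    heap = [(-graph[start].generation, start)]
    order: List[str] = []
    while heap:
        _, current = heapq.heappop(heap)
        order.append(current)
        for parent in graph[current].parents:
            if parent not in seen:
                seen.add(parent)
                heapq.heappush(heap, (-graph[parent].generation, parent))
    return order
```

Ordering the walk by generation means a commit is never visited before any of its descendants in the traversed history, which is one reason a per-commit generation number is useful for traversal.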
Snapshot model:
Unlike delta-based systems that store only changes, lakeFS commits use a snapshot model. Each commit records a complete manifest (meta-range) of all objects on the branch at that point. This enables:
- Checkout without replaying deltas: resolving a commit to its object listing does not depend on the length of the history
- Efficient garbage collection of unreferenced objects
- Simple and reliable state reconstruction
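A toy model of the snapshot approach is sketched below. The `SnapshotStore` class is hypothetical and copies a plain dictionary per commit, whereas a real meta-range is a more compact on-disk structure; the sketch only shows why state lookup and garbage collection become simple when every commit carries a full manifest.

```python
from types import MappingProxyType
from typing import Dict, Mapping, Set


class SnapshotStore:
    """Maps each commit ID to a read-only manifest of path -> object address."""

    def __init__(self) -> None:
        self._manifests: Dict[str, Mapping[str, str]] = {}

    def commit(self, commit_id: str, staged_state: Mapping[str, str]) -> None:
        """Record the full state under commit_id; existing commits are never rewritten."""
        if commit_id in self._manifests:
            raise ValueError(f"commit {commit_id!r} already exists")
        # Copy, then wrap read-only, so later changes to staged_state cannot leak in.
        self._manifests[commit_id] = MappingProxyType(dict(staged_state))

    def state_at(self, commit_id: str) -> Mapping[str, str]:
        """Return the complete state at a commit with a single lookup, no delta replay."""
        return self._manifests[commit_id]

    def unreferenced(self, all_objects: Set[str]) -> Set[str]:
        """Objects that no commit's manifest points to: candidates for garbage collection."""
        referenced = {
            address
            for manifest in self._manifests.values()
            for address in manifest.values()
        }
        return all_objects - referenced
```

For example, after committing `{"a.parquet": "obj1"}` and then `{"a.parquet": "obj2"}`, both states remain individually readable, and an object address that appears in neither manifest is reported by `unreferenced`.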
Pre-commit hooks:
lakeFS supports pre-commit hooks that execute custom validation logic before a commit is finalized. If a pre-commit hook returns an error, the commit is rejected with a 412 (Precondition Failed) status. This enables data quality gates, schema validation, and policy enforcement as part of the commit workflow.
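The gate behaviour can be sketched in a few lines. In lakeFS, hooks are configured on the server side; the in-process callables, the `PreconditionFailed` exception, and the `require_parquet` rule below are hypothetical stand-ins that only model the "reject on first hook error" flow.

```python
from typing import Callable, Dict, List, Optional

# A hook inspects the staged state and returns an error message, or None to pass.
Hook = Callable[[Dict[str, str]], Optional[str]]


class PreconditionFailed(Exception):
    """Raised when a hook rejects the commit; mirrors the 412 status."""
    status = 412


def require_parquet(staged: Dict[str, str]) -> Optional[str]:
    """Example data quality gate: every staged path must end in .parquet."""
    bad = sorted(path for path in staged if not path.endswith(".parquet"))
    return f"non-parquet paths: {bad}" if bad else None


def run_commit(staged: Dict[str, str], hooks: List[Hook],
               history: List[Dict[str, str]]) -> int:
    """Run every hook first; append to history only if all of them pass."""
    for hook in hooks:
        error = hook(staged)
        if error is not None:
            # Nothing has been written yet, so the rejection leaves history untouched.
            raise PreconditionFailed(error)
    history.append(dict(staged))
    return len(history) - 1
```

Because the hooks run before anything is recorded, a rejected commit leaves the history exactly as it was, which matches the all-or-nothing behaviour described under atomicity.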
Empty commits and force commits:
- allow_empty: Permits creating a commit even when there are no staged changes, useful for recording metadata-only events.
- force: Bypasses certain safety checks, enabling commits in situations that would otherwise be rejected.
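As a rough illustration of where these flags sit, the sketch below assembles a commit request without sending it. The endpoint path and the JSON field names are assumptions based on the lakeFS REST API and should be checked against the API reference for the version in use; `build_commit_request` and the example base URL are hypothetical.

```python
import json
from typing import Dict, Optional, Tuple
from urllib.parse import quote


def build_commit_request(
    base_url: str,
    repository: str,
    branch: str,
    message: str,
    metadata: Optional[Dict[str, str]] = None,
    allow_empty: bool = False,
    force: bool = False,
) -> Tuple[str, str]:
    """Return (url, json_body) for a commit request; nothing is sent over the network."""
    # Assumed path layout: /repositories/{repository}/branches/{branch}/commits
    url = (
        f"{base_url.rstrip('/')}/repositories/{quote(repository, safe='')}"
        f"/branches/{quote(branch, safe='')}/commits"
    )
    body = {
        "message": message,
        "metadata": metadata or {},
        "allow_empty": allow_empty,  # permit a commit with no staged changes
        "force": force,              # bypass certain safety checks
    }
    return url, json.dumps(body)
```

A metadata-only event could then be described with `build_commit_request("https://lakefs.example.com/api/v1", "repo", "main", "nightly run finished", metadata={"run_id": "42"}, allow_empty=True)`, leaving `force` at its default so the usual safety checks still apply.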