Principle:Treeverse LakeFS Diff and Review
| Knowledge Sources | |
|---|---|
| Domains | Data_Version_Control, Data_Engineering |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
Diff operations in data version control compare two references (branches, commits, or tags) to identify added, removed, and changed objects, enabling structured data review before merge.
Description
The diff and review process in lakeFS provides a mechanism to compare the state of data between two reference points. This is analogous to git diff and is a critical step in data governance workflows, enabling teams to review changes before promoting them to production branches.
A diff operation produces a list of differences between two references, where each difference identifies:
- Added objects: Objects present in the right reference but absent in the left reference.
- Removed objects: Objects present in the left reference but absent in the right reference.
- Changed objects: Objects present in both references but with different content (different checksums or sizes).
- Conflict objects: Objects modified in both references in incompatible ways (relevant for three-dot diffs).
lakeFS supports two types of diff:
- Two-dot diff (left..right): Directly compares the state at two references. This answers "what is different between these two states?"
- Three-dot diff (left...right): Compares the right reference against the common ancestor (merge base) of both references. This answers "what changes were introduced on the right branch since it diverged from the left branch?" This is the default mode and is typically more useful for merge review.
Usage
Diff and review operations are essential in the following scenarios:
- Pre-merge review: Before merging a feature branch into production, diff the branch against main to understand what data changes will be introduced.
- Data quality validation: Inspect diffs to ensure that expected objects were added or updated and that no unintended deletions occurred.
- Change auditing: Compare two commits to understand what changed between data pipeline runs.
- Conflict detection: Use three-dot diffs to identify potential conflicts before attempting a merge operation.
- Compliance and governance: Maintain a reviewable record of all data changes, supporting regulatory and internal audit requirements.
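The pre-merge review and data quality scenarios above can be partly automated. The following is a minimal sketch, assuming diff entries have already been fetched and are represented as (path, change type) pairs. The function name `review_findings`, the `allowed_prefixes` policy, and the `allow_removals` flag are hypothetical and are not part of lakeFS.

```python
from typing import Iterable, List, Sequence, Tuple

# Hypothetical model of a diff entry: (object path, change type).
DiffEntry = Tuple[str, str]


def review_findings(
    entries: Iterable[DiffEntry],
    allowed_prefixes: Sequence[str],
    allow_removals: bool = False,
) -> List[str]:
    """Return human-readable findings that a reviewer should look at."""
    findings = []
    for path, change in entries:
        if change == "conflict":
            # Conflicts always need attention before a merge.
            findings.append(f"conflict: {path}")
        elif change == "removed" and not allow_removals:
            findings.append(f"unexpected removal: {path}")
        elif not path.startswith(tuple(allowed_prefixes)):
            findings.append(f"change outside expected prefixes: {path}")
    return findings


entries = [
    ("sales/2024/01.parquet", "added"),
    ("sales/2023/12.parquet", "removed"),
    ("tmp/debug.csv", "added"),
]
findings = review_findings(entries, allowed_prefixes=["sales/"])
for finding in findings:
    print(finding)
```

An empty result means the diff matches the stated expectations. It does not replace a human review of the changes themselves.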
Theoretical Basis
Two-dot diff:
A two-dot diff computes the symmetric difference between two data snapshots, treating each snapshot as a set of (path, content) entries. Given references A and B, it identifies:
- Objects in B but not A (additions)
- Objects in A but not B (deletions)
- Objects in both A and B but with different content (modifications)
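Formally: diff(A, B) = { (path, type) | state(A, path) != state(B, path) }, where state(R, path) is the content of the object at path in reference R (or "absent" if there is none) and type is one of added, removed, or changed.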
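The definition above can be illustrated with a minimal sketch. It assumes a snapshot is modeled as a plain dictionary from object path to content checksum. This is a simplification for illustration and not how lakeFS stores or compares data internally. The name `two_dot_diff` is hypothetical.

```python
from typing import Dict, List, Tuple

# Assumed model: a snapshot maps object path -> content checksum.
Snapshot = Dict[str, str]


def two_dot_diff(left: Snapshot, right: Snapshot) -> List[Tuple[str, str]]:
    """List (path, type) for every path whose state differs between snapshots."""
    changes = []
    for path in sorted(left.keys() | right.keys()):
        if path not in left:
            changes.append((path, "added"))      # only in the right reference
        elif path not in right:
            changes.append((path, "removed"))    # only in the left reference
        elif left[path] != right[path]:
            changes.append((path, "changed"))    # in both, different content
    return changes


main = {"data/a.parquet": "c1", "data/b.parquet": "c2", "data/c.parquet": "c3"}
feature = {"data/a.parquet": "c1", "data/b.parquet": "c9", "data/d.parquet": "c4"}

result = two_dot_diff(main, feature)
print(result)
# [('data/b.parquet', 'changed'), ('data/c.parquet', 'removed'), ('data/d.parquet', 'added')]
```

Unchanged paths such as data/a.parquet do not appear in the result, which matches the condition state(A, path) != state(B, path).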
Three-dot diff:
A three-dot diff first computes the merge base M of references A and B, then diffs B against M. This isolates the changes introduced on B's lineage since it diverged from A:
- Find merge base: M = merge_base(A, B)
- Compute diff: diff(M, B)
This is particularly useful for merge review because it shows only the changes that the source branch would introduce, excluding anything that changed only on the destination branch after the two diverged.
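To make the two steps concrete, here is a minimal sketch under stated assumptions: the commit graph is a dictionary from commit ID to parent IDs, each commit has a snapshot mapping path to checksum, and the merge base is taken as the first common ancestor found by a breadth-first walk from B. Real systems choose the merge base more carefully. All names here (`merge_base`, `three_dot_diff`, the commit IDs) are hypothetical.

```python
from collections import deque
from typing import Dict, List, Optional, Set, Tuple

Snapshot = Dict[str, str]        # path -> content checksum (assumed model)
Parents = Dict[str, List[str]]   # commit ID -> parent commit IDs


def diff(left: Snapshot, right: Snapshot) -> List[Tuple[str, str]]:
    """Plain snapshot comparison, used for both two-dot and three-dot diffs."""
    out = []
    for path in sorted(left.keys() | right.keys()):
        if path not in left:
            out.append((path, "added"))
        elif path not in right:
            out.append((path, "removed"))
        elif left[path] != right[path]:
            out.append((path, "changed"))
    return out


def ancestors(parents: Parents, start: str) -> Set[str]:
    """All commits reachable from start, including start itself."""
    seen: Set[str] = set()
    queue = deque([start])
    while queue:
        commit = queue.popleft()
        if commit not in seen:
            seen.add(commit)
            queue.extend(parents.get(commit, []))
    return seen


def merge_base(parents: Parents, a: str, b: str) -> Optional[str]:
    """First ancestor of b (breadth-first) that is also an ancestor of a."""
    reachable_from_a = ancestors(parents, a)
    seen: Set[str] = set()
    queue = deque([b])
    while queue:
        commit = queue.popleft()
        if commit in reachable_from_a:
            return commit
        if commit not in seen:
            seen.add(commit)
            queue.extend(parents.get(commit, []))
    return None


def three_dot_diff(parents: Parents, snapshots: Dict[str, Snapshot],
                   a: str, b: str) -> List[Tuple[str, str]]:
    base = merge_base(parents, a, b)
    if base is None:
        raise ValueError("references share no common ancestor")
    return diff(snapshots[base], snapshots[b])


# c0 is the common ancestor; m1 is on main, f1 is on the feature branch.
parents = {"c0": [], "m1": ["c0"], "f1": ["c0"]}
snapshots = {
    "c0": {"data/x": "v1"},
    "m1": {"data/x": "v1", "data/y": "v1"},   # main added data/y
    "f1": {"data/x": "v2", "data/z": "v1"},   # feature changed x, added z
}

three_dot = three_dot_diff(parents, snapshots, "m1", "f1")
two_dot = diff(snapshots["m1"], snapshots["f1"])
print(three_dot)
print(two_dot)
```

In this example the two-dot result also reports data/y as removed, because main added it after the branches diverged. The three-dot result omits it, since the feature branch never touched that path.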
Pagination and filtering:
For repositories with large numbers of objects, diff results are paginated and can be filtered by path prefix and delimiter. This enables efficient review of specific subdirectories or data partitions without loading the entire diff.
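The interaction of prefix, delimiter, and pagination can be sketched over an in-memory list of diff entries. This is an illustrative model only: the function `page_diff`, the `Page` type, the `after` and `amount` parameters, and the "prefix_changed" label used for collapsed groups are assumptions made for this example, not a statement of the lakeFS API.

```python
from typing import List, NamedTuple, Tuple

DiffEntry = Tuple[str, str]  # (path, change type), assumed model


class Page(NamedTuple):
    results: List[DiffEntry]
    has_more: bool
    next_after: str  # pass as `after` to fetch the following page


def page_diff(entries: List[DiffEntry], prefix: str = "", delimiter: str = "",
              after: str = "", amount: int = 100) -> Page:
    """Filter by prefix, collapse by delimiter, then return one page."""
    collapsed: List[DiffEntry] = []
    seen_groups = set()
    for path, change in sorted(entries):
        if not path.startswith(prefix):
            continue
        rest = path[len(prefix):]
        if delimiter and delimiter in rest:
            # Collapse everything under the next path segment into one entry.
            group = prefix + rest.split(delimiter, 1)[0] + delimiter
            if group not in seen_groups:
                seen_groups.add(group)
                collapsed.append((group, "prefix_changed"))
        else:
            collapsed.append((path, change))
    remaining = [entry for entry in collapsed if entry[0] > after]
    page = remaining[:amount]
    has_more = len(remaining) > amount
    return Page(page, has_more, page[-1][0] if has_more else "")


entries = [
    ("events/2024/01/a.parquet", "added"),
    ("events/2024/01/b.parquet", "added"),
    ("events/2024/02/a.parquet", "changed"),
    ("events/README.md", "changed"),
    ("users/u.parquet", "removed"),
]

# Directory-style overview of one prefix.
overview = page_diff(entries, prefix="events/", delimiter="/")
print(overview.results)

# Object-level listing of one partition, two entries per page.
first = page_diff(entries, prefix="events/2024/", amount=2)
second = page_diff(entries, prefix="events/2024/", amount=2, after=first.next_after)
print(first.results, first.has_more)
print(second.results, second.has_more)
```

A reviewer would typically start with the collapsed overview to see which partitions changed, then page through a single partition at object level.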
Diff as a prerequisite for merge:
In a well-governed data workflow, the diff-review-merge cycle ensures that:
- Changes are visible and comprehensible before integration.
- Conflicts are identified and resolved proactively.
- Stakeholders can approve or reject changes based on the diff output.
- An audit trail of what was reviewed and by whom is maintained.