Workflow: Treeverse lakeFS Data Version Control with Branches
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, Data_Version_Control, Data_Lake_Management |
| Last Updated | 2026-02-08 10:00 GMT |
Overview
End-to-end process for versioning data lake objects using Git-like branching, committing, diffing, merging, and tagging operations via the lakeFS API.
Description
This workflow describes the core data version control lifecycle in lakeFS. It covers creating a versioned repository backed by object storage (AWS S3, Azure Blob Storage, or Google Cloud Storage), creating isolated branches for experimentation or ETL development, uploading and managing data objects, committing snapshots, comparing changes via diffs, merging branches to promote validated data, and tagging releases for reproducibility. The entire process mirrors Git semantics but operates on data lake objects at scale.
Usage
Execute this workflow when you need to manage data lake objects with version control semantics. Typical triggers include: starting a new data project that requires reproducibility, setting up isolated development and testing environments for ETL pipelines, needing to track and audit data changes over time, or requiring atomic rollback capabilities for production data.
Execution Steps
Step 1: Repository Creation
Create a new lakeFS repository linked to an object storage namespace. The repository serves as the top-level container for all versioned data, analogous to a Git repository. You specify the underlying storage location (e.g., an S3 bucket path) and a default branch name (typically "main").
Key considerations:
- The storage namespace must be a valid, accessible object storage path
- Each repository has exactly one storage namespace that cannot be changed after creation
- The default branch is created automatically and serves as the trunk
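The repository-creation step maps to a single REST call. The snippet below is a minimal Python sketch using only the standard library; the server address, credentials, and the `POST /repositories` endpoint shape are assumptions to adapt to your lakeFS deployment.

```python
import base64
import json
import urllib.request

# Assumed local lakeFS server and credentials -- adjust for your deployment.
LAKEFS_API = "http://localhost:8000/api/v1"
AUTH = "Basic " + base64.b64encode(b"ACCESS_KEY:SECRET_KEY").decode()

def create_repo_body(name: str, storage_namespace: str,
                     default_branch: str = "main") -> dict:
    """Request body for POST /repositories.

    One repository, one fixed storage namespace; the default branch
    is created automatically and serves as the trunk.
    """
    return {
        "name": name,
        "storage_namespace": storage_namespace,  # e.g. "s3://my-bucket/prefix"
        "default_branch": default_branch,
    }

def create_repo(name: str, storage_namespace: str) -> dict:
    req = urllib.request.Request(
        f"{LAKEFS_API}/repositories",
        data=json.dumps(create_repo_body(name, storage_namespace)).encode(),
        headers={"Content-Type": "application/json", "Authorization": AUTH},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# create_repo("analytics", "s3://my-bucket/analytics")
```

Because the storage namespace cannot change after creation, choose a bucket path that is dedicated to this repository.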
Step 2: Branch Creation
Create an isolated branch from an existing reference (branch, tag, or commit). Branching in lakeFS is a metadata-only operation that creates an independent line of development without copying any data. This enables safe experimentation, ETL testing, and parallel workstreams.
Key considerations:
- Branches are zero-copy — no data duplication occurs
- Multiple branches can exist simultaneously for parallel development
- Branch names follow the same conventions as Git branch names
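Since branching is metadata-only, the corresponding API call is a small POST. The sketch below builds the request; the endpoint path and field names are assumptions based on the lakeFS REST API.

```python
# Zero-copy branch creation: POST /repositories/{repo}/branches.
# No objects are copied -- the new branch starts as a pointer to the
# head commit of the source reference.

def create_branch_request(repo: str, name: str, source: str) -> tuple[str, dict]:
    """Return (endpoint path, JSON body) for creating a branch.

    `source` may be any reference: a branch name, a tag, or a commit ID.
    """
    return f"/repositories/{repo}/branches", {"name": name, "source": source}

path, body = create_branch_request("analytics", "etl-dev", "main")
```

Creating a throwaway branch per ETL run is cheap for exactly this reason: the operation costs the same regardless of how much data the repository holds.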
Step 3: Data Upload
Upload data objects to a branch. Objects can be individual files or structured data (Parquet, CSV, JSON, etc.). lakeFS supports both direct API uploads and S3-compatible protocol uploads. Objects are stored in the underlying object storage and tracked by lakeFS metadata.
Key considerations:
- Objects are uploaded to a specific branch
- Uploads are staged (uncommitted) until explicitly committed
- Multipart uploads are supported for large files
- Object metadata (content type, user metadata) can be attached during upload
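Uploads target a branch-scoped objects endpoint, with the object path carried as a query parameter. The sketch below only builds that URL; the endpoint shape and the multipart field name are assumptions about the lakeFS REST API, and large files would more typically go through the S3-compatible gateway.

```python
from urllib.parse import quote

def upload_url(repo: str, branch: str, path: str) -> str:
    """Endpoint for staging an object on a branch.

    The upload stays uncommitted until an explicit commit. The object
    path is fully query-encoded so nested "directory" paths survive.
    """
    return (f"/repositories/{repo}/branches/{branch}/objects"
            f"?path={quote(path, safe='')}")

url = upload_url("analytics", "etl-dev", "raw/events/2024-01-01.parquet")
# POST this URL with the file bytes as a multipart/form-data field
# (assumed field name "content"); user metadata rides along as headers
# or form fields depending on the client used.
```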
Step 4: Commit
Create an atomic, immutable snapshot of all staged changes on a branch. A commit records the complete state of the branch at a point in time, including all added, modified, and deleted objects. Each commit receives a unique identifier and can carry a message and custom metadata.
Key considerations:
- Commits are atomic — all staged changes are captured together
- Commit messages and metadata provide audit trail context
- Each commit has a unique ID that can be used as a reference
- Commits are immutable once created
Step 5: Diff and Review
Compare the state of data between any two references (branches, commits, or tags). The diff operation returns a list of changes (additions, modifications, deletions) between the two references, enabling review before merging or identifying what changed between versions.
Key considerations:
- Diffs can compare branches, commits, tags, or any combination
- Uncommitted changes on a branch can also be listed separately
- Results include path, change type, and size information
- Supports prefix-based filtering for targeted comparisons
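Diff is a read-only GET between two references. The sketch below builds the URL with optional prefix filtering; the path layout is an assumption based on the lakeFS REST API.

```python
from urllib.parse import urlencode

def diff_url(repo: str, left: str, right: str, prefix: str = "") -> str:
    """GET endpoint comparing any two refs (branches, commits, or tags).

    `prefix` narrows the comparison to one subtree of the repository;
    each result entry carries the path, change type, and size.
    """
    url = f"/repositories/{repo}/refs/{left}/diff/{right}"
    if prefix:
        url += "?" + urlencode({"prefix": prefix})
    return url

url = diff_url("analytics", "main", "etl-dev", prefix="raw/events/")
```

The same pattern works for pre-merge review (branch vs. branch) and for change auditing (commit vs. commit or tag vs. tag).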
Step 6: Merge
Integrate changes from one branch into another. The merge operation applies all committed changes from a source branch to a destination branch, creating a merge commit. Conflict resolution strategies (source wins, destination wins) can be specified.
Key considerations:
- Only committed changes on the source branch are merged — staged (uncommitted) changes are never carried across
- Conflict resolution strategies handle divergent changes
- The merge creates a new commit on the destination branch
- Merge metadata records the strategy used and source branch
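A merge request names the source reference and destination branch in the URL and carries the strategy in the body. The sketch below builds both; the endpoint path and the `"source-wins"` / `"dest-wins"` strategy names are assumptions drawn from the lakeFS REST API.

```python
def merge_request(repo: str, source: str, destination: str,
                  strategy: str = "") -> tuple[str, dict]:
    """(endpoint path, body) for merging `source` into `destination`.

    With no strategy, conflicting changes fail the merge so they can be
    reviewed; a strategy resolves conflicts automatically in favor of
    one side.
    """
    body = {"message": f"Merge {source} into {destination}"}
    if strategy:
        body["strategy"] = strategy
    return f"/repositories/{repo}/refs/{source}/merge/{destination}", body

path, body = merge_request("analytics", "etl-dev", "main",
                           strategy="source-wins")
```

Promoting validated ETL output is the typical use: merge the development branch into main only after the diff from Step 5 has been reviewed.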
Step 7: Tag Creation
Create a named, immutable reference pointing to a specific commit. Tags serve as stable bookmarks for important data states such as production releases, model training snapshots, or regulatory compliance checkpoints. Unlike branches, tags do not advance with new commits.
Key considerations:
- Tags provide human-readable names for specific data versions
- Tags are immutable — they always point to the same commit
- Tags enable reproducible data access (e.g., for ML model retraining)
- Tags can be listed, filtered by prefix, and deleted
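Tag creation resolves a reference to a commit once and pins it. A minimal sketch of the request body, assuming the `POST /repositories/{repo}/tags` endpoint; the tag name shown is an illustrative example:

```python
def create_tag_body(tag_id: str, ref: str) -> dict:
    """Body for creating an immutable tag.

    `ref` (a branch, commit ID, or another tag) is resolved to a commit
    at creation time; the tag then points at that commit permanently,
    unlike a branch, which advances with new commits.
    """
    return {"id": tag_id, "ref": ref}

# Pin the exact data used for a model-training run:
body = create_tag_body("train-2024-01", "main")
```

Reading data through the tag later (e.g. for model retraining) returns byte-identical objects regardless of how main has moved on.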