Workflow: Treeverse lakeFS Data Version Control with Branches
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, Data_Version_Control, Data_Lake_Management |
| Last Updated | 2026-02-08 10:00 GMT |
Overview
End-to-end process for versioning data lake objects using Git-like branching, committing, diffing, merging, and tagging operations via the lakeFS API.
Description
This workflow describes the core data version control lifecycle in lakeFS. It covers creating a versioned repository backed by object storage (AWS S3, Azure Blob Storage, or Google Cloud Storage), creating isolated branches for experimentation or ETL development, uploading and managing data objects, committing snapshots, comparing changes via diffs, merging branches to promote validated data, and tagging releases for reproducibility. The entire process mirrors Git semantics but operates on data lake objects at scale.
Usage
Execute this workflow when you need to manage data lake objects with version control semantics. Typical triggers include: starting a new data project that requires reproducibility, setting up isolated development and testing environments for ETL pipelines, needing to track and audit data changes over time, or requiring atomic rollback capabilities for production data.
Execution Steps
Step 1: Repository Creation
Create a new lakeFS repository linked to an object storage namespace. The repository serves as the top-level container for all versioned data, analogous to a Git repository. You specify the underlying storage location (e.g., an S3 bucket path) and a default branch name (typically "main").
Key considerations:
- The storage namespace must be a valid, accessible object storage path
- Each repository has exactly one storage namespace that cannot be changed after creation
- The default branch is created automatically and serves as the trunk
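The repository-creation step maps to a single REST call. The snippet below is a minimal Python sketch using only the standard library; the server address, credentials, and the `POST /repositories` endpoint shape are assumptions to adapt to your lakeFS deployment.

```python
import base64
import json
import urllib.request

# Assumed local lakeFS server and credentials -- adjust for your deployment.
LAKEFS_API = "http://localhost:8000/api/v1"
AUTH = "Basic " + base64.b64encode(b"ACCESS_KEY:SECRET_KEY").decode()

def create_repo_body(name: str, storage_namespace: str,
                     default_branch: str = "main") -> dict:
    """Request body for POST /repositories.

    One repository, one fixed storage namespace; the default branch
    is created automatically and serves as the trunk.
    """
    return {
        "name": name,
        "storage_namespace": storage_namespace,  # e.g. "s3://my-bucket/prefix"
        "default_branch": default_branch,
    }

def create_repo(name: str, storage_namespace: str) -> dict:
    req = urllib.request.Request(
        f"{LAKEFS_API}/repositories",
        data=json.dumps(create_repo_body(name, storage_namespace)).encode(),
        headers={"Content-Type": "application/json", "Authorization": AUTH},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# create_repo("analytics", "s3://my-bucket/analytics")
```

Because the storage namespace cannot change after creation, choose a bucket path that is dedicated to this repository.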
Step 2: Branch Creation
Create an isolated branch from an existing reference (branch, tag, or commit). Branching in lakeFS is a metadata-only operation that creates an independent line of development without copying any data. This enables safe experimentation, ETL testing, and parallel workstreams.
Key considerations:
- Branches are zero-copy — no data duplication occurs
- Multiple branches can exist simultaneously for parallel development
- Branch names follow the same conventions as Git branch names
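Since branching is metadata-only, the corresponding API call is a small POST. The sketch below builds the request; the endpoint path and field names are assumptions based on the lakeFS REST API.

```python
# Zero-copy branch creation: POST /repositories/{repo}/branches.
# No objects are copied -- the new branch starts as a pointer to the
# head commit of the source reference.

def create_branch_request(repo: str, name: str, source: str) -> tuple[str, dict]:
    """Return (endpoint path, JSON body) for creating a branch.

    `source` may be any reference: a branch name, a tag, or a commit ID.
    """
    return f"/repositories/{repo}/branches", {"name": name, "source": source}

path, body = create_branch_request("analytics", "etl-dev", "main")
```

Creating a throwaway branch per ETL run is cheap for exactly this reason: the operation costs the same regardless of how much data the repository holds.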
Step 3: Data Upload
Upload data objects to a branch. Objects can be individual files or structured data (Parquet, CSV, JSON, etc.). lakeFS supports both direct API uploads and S3-compatible protocol uploads. Objects are stored in the underlying object storage and tracked by lakeFS metadata.
Key considerations:
- Objects are uploaded to a specific branch
- Uploads are staged (uncommitted) until explicitly committed
- Multipart uploads are supported for large files
- Object metadata (content type, user metadata) can be attached during upload
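Uploads target a branch-scoped objects endpoint, with the object path carried as a query parameter. The sketch below only builds that URL; the endpoint shape and the multipart field name are assumptions about the lakeFS REST API, and large files would more typically go through the S3-compatible gateway.

```python
from urllib.parse import quote

def upload_url(repo: str, branch: str, path: str) -> str:
    """Endpoint for staging an object on a branch.

    The upload stays uncommitted until an explicit commit. The object
    path is fully query-encoded so nested "directory" paths survive.
    """
    return (f"/repositories/{repo}/branches/{branch}/objects"
            f"?path={quote(path, safe='')}")

url = upload_url("analytics", "etl-dev", "raw/events/2024-01-01.parquet")
# POST this URL with the file bytes as a multipart/form-data field
# (assumed field name "content"); user metadata rides along as headers
# or form fields depending on the client used.
```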
Step 4: Commit
Create an atomic, immutable snapshot of all staged changes on a branch. A commit records the complete state of the branch at a point in time, including all added, modified, and deleted objects. Each commit receives a unique identifier and can carry a message and custom metadata.
Key considerations:
- Commits are atomic — all staged changes are captured together
- Commit messages and metadata provide audit trail context
- Each commit has a unique ID that can be used as a reference
- Commits are immutable once created
Step 5: Diff and Review
Compare the state of data between any two references (branches, commits, or tags). The diff operation returns a list of changes (additions, modifications, deletions) between the two references, enabling review before merging or identifying what changed between versions.
Key considerations:
- Diffs can compare branches, commits, tags, or any combination
- Uncommitted changes on a branch can also be listed separately
- Results include path, change type, and size information
- Supports prefix-based filtering for targeted comparisons
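Diff is a read-only GET between two references. The sketch below builds the URL with optional prefix filtering; the path layout is an assumption based on the lakeFS REST API.

```python
from urllib.parse import urlencode

def diff_url(repo: str, left: str, right: str, prefix: str = "") -> str:
    """GET endpoint comparing any two refs (branches, commits, or tags).

    `prefix` narrows the comparison to one subtree of the repository;
    each result entry carries the path, change type, and size.
    """
    url = f"/repositories/{repo}/refs/{left}/diff/{right}"
    if prefix:
        url += "?" + urlencode({"prefix": prefix})
    return url

url = diff_url("analytics", "main", "etl-dev", prefix="raw/events/")
```

The same pattern works for pre-merge review (branch vs. branch) and for change auditing (commit vs. commit or tag vs. tag).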
Step 6: Merge
Integrate changes from one branch into another. The merge operation applies all committed changes from a source branch to a destination branch, creating a merge commit. Conflict resolution strategies (source wins, destination wins) can be specified.
Key considerations:
- Only committed changes on the source branch are merged — staged (uncommitted) changes are never carried across
- Conflict resolution strategies handle divergent changes
- The merge creates a new commit on the destination branch
- Merge metadata records the strategy used and source branch
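A merge request names the source reference and destination branch in the URL and carries the strategy in the body. The sketch below builds both; the endpoint path and the `"source-wins"` / `"dest-wins"` strategy names are assumptions drawn from the lakeFS REST API.

```python
def merge_request(repo: str, source: str, destination: str,
                  strategy: str = "") -> tuple[str, dict]:
    """(endpoint path, body) for merging `source` into `destination`.

    With no strategy, conflicting changes fail the merge so they can be
    reviewed; a strategy resolves conflicts automatically in favor of
    one side.
    """
    body = {"message": f"Merge {source} into {destination}"}
    if strategy:
        body["strategy"] = strategy
    return f"/repositories/{repo}/refs/{source}/merge/{destination}", body

path, body = merge_request("analytics", "etl-dev", "main",
                           strategy="source-wins")
```

Promoting validated ETL output is the typical use: merge the development branch into main only after the diff from Step 5 has been reviewed.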
Step 7: Tag Creation
Create a named, immutable reference pointing to a specific commit. Tags serve as stable bookmarks for important data states such as production releases, model training snapshots, or regulatory compliance checkpoints. Unlike branches, tags do not advance with new commits.
Key considerations:
- Tags provide human-readable names for specific data versions
- Tags are immutable — they always point to the same commit
- Tags enable reproducible data access (e.g., for ML model retraining)
- Tags can be listed, filtered by prefix, and deleted
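Tag creation resolves a reference to a commit once and pins it. A minimal sketch of the request body, assuming the `POST /repositories/{repo}/tags` endpoint; the tag name shown is an illustrative example:

```python
def create_tag_body(tag_id: str, ref: str) -> dict:
    """Body for creating an immutable tag.

    `ref` (a branch, commit ID, or another tag) is resolved to a commit
    at creation time; the tag then points at that commit permanently,
    unlike a branch, which advances with new commits.
    """
    return {"id": tag_id, "ref": ref}

# Pin the exact data used for a model-training run:
body = create_tag_body("train-2024-01", "main")
```

Reading data through the tag later (e.g. for model retraining) returns byte-identical objects regardless of how main has moved on.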