Workflow:Treeverse LakeFS Data Version Control With Branches

From Leeroopedia


Knowledge Sources
Domains: Data_Engineering, Data_Version_Control, Data_Lake_Management
Last Updated: 2026-02-08 10:00 GMT

Overview

End-to-end process for versioning data lake objects using Git-like branching, committing, diffing, merging, and tagging operations via the lakeFS API.

Description

This workflow describes the core data version control lifecycle in lakeFS. It covers creating a versioned repository backed by object storage (AWS S3, Azure Blob Storage, or Google Cloud Storage), creating isolated branches for experimentation or ETL development, uploading and managing data objects, committing snapshots, comparing changes via diffs, merging branches to promote validated data, and tagging releases for reproducibility. The entire process mirrors Git semantics but operates on data lake objects at scale.

Usage

Execute this workflow when you need to manage data lake objects with version control semantics. Typical triggers include starting a new data project that requires reproducibility, setting up isolated development and testing environments for ETL pipelines, tracking and auditing data changes over time, and rolling back production data atomically.

Execution Steps

Step 1: Repository Creation

Create a new lakeFS repository linked to an object storage namespace. The repository serves as the top-level container for all versioned data, analogous to a Git repository. You specify the underlying storage location (e.g., an S3 bucket path) and a default branch name (typically "main").

Key considerations:

  • The storage namespace must be a valid, accessible object storage path
  • Each repository has exactly one storage namespace that cannot be changed after creation
  • The default branch is created automatically and serves as the trunk
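
As a minimal sketch of this step, the snippet below builds (but does not send) an HTTP request for creating a repository. The server address, repository name, and bucket path are hypothetical. The endpoint path and JSON field names reflect our reading of the lakeFS REST API and should be checked against the API reference for your server version. Authentication is omitted.

```python
import json
import urllib.request

# Assumed server address; replace with your own lakeFS endpoint.
LAKEFS_API = "http://localhost:8000/api/v1"


def build_create_repository_request(name, storage_namespace, default_branch="main"):
    """Build (but do not send) the request that creates a repository."""
    # Cheap sanity check only: a namespace is expected to be a URI such as
    # s3://bucket/prefix. It does not prove the location is accessible.
    if "://" not in storage_namespace:
        raise ValueError("storage namespace should be a URI such as s3://bucket/prefix")
    body = {
        "name": name,
        "storage_namespace": storage_namespace,
        "default_branch": default_branch,
    }
    return urllib.request.Request(
        f"{LAKEFS_API}/repositories",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical repository name and bucket path, for illustration only.
request = build_create_repository_request(
    "example-repo", "s3://example-bucket/repos/example-repo"
)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Because the storage namespace cannot be changed later, validating it before the request is sent is cheaper than recreating the repository.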

Step 2: Branch Creation

Create an isolated branch from an existing reference (branch, tag, or commit). Branching in lakeFS is a metadata-only operation that creates an independent line of development without copying any data. This enables safe experimentation, ETL testing, and parallel workstreams.

Key considerations:

  • Branches are zero-copy — no data duplication occurs
  • Multiple branches can exist simultaneously for parallel development
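  • Branch names are more restrictive than Git branch names: letters, digits, underscores, and hyphens are accepted, so a Git-style name containing a slash (e.g., feature/x) is rejected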
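
The sketch below shows what a branch-creation request could look like. Note that the body carries only a name and a source reference, with no data payload, which matches the metadata-only nature of branching. The server address and the names used are hypothetical, and the endpoint and field names are assumptions to verify against the lakeFS API reference.

```python
import json
import urllib.request
from urllib.parse import quote

# Assumed server address; replace with your own lakeFS endpoint.
LAKEFS_API = "http://localhost:8000/api/v1"


def build_create_branch_request(repository, name, source):
    """Build (but do not send) a request that creates branch `name` from `source`.

    `source` may be a branch name, a tag, or a commit ID.
    """
    body = {"name": name, "source": source}
    return urllib.request.Request(
        f"{LAKEFS_API}/repositories/{quote(repository, safe='')}/branches",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical names: an ETL development branch cut from the trunk.
request = build_create_branch_request("example-repo", "etl-dev", "main")
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```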

Step 3: Data Upload

Upload data objects to a branch. Objects can be files of any format, including structured data such as Parquet, CSV, or JSON. lakeFS accepts uploads both through its own API and through its S3-compatible endpoint. Object data is written to the underlying object storage, while lakeFS tracks it in its own metadata.

Key considerations:

  • Objects are uploaded to a specific branch
  • Uploads are staged (uncommitted) until explicitly committed
  • Multipart uploads are supported for large files
  • Object metadata (content type, user metadata) can be attached during upload
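
A direct API upload might be sketched as follows. The object path travels as a query parameter and the raw bytes as the request body. The server address and all names are hypothetical, and the endpoint shape and the use of an application/octet-stream body are assumptions based on our reading of the lakeFS REST API. Multipart uploads and user metadata are not covered here.

```python
import urllib.request
from urllib.parse import quote, urlencode

# Assumed server address; replace with your own lakeFS endpoint.
LAKEFS_API = "http://localhost:8000/api/v1"


def build_upload_request(repository, branch, path, content):
    """Build (but do not send) a request that uploads `content` to `path` on `branch`."""
    # urlencode percent-encodes the slashes in the object path.
    query = urlencode({"path": path})
    url = (
        f"{LAKEFS_API}/repositories/{quote(repository, safe='')}"
        f"/branches/{quote(branch, safe='')}/objects?{query}"
    )
    return urllib.request.Request(
        url,
        data=content,
        headers={"Content-Type": "application/octet-stream"},
        method="POST",
    )


# Hypothetical object: a small CSV file written to the etl-dev branch.
request = build_upload_request(
    "example-repo", "etl-dev", "raw/events/day1.csv", b"id,value\n1,10\n"
)
print(request.get_method(), request.full_url)
```

After such a request succeeds, the object is visible on the branch but remains uncommitted until the commit step.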

Step 4: Commit

Create an atomic, immutable snapshot of all staged changes on a branch. A commit records the complete state of the branch at a point in time, including all added, modified, and deleted objects. Each commit receives a unique identifier and can carry a message and custom metadata.

Key considerations:

  • Commits are atomic — all staged changes are captured together
  • Commit messages and metadata provide audit trail context
  • Each commit has a unique ID that can be used as a reference
  • Commits are immutable once created
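
A commit request could be sketched as below. The message and metadata are the audit-trail context mentioned above. The sketch converts metadata values to strings on the assumption that the API expects a string-to-string map. The server address, names, and metadata keys are hypothetical, and the endpoint should be verified against the lakeFS API reference.

```python
import json
import urllib.request
from urllib.parse import quote

# Assumed server address; replace with your own lakeFS endpoint.
LAKEFS_API = "http://localhost:8000/api/v1"


def build_commit_request(repository, branch, message, metadata=None):
    """Build (but do not send) a request that commits all staged changes on `branch`."""
    body = {
        "message": message,
        # Assumption: metadata is a flat map of string keys to string values.
        "metadata": {str(k): str(v) for k, v in (metadata or {}).items()},
    }
    url = (
        f"{LAKEFS_API}/repositories/{quote(repository, safe='')}"
        f"/branches/{quote(branch, safe='')}/commits"
    )
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical metadata keys recording which pipeline produced the data.
request = build_commit_request(
    "example-repo", "etl-dev", "Load day 1 events",
    metadata={"pipeline": "daily-etl", "rows": 2},
)
print(request.data.decode("utf-8"))
```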

Step 5: Diff and Review

Compare the state of data between any two references (branches, commits, or tags). The diff operation returns a list of changes (additions, modifications, deletions) between the two references, enabling review before merging or identifying what changed between versions.

Key considerations:

  • Diffs can compare branches, commits, tags, or any combination
  • Uncommitted changes on a branch can also be listed separately
  • Results include path, change type, and size information
  • Supports prefix-based filtering for targeted comparisons
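
To make the shape of a diff result concrete, here is a toy model rather than lakeFS's implementation. It treats each reference as a mapping from object path to checksum and performs a plain two-way comparison with an optional prefix filter. The change-type labels and sample paths are illustrative assumptions.

```python
def diff_snapshots(left, right, prefix=""):
    """List the changes that turn snapshot `left` into snapshot `right`.

    Each snapshot is a dict mapping object path -> checksum.
    Only paths starting with `prefix` are reported.
    """
    changes = []
    for path in sorted(set(left) | set(right)):
        if not path.startswith(prefix):
            continue
        if path not in left:
            changes.append((path, "added"))
        elif path not in right:
            changes.append((path, "removed"))
        elif left[path] != right[path]:
            changes.append((path, "changed"))
    return changes


# Hypothetical snapshots of a trunk and an ETL branch.
main = {"data/a.parquet": "aaa", "data/b.parquet": "bbb", "logs/x.json": "xxx"}
etl = {"data/a.parquet": "aaa2", "data/c.parquet": "ccc", "logs/x.json": "xxx"}

for path, change in diff_snapshots(main, etl, prefix="data/"):
    print(change, path)
# changed data/a.parquet
# removed data/b.parquet
# added data/c.parquet
```

Comparing checksums rather than object contents is what keeps a comparison like this cheap even when the objects themselves are large.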

Step 6: Merge

Integrate changes from one branch into another. The merge operation applies all committed changes from a source branch to a destination branch, creating a merge commit. Conflict resolution strategies (source wins, destination wins) can be specified.

Key considerations:

  • Only committed changes are merged; uncommitted changes on the source branch are not included
  • Conflict resolution strategies handle divergent changes
  • The merge creates a new commit on the destination branch
  • Merge metadata records the strategy used and source branch
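
The role of the conflict resolution strategy can be illustrated with a toy three-way merge over path-to-checksum maps. This is a conceptual sketch, not lakeFS's merge algorithm. The strategy names "source-wins" and "dest-wins" follow our understanding of the values the lakeFS API accepts, and the function and sample data are hypothetical.

```python
def merge_snapshots(base, source, dest, strategy=None):
    """Three-way merge of snapshots (dicts mapping path -> checksum).

    `base` is the common ancestor of `source` and `dest`. A path is in
    conflict when both sides changed it differently since `base`.
    """
    merged = dict(dest)
    conflicts = []
    for path in sorted(set(base) | set(source) | set(dest)):
        b, s, d = base.get(path), source.get(path), dest.get(path)
        if s == b or s == d:
            continue  # source left the path alone, or both sides agree
        if d == b:
            chosen = s  # only the source changed this path
        elif strategy == "source-wins":
            chosen = s
        elif strategy == "dest-wins":
            chosen = d
        else:
            conflicts.append(path)
            continue
        if chosen is None:
            merged.pop(path, None)  # the chosen side deleted the object
        else:
            merged[path] = chosen
    if conflicts:
        raise ValueError(f"conflicting paths: {conflicts}")
    return merged


# Hypothetical snapshots: both sides changed "c" in different ways.
base = {"a": "1", "b": "1", "c": "1"}
source = {"a": "2", "b": "1", "c": "2"}
dest = {"a": "1", "b": "3", "c": "4"}

print(merge_snapshots(base, source, dest, strategy="source-wins"))
print(merge_snapshots(base, source, dest, strategy="dest-wins"))
```

Without a strategy, the sketch raises an error on the conflicting path instead of silently picking a side.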

Step 7: Tag Creation

Create a named, immutable reference pointing to a specific commit. Tags serve as stable bookmarks for important data states such as production releases, model training snapshots, or regulatory compliance checkpoints. Unlike branches, tags do not advance with new commits.

Key considerations:

  • Tags provide human-readable names for specific data versions
  • Tags are immutable — they always point to the same commit
  • Tags enable reproducible data access (e.g., for ML model retraining)
  • Tags can be listed, filtered by prefix, and deleted
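
A tag-creation request could be sketched as follows, pairing a tag ID with the reference it should point to. The server address, tag name, and reference are hypothetical, and the endpoint and field names are assumptions to confirm against the lakeFS API reference.

```python
import json
import urllib.request
from urllib.parse import quote

# Assumed server address; replace with your own lakeFS endpoint.
LAKEFS_API = "http://localhost:8000/api/v1"


def build_create_tag_request(repository, tag_id, ref):
    """Build (but do not send) a request that creates tag `tag_id` pointing at `ref`.

    `ref` may be a commit ID or a branch name; a branch name is resolved
    to the commit it points to at creation time.
    """
    body = {"id": tag_id, "ref": ref}
    return urllib.request.Request(
        f"{LAKEFS_API}/repositories/{quote(repository, safe='')}/tags",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical tag marking the data used for a model training run.
request = build_create_tag_request("example-repo", "training-2024-q1", "main")
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Reading data through the tag name rather than the branch name is what lets a later retraining job see exactly the same objects.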
