Principle:Treeverse LakeFS Branch Creation

From Leeroopedia


Knowledge Sources
Domains: Data_Version_Control, Data_Engineering
Last Updated: 2026-02-08 00:00 GMT

Overview

Branch creation in data version control provides isolated workspaces for data experimentation without affecting production data.

Description

A branch in lakeFS is a mutable pointer to a specific commit, representing an isolated workspace where data changes can be staged, tested, and reviewed independently from other branches. Branch creation is analogous to git branch or git checkout -b in software version control: it produces a lightweight copy-on-write reference that initially points to the same data as its source, but can diverge independently.

The power of branching for data lies in zero-copy isolation. When a branch is created from an existing branch or commit, no data is physically duplicated. Instead, lakeFS creates a new pointer that shares the same underlying data objects. Only when changes are made on the new branch does divergence occur, and even then, only the changed objects consume additional storage.

Key properties of lakeFS branches include:

  • Isolation: Changes on one branch do not affect other branches until explicitly merged.
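  • Zero-copy creation: Branching is a metadata-only operation. It takes roughly the same time regardless of how much data the source holds, and it stores no additional data objects at creation time.
  • Mutable pointer: Unlike a tag, which stays fixed on one commit, a branch moves forward as new commits are made on it.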
  • Source flexibility: A branch can be created from any existing branch name, commit ID, or tag.
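The source-flexibility and mutable-pointer properties can be illustrated with a minimal in-memory sketch. This is a toy model, not the lakeFS API: the names `commits`, `branches`, `tags`, `resolve`, and `create_branch` are hypothetical and chosen only for illustration.

```python
# Toy reference store (hypothetical names; not the lakeFS API).
commits = {"c0", "c1"}        # known commit IDs
branches = {"main": "c1"}     # mutable pointers: name -> commit ID
tags = {"v1": "c0"}           # fixed pointers: name -> commit ID


def resolve(ref: str) -> str:
    """Return the commit ID that a branch name, tag, or commit ID refers to."""
    if ref in branches:
        return branches[ref]
    if ref in tags:
        return tags[ref]
    if ref in commits:
        return ref
    raise KeyError(f"unknown reference: {ref}")


def create_branch(name: str, source: str) -> str:
    """Create a new pointer at the source's commit; no data objects are touched.

    Handling of an already existing name is omitted for brevity.
    """
    branches[name] = resolve(source)
    return branches[name]


from_branch = create_branch("experiment", "main")  # source is a branch name
from_tag = create_branch("hotfix", "v1")           # source is a tag
from_commit = create_branch("audit", "c0")         # source is a commit ID
```

In this sketch all three calls reduce to the same step: resolve the source to a commit ID, then store that ID under the new branch name.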

Usage

Branch creation is used in a variety of data engineering and data science workflows:

  • Experimentation: Create a feature branch to test new data transformations, model training data changes, or schema migrations without risking production data.
  • A/B testing: Maintain parallel branches with different data configurations to compare outcomes.
  • Pipeline staging: Use branches to stage data pipeline outputs for validation before promoting to the main branch.
  • Environment isolation: Create branches for development, QA, and staging environments that mirror but do not affect production.
  • Rollback preparation: Before risky operations, create a branch as a checkpoint to easily revert if needed.

Theoretical Basis

Branch creation in data version control is grounded in the copy-on-write (CoW) model from file systems and version control theory:

Copy-on-write semantics:

When a branch is created from a source reference, the new branch initially shares the same commit pointer and all underlying data objects. The system records references to the existing objects rather than duplicating them. A write to the new branch creates a new object version only for the object that was modified, while unchanged objects continue to be shared with the source.
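A rough way to see these semantics is to count physical objects before and after each step. The sketch below is a simplified model under stated assumptions: objects are keyed by a SHA-256 content address, and a snapshot is a plain dictionary from path to address. Neither choice is claimed to match lakeFS internals, and a real system would copy only the commit pointer rather than the whole path mapping.

```python
import hashlib

objects = {}  # physical storage: content address -> bytes (illustrative assumption)


def put(data: bytes) -> str:
    """Store data and return its content address."""
    address = hashlib.sha256(data).hexdigest()
    objects[address] = data
    return address


# A snapshot maps logical paths to object addresses.
main = {
    "users.csv": put(b"id,name\n1,ada\n"),
    "orders.csv": put(b"id,total\n1,9.5\n"),
}
stored_before = len(objects)

# Branch creation copies references only; no object is duplicated.
experiment = dict(main)
stored_after_branch = len(objects)

# A write on the branch stores one new object, only for the modified path.
experiment["users.csv"] = put(b"id,name\n1,ada\n2,grace\n")
stored_after_write = len(objects)

# Paths whose objects are still shared between the two branches.
shared = [path for path in main if main[path] == experiment[path]]
```

Here the object count stays the same across branch creation and grows by exactly one after the single write, while `orders.csv` remains shared.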

Branch as a mutable reference:

Formally, a branch is a named mutable reference of the form:

branch(name) -> commit_id

Each new commit on the branch advances this pointer:

branch(name) -> new_commit_id where new_commit.parent = old_commit_id
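The pointer-advance rule above can be written out as a small sketch. The `Commit` class and the `commit` function are hypothetical helpers for illustration, and commit IDs are passed in by hand rather than generated.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Commit:
    commit_id: str
    parent: Optional[str]
    message: str


commits = {"c0": Commit("c0", None, "initial import")}
branches = {"main": "c0"}  # branch(name) -> commit_id


def commit(branch: str, commit_id: str, message: str) -> Commit:
    """Record a commit whose parent is the branch head, then advance the branch."""
    old_commit_id = branches[branch]
    new_commit = Commit(commit_id, parent=old_commit_id, message=message)
    commits[commit_id] = new_commit
    branches[branch] = commit_id  # the mutable pointer moves forward
    return new_commit


branches["experiment"] = branches["main"]  # branch creation: copy the pointer
new_commit = commit("experiment", "c1", "add cleaned rows")
```

After the call, `experiment` points at `c1`, whose parent is `c0`, while `main` still points at `c0`.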

Isolation model:

The isolation guarantee can be described as follows:

  1. Let B1 and B2 be two branches, both derived from commit C0.
  2. Any write W applied to B1 is invisible to readers of B2.
  3. Only an explicit merge operation can incorporate changes from B1 into B2 (or vice versa).

This model enables safe parallel experimentation while maintaining the ability to integrate changes through controlled merge operations.
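The three statements of the isolation model can be traced in a toy example. The `merge` function below is a deliberately simplified assumption: it applies every path that the source changed relative to the common base and performs no conflict detection, so it should not be read as how lakeFS merges.

```python
def merge(source: dict, destination: dict, base: dict) -> dict:
    """Apply to destination each path that source changed relative to base.

    Simplified for illustration: conflicts are not detected.
    """
    merged = dict(destination)
    for path in source.keys() | base.keys():
        if source.get(path) != base.get(path):
            if path in source:
                merged[path] = source[path]   # added or modified on source
            else:
                merged.pop(path, None)        # deleted on source
    return merged


c0 = {"events.parquet": "v1"}
b1, b2 = dict(c0), dict(c0)      # 1. both branches derive from commit C0

b1["events.parquet"] = "v2"      # 2. write W applied to B1
b2_before_merge = dict(b2)       #    what a reader of B2 sees at this point

b2 = merge(source=b1, destination=b2, base=c0)  # 3. explicit merge of B1 into B2
```

Before the merge, B2 still reads `v1`; only the explicit `merge` call makes the write from B1 visible there.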

Force creation:

When the force flag is set, an existing branch with the same name is reset to point to the specified source commit. The branch no longer points to its previous head, so any work that was reachable only through that branch is effectively discarded. This is a destructive operation and should be used with caution.
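The difference between a normal and a forced creation can be sketched as follows. The function `create_branch` and its `force` parameter are a hypothetical in-memory model of the behavior described above, not the signature of any lakeFS client.

```python
def create_branch(branches: dict, name: str, source_commit: str, force: bool = False) -> None:
    """Point `name` at `source_commit`; refuse to overwrite unless force is set."""
    if name in branches and not force:
        raise ValueError(f"branch already exists: {name}")
    branches[name] = source_commit  # with force, the previous head is dropped


branches = {"main": "c5", "experiment": "c9"}

# Without force, an existing branch is left untouched.
try:
    create_branch(branches, "experiment", "c5")
    rejected = False
except ValueError:
    rejected = True

head_after_rejection = branches["experiment"]

# With force, the branch is reset to the source commit.
create_branch(branches, "experiment", "c5", force=True)
```

In this model the forced call leaves nothing pointing at `c9`, which is why the operation is treated as destructive.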

Related Pages

Implemented By
