

Principle:Treeverse LakeFS Tag Creation

From Leeroopedia


Knowledge Sources
Domains Data_Version_Control, Data_Engineering
Last Updated 2026-02-08 00:00 GMT

Overview

Tag creation in data version control establishes immutable, human-readable reference points for specific data versions, enabling reproducible access to known-good data states.

Description

A tag in lakeFS is an immutable, named pointer to a specific commit. Unlike a branch, which is a mutable pointer that advances with each new commit, a tag stays fixed to the commit it was created on unless it is explicitly deleted or force-overwritten. Tags are analogous to Git tags and serve as stable, human-readable labels for important data milestones.

Tags are used to mark significant points in the data version history:

  • Release markers: Label a commit as a specific data release (e.g., v1.0, release-2026-02).
  • Data snapshots: Create named references to the state of data at a specific point in time (e.g., end-of-quarter-q4-2025).
  • Training data versions: Mark the exact data state used for a machine learning model training run (e.g., model-v3-training-data).
  • Audit anchors: Provide stable references for compliance and audit requirements that must point to a specific, unchangeable data state.

Key properties of tags:

  • Immutability: Once created, a tag keeps pointing to the same commit, so resolving it at any later time yields the same data state. This is what makes tagged data reproducible.
  • Human-readable naming: Tags provide meaningful names for commit IDs that would otherwise be opaque hashes.
  • Lightweight: Tags are simple pointer records and do not duplicate any data.
  • Addressable: Tags can be used wherever a commit reference is accepted (for example, as a diff endpoint, as a merge source, or as the starting point of a new branch).

Usage

Tag creation is appropriate in the following scenarios:

  • Data release management: Tag each production data release for reproducible access by downstream consumers.
  • ML experiment tracking: Tag the training data commit for each model version to ensure exact reproducibility.
  • Regulatory compliance: Create immutable references to data states required by audit or regulatory frameworks.
  • Pipeline checkpoints: Tag successful pipeline completions to enable rapid rollback if subsequent runs produce issues.
  • Data sharing: Provide stable references that external teams or systems can use to access specific data versions.
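As a rough illustration of what creating a tag can look like in practice, the sketch below builds (but does not send) an HTTP request. It assumes the lakeFS REST API exposes tag creation as a POST to /api/v1/repositories/{repository}/tags with a JSON body carrying `id` (the tag name) and `ref` (the commit or branch to tag); check this against the API reference for your lakeFS version. The helper name, server URL, and repository name are hypothetical, and authentication is omitted.

```python
import json
import urllib.request


def build_create_tag_request(base_url, repository, tag_id, ref):
    """Build (but do not send) a request that would create a tag.

    Assumption: the endpoint path and the "id"/"ref" body fields follow the
    lakeFS REST API; verify them against the API reference before relying on this.
    """
    url = f"{base_url.rstrip('/')}/api/v1/repositories/{repository}/tags"
    body = {"id": tag_id, "ref": ref}
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical server URL and repository name, for illustration only.
req = build_create_tag_request(
    "http://localhost:8000", "example-repo", "release-2026-02", "main"
)
```

Passing a branch name such as main as the reference would tag whatever commit the branch points to at creation time. Passing an explicit commit ID removes that timing dependency.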

Theoretical Basis

Tags as immutable references:

A tag is formally defined as:

tag(name) -> commit_id (immutable)

This contrasts with a branch:

branch(name) -> commit_id (mutable, advances with each new commit)

The immutability of tags is a critical property because it guarantees that any system referencing a tag will always see the exact same data, regardless of when the reference is resolved. This is essential for reproducibility and audit requirements.
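The contrast between the two definitions can be sketched with a toy in-memory model. This is not the lakeFS implementation or API; the `RefStore` class and its methods are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class RefStore:
    """Toy in-memory model of branch and tag pointers (not the lakeFS API)."""

    branches: dict = field(default_factory=dict)
    tags: dict = field(default_factory=dict)

    def commit(self, branch, commit_id):
        # A branch is mutable: each new commit moves the pointer forward.
        self.branches[branch] = commit_id

    def create_tag(self, name, commit_id):
        # A tag is write-once: reusing an existing name is rejected.
        if name in self.tags:
            raise ValueError(f"tag {name!r} already exists")
        self.tags[name] = commit_id


store = RefStore()
store.commit("main", "c1")
store.create_tag("v1.0", store.branches["main"])
store.commit("main", "c2")  # the branch advances; the tag stays on c1
```

After the second commit, main resolves to c2 while v1.0 still resolves to c1, which is the behavior a downstream consumer of the tag relies on.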

Tags vs. commit IDs:

Commit IDs are themselves immutable references, but they are opaque hashes (e.g., a1b2c3d4e5f6) that carry no meaning for a human reader. Tags add a semantic layer on top of commit IDs:

Property         | Commit ID              | Tag
Immutable        | Yes                    | Yes
Human-readable   | No                     | Yes
Semantic meaning | None (content hash)    | User-defined (e.g., "v1.0")
Discovery        | Requires log traversal | Listed in tag index

Force-create semantics:

While tags are conceptually immutable, lakeFS supports a force flag that allows an existing tag to be overwritten. This is a destructive operation that should be used with extreme caution, as it breaks the immutability guarantee for any system that was relying on the previous tag target. Force-create is primarily intended for correcting tagging errors shortly after creation.
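The force semantics described here can be modelled in a few lines. The sketch uses a plain dictionary as a stand-in tag index; the function and exception names are hypothetical and do not correspond to lakeFS client calls.

```python
class TagExistsError(Exception):
    """Raised when a tag name is already taken and force is not set."""


def create_tag(tags, name, commit_id, force=False):
    """Create a tag in a toy dict-based index and return the previous target, if any."""
    previous = tags.get(name)
    if previous is not None and not force:
        raise TagExistsError(f"tag {name!r} already points to {previous!r}")
    # With force=True the old target is silently replaced: the destructive case.
    tags[name] = commit_id
    return previous


tags = {}
create_tag(tags, "v1.0", "c1")

try:
    create_tag(tags, "v1.0", "c2")  # rejected: the tag already exists
    rejected = False
except TagExistsError:
    rejected = True

replaced = create_tag(tags, "v1.0", "c2", force=True)  # overwrites c1 with c2
```

Returning the previous target is a design choice of this sketch: it lets a caller log which commit the tag used to point to before it was overwritten.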

Tags in the version DAG:

Tags do not participate in the commit DAG structure; they are external labels that point into the DAG. Creating or deleting a tag does not modify the commit history, branch structure, or any data objects. Tags are purely navigational aids.
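A minimal sketch of this separation, assuming a toy commit graph stored as a dictionary of parent lists and a separate tag index; all names here are illustrative and not part of lakeFS.

```python
import copy

# Toy commit DAG: each commit ID maps to the list of its parent commit IDs.
commits = {"c1": [], "c2": ["c1"], "c3": ["c2"]}
# Tags live in a separate index that only points into the DAG.
tags = {}


def resolve(ref):
    """Return the commit ID for a tag name, or the ref itself if it is a commit ID."""
    commit_id = tags.get(ref, ref)
    if commit_id not in commits:
        raise KeyError(f"unknown reference: {ref!r}")
    return commit_id


before = copy.deepcopy(commits)

tags["end-of-quarter-q4-2025"] = "c2"         # create a tag
resolved = resolve("end-of-quarter-q4-2025")  # the tag resolves into the DAG
del tags["end-of-quarter-q4-2025"]            # delete the tag

dag_unchanged = commits == before  # creating and deleting the tag left the DAG as it was
```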
