
Principle:Treeverse LakeFS Data Upload

From Leeroopedia


Knowledge Sources
Domains Data_Version_Control, Data_Engineering
Last Updated 2026-02-08 00:00 GMT

Overview

Data upload in version-controlled storage stages objects on a branch for subsequent commit, analogous to staging files with git add.

Description

Data upload is the process of writing data objects to a specific branch in a lakeFS repository. When an object is uploaded, it enters a staged (uncommitted) state on the target branch. The object is physically written to the underlying object storage, and lakeFS records metadata that associates the object with the branch. Staged objects are visible on the branch they were uploaded to but are not yet part of any commit, meaning they can be modified or deleted before being committed.

This mechanism mirrors the staging area concept in Git:

  • Staging: Uploading an object places it in the branch's staging area, similar to git add.
  • Visibility: Staged objects are visible to readers of the same branch but not to other branches.
  • Persistence: The underlying data is persisted in object storage immediately, but its association with the branch's version history is not finalized until a commit is made.
  • Metadata overlay: lakeFS maintains a metadata layer that tracks object paths, checksums, sizes, and content types independently of the physical storage layout.

Key aspects of the data upload process:

  • Path-based addressing: Each object is identified by a path within the branch, forming a hierarchical namespace.
  • Content-type awareness: Objects can carry content-type metadata, enabling downstream systems to interpret the data correctly.
  • Conditional writes: Upload supports conditional operations using If-None-Match and If-Match headers to prevent overwrites or ensure updates target a known version.
  • Multipart support: Large objects can be uploaded using multipart/form-data encoding.
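The upload flow above can be sketched against the lakeFS HTTP API. The endpoint shape and the `content` form field follow the lakeFS OpenAPI specification as I understand it; the host, repository, branch, and credentials below are placeholders, and the actual network call is left commented out, so verify the details against your server's API docs.

```python
# Sketch: uploading an object to a lakeFS branch over the HTTP API.
# Endpoint shape is an assumption based on the lakeFS OpenAPI spec.
from urllib.parse import quote

def upload_endpoint(base_url: str, repo: str, branch: str, path: str) -> str:
    """Build the uploadObject endpoint; the object path goes in the query string."""
    return (f"{base_url}/api/v1/repositories/{repo}/branches/{branch}"
            f"/objects?path={quote(path, safe='')}")

url = upload_endpoint("http://localhost:8000", "my-repo", "main",
                      "data/customers/2024/01/records.parquet")

# The actual upload (requires a running lakeFS server and credentials):
# import requests
# with open("records.parquet", "rb") as f:
#     resp = requests.post(
#         url,
#         files={"content": ("records.parquet", f, "application/x-parquet")},
#         auth=("ACCESS_KEY_ID", "SECRET_ACCESS_KEY"),
#     )
# The response metadata would include the path, checksum, size, and content type.
```

After this call the object is staged on `main`: visible to readers of that branch, but not yet part of any commit.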

Usage

Data upload is used whenever new or modified data needs to be introduced into a versioned repository:

  • Ingesting raw data: Upload CSV, Parquet, JSON, or other data files from data pipelines into a staging branch.
  • Updating datasets: Replace or augment existing objects on a branch with new versions.
  • Storing model artifacts: Upload trained model files, feature stores, or evaluation results for versioned tracking.
  • Configuration management: Store pipeline configuration files, schema definitions, or data quality rules alongside the data.
  • Conditional writes: Use If-None-Match: * to ensure an object is only created if it does not already exist, preventing accidental overwrites in concurrent workflows.
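The create-only pattern from the last bullet can be made concrete. The `If-None-Match: *` header is from the article; the helper names below are illustrative, not part of any lakeFS client, and the status mapping follows standard HTTP conditional-request semantics.

```python
# Sketch: create-only upload semantics with If-None-Match: *.
CREATE_ONLY_HEADERS = {"If-None-Match": "*"}

def interpret_create_only(status_code: int) -> str:
    """Map an HTTP status to the create-only outcome.

    200/201: the object was created; 412 Precondition Failed: an object
    already exists at the path, so nothing was overwritten.
    """
    if status_code in (200, 201):
        return "created"
    if status_code == 412:
        return "exists, not overwritten"
    return f"unexpected status {status_code}"
```

In a concurrent pipeline, a 412 tells the writer another worker already claimed the path, so it can safely skip or retry under a new path instead of clobbering data.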

Theoretical Basis

Data upload in version-controlled storage operates on a two-phase write model:

Phase 1: Staging

When an object is uploaded to a branch, it is written to the underlying object storage and a staging entry is created in the lakeFS metadata store. The staging entry records:

  • The object path within the branch namespace
  • The physical address in object storage
  • Checksum (typically MD5 or SHA-256) for integrity verification
  • Size in bytes
  • Content type and user-defined metadata

Phase 2: Commit

The staged object becomes part of the permanent version history only when a commit operation is performed on the branch. Until then, the staging entry can be overwritten, deleted, or rolled back.
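Phase 2 can be sketched as a commit request against the lakeFS HTTP API. The endpoint shape and JSON body (a `message` plus optional free-form `metadata`) reflect the lakeFS OpenAPI specification as I understand it; the actual network call is omitted, so treat this as a sketch to check against your server's docs.

```python
# Sketch: committing staged objects on a branch via the lakeFS HTTP API.
import json

def commit_request(base_url: str, repo: str, branch: str,
                   message: str, metadata=None):
    """Build the URL and JSON body for a commit on `branch`."""
    url = f"{base_url}/api/v1/repositories/{repo}/branches/{branch}/commits"
    body = {"message": message}
    if metadata:
        body["metadata"] = metadata  # free-form key/value commit metadata
    return url, json.dumps(body)

url, body = commit_request("http://localhost:8000", "my-repo", "main",
                           "Ingest January customer records")
# A real call would POST `body` to `url` with Content-Type: application/json.
```

Everything staged on the branch at commit time becomes part of the new commit; until then, re-uploading or deleting the staged entry leaves the version history untouched.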

Object addressing:

Each object in lakeFS has two addresses:

  1. Logical address: The path within the branch (e.g., data/customers/2024/01/records.parquet)
  2. Physical address: The actual location in the underlying object store (e.g., s3://my-bucket/repo-id/data/abc123def456)

lakeFS maintains the mapping between logical and physical addresses, enabling features like branching and merging without data duplication.
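The logical-to-physical mapping can be illustrated with a toy in-memory model. This is a sketch of the concept, not lakeFS internals; the bucket layout and class names are invented for illustration.

```python
# Toy model of the metadata overlay mapping logical paths to physical addresses.
import hashlib

class MetadataOverlay:
    def __init__(self):
        # (branch, logical_path) -> physical address in object storage
        self.entries = {}

    def upload(self, branch: str, path: str, data: bytes) -> str:
        """Stage an object: write data once, record a per-branch pointer."""
        digest = hashlib.sha256(data).hexdigest()[:12]
        physical = f"s3://my-bucket/repo-id/data/{digest}"
        self.entries[(branch, path)] = physical
        return physical

    def create_branch(self, src: str, dst: str) -> None:
        """Branching copies metadata pointers only; no object data moves."""
        for (branch, path), physical in list(self.entries.items()):
            if branch == src:
                self.entries[(dst, path)] = physical

overlay = MetadataOverlay()
phys = overlay.upload("main", "data/customers/records.parquet", b"rows...")
overlay.create_branch("main", "dev")
# Both branches now resolve the same logical path to the same physical
# object, which is why branching duplicates no data.
```

The same decoupling is what makes merges cheap: a merge reconciles pointers in the metadata layer rather than copying objects between prefixes.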

Conditional write semantics:

  • If-None-Match: * ensures the upload succeeds only if no object exists at the target path (create-only semantics).
  • If-Match: <etag> ensures the upload succeeds only if the existing object matches the specified ETag (compare-and-swap semantics).

These conditional operations enable safe concurrent writes to the same branch.
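The two conditional modes can be mirrored in a toy in-memory store. Real enforcement happens server-side in lakeFS; this sketch only models the HTTP semantics (412 = Precondition Failed), and the class name and MD5-as-ETag choice are illustrative assumptions.

```python
# Toy store enforcing If-None-Match: * (create-only) and If-Match (CAS).
import hashlib

class ConditionalStore:
    def __init__(self):
        self.objects = {}  # path -> (etag, data)

    def put(self, path, data: bytes, if_match=None, if_none_match=None):
        current = self.objects.get(path)
        if if_none_match == "*" and current is not None:
            return 412  # create-only: path already exists
        if if_match is not None and (current is None or current[0] != if_match):
            return 412  # compare-and-swap: a concurrent writer won
        etag = hashlib.md5(data).hexdigest()
        self.objects[path] = (etag, data)
        return 200

store = ConditionalStore()
assert store.put("cfg.json", b"v1", if_none_match="*") == 200
etag = store.objects["cfg.json"][0]
assert store.put("cfg.json", b"v2", if_match=etag) == 200  # CAS succeeds
assert store.put("cfg.json", b"v3", if_match=etag) == 412  # stale ETag loses
```

A writer that receives 412 on an `If-Match` put re-reads the object, recomputes its change against the fresh ETag, and retries, which is the standard compare-and-swap loop.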

Related Pages

Implemented By
