Principle:Treeverse LakeFS Data Upload
| Knowledge Sources | |
|---|---|
| Domains | Data_Version_Control, Data_Engineering |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
Data upload in version-controlled storage stages objects on a branch for subsequent commit, analogous to staging files with git add.
Description
Data upload is the process of writing data objects to a specific branch in a lakeFS repository. When an object is uploaded, it enters a staged (uncommitted) state on the target branch. The object is physically written to the underlying object storage, and lakeFS records metadata that associates the object with the branch. Staged objects are visible on the branch they were uploaded to but are not yet part of any commit, meaning they can be modified or deleted before being committed.
This mechanism mirrors the staging area concept in Git:
- Staging: Uploading an object places it in the branch's staging area, similar to `git add`.
- Visibility: Staged objects are visible to readers of the same branch but not to other branches.
- Persistence: The underlying data is persisted in object storage immediately, but its association with the branch's version history is not finalized until a commit is made.
- Metadata overlay: lakeFS maintains a metadata layer that tracks object paths, checksums, sizes, and content types independently of the physical storage layout.
Key aspects of the data upload process:
- Path-based addressing: Each object is identified by a path within the branch, forming a hierarchical namespace.
- Content-type awareness: Objects can carry content-type metadata, enabling downstream systems to interpret the data correctly.
- Conditional writes: Upload supports conditional operations using `If-None-Match` and `If-Match` headers to prevent overwrites or ensure updates target a known version.
- Multipart support: Large objects can be uploaded using multipart form data encoding.
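A minimal sketch of how these aspects might map onto a single HTTP request, using only the Python standard library. The route shape (`/repositories/{repository}/branches/{branch}/objects` with a `path` query parameter), the host `lakefs.example.com`, and the names `build_upload_request` and `create_only` are assumptions for illustration. Authentication is omitted, and the request is built but never sent; check the API reference for your lakeFS version before relying on any of it.

```python
from urllib.parse import quote, urlencode
from urllib.request import Request


def build_upload_request(base_url, repository, branch, path, data,
                         content_type="application/octet-stream",
                         create_only=False):
    """Build (but do not send) a request that uploads `data` to `path` on `branch`.

    Hypothetical helper: the route layout is an assumption, not a confirmed API.
    """
    url = (
        f"{base_url}/repositories/{quote(repository, safe='')}"
        f"/branches/{quote(branch, safe='')}/objects?"
        + urlencode({"path": path})  # path-based addressing within the branch
    )
    headers = {"Content-Type": content_type}  # content-type awareness
    if create_only:
        # Conditional write: succeed only if nothing exists at `path` yet.
        headers["If-None-Match"] = "*"
    return Request(url, data=data, headers=headers, method="POST")


req = build_upload_request(
    "https://lakefs.example.com/api/v1",  # hypothetical endpoint
    "example-repo",
    "ingest",
    "data/customers/2024/01/records.csv",
    b"id,name\n1,Ada\n",
    content_type="text/csv",
    create_only=True,
)
print(req.get_method(), req.full_url)
print(req.header_items())
```

Keeping request construction separate from sending makes the addressing and header choices easy to inspect before any data leaves the client.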
Usage
Data upload is used whenever new or modified data needs to be introduced into a versioned repository:
- Ingesting raw data: Upload CSV, Parquet, JSON, or other data files from data pipelines into a staging branch.
- Updating datasets: Replace or augment existing objects on a branch with new versions.
- Storing model artifacts: Upload trained model files, feature stores, or evaluation results for versioned tracking.
- Configuration management: Store pipeline configuration files, schema definitions, or data quality rules alongside the data.
- Conditional writes: Use `If-None-Match: *` to ensure an object is only created if it does not already exist, preventing accidental overwrites in concurrent workflows.
Theoretical Basis
Data upload in version-controlled storage operates on a two-phase write model:
Phase 1: Staging
When an object is uploaded to a branch, it is written to the underlying object storage and a staging entry is created in the lakeFS metadata store. The staging entry records:
- The object path within the branch namespace
- The physical address in object storage
- Checksum (typically MD5 or SHA-256) for integrity verification
- Size in bytes
- Content type and user-defined metadata
Phase 2: Commit
The staged object becomes part of the permanent version history only when a commit operation is performed on the branch. Until then, the staging entry can be overwritten, deleted, or rolled back.
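The two phases can be illustrated with a small in-memory model. This is a toy sketch, not how lakeFS is implemented: the class names `StagingEntry` and `Branch`, the `mem://` address scheme, and the choice of SHA-256 are all assumptions, and deletions are not modeled.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass(frozen=True)
class StagingEntry:
    """Fields a staging entry records, per the list above (toy version)."""
    path: str
    physical_address: str
    checksum: str
    size_bytes: int
    content_type: str


@dataclass
class Branch:
    committed: dict = field(default_factory=dict)  # path -> entry, part of history
    staged: dict = field(default_factory=dict)     # path -> entry, not yet committed
    commits: list = field(default_factory=list)    # commit messages, oldest first

    def upload(self, path, data, content_type="application/octet-stream"):
        """Phase 1: record a staging entry for the uploaded bytes."""
        digest = hashlib.sha256(data).hexdigest()
        entry = StagingEntry(path, f"mem://objects/{digest[:12]}",
                             digest, len(data), content_type)
        self.staged[path] = entry  # a later upload to the same path overwrites it
        return entry

    def read(self, path):
        """Readers of the branch see staged entries before committed ones."""
        if path in self.staged:
            return self.staged[path]
        return self.committed.get(path)

    def commit(self, message):
        """Phase 2: fold staged entries into the committed state."""
        self.committed.update(self.staged)
        self.staged.clear()
        self.commits.append(message)

    def reset(self):
        """Roll back everything that has not been committed."""
        self.staged.clear()


branch = Branch()
branch.upload("data/a.csv", b"id\n1\n", "text/csv")
print(len(branch.staged), len(branch.committed))  # 1 0
branch.commit("add a.csv")
print(len(branch.staged), len(branch.committed))  # 0 1
```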
Object addressing:
Each object in lakeFS has two addresses:
- Logical address: The path within the branch (e.g., `data/customers/2024/01/records.parquet`)
- Physical address: The actual location in the underlying object store (e.g., `s3://my-bucket/repo-id/data/abc123def456`)
lakeFS maintains the mapping between logical and physical addresses, enabling features like branching and merging without data duplication.
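As a rough illustration of why this indirection avoids duplication, the sketch below models each branch as a plain dictionary from logical path to physical address. The helper names `create_branch` and `physical_objects` and the second physical address are hypothetical; a real system would likely share the mapping structure itself rather than copy it.

```python
def create_branch(source):
    """Create a branch by copying pointers (addresses), never object data."""
    return dict(source)


def physical_objects(*branches):
    """Count distinct physical objects referenced across the given branches."""
    return len({address for branch in branches for address in branch.values()})


shared_path = "data/customers/2024/01/records.parquet"
main = {shared_path: "s3://my-bucket/repo-id/data/abc123def456"}

feature = create_branch(main)
# Uploading on the feature branch adds one new physical object (hypothetical address).
feature["data/customers/2024/02/records.parquet"] = (
    "s3://my-bucket/repo-id/data/hypothetical789"
)

logical_entries = len(main) + len(feature)
print(logical_entries, physical_objects(main, feature))  # 3 2
```

Three logical entries resolve to two stored objects because both branches point at the same physical address for the shared path.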
Conditional write semantics:
- `If-None-Match: *` ensures the upload succeeds only if no object exists at the target path (create-only semantics).
- `If-Match: <etag>` ensures the upload succeeds only if the existing object matches the specified ETag (compare-and-swap semantics).
These conditional operations enable safe concurrent writes to the same branch.
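The two preconditions can be sketched against an in-memory store. Everything here is an illustrative assumption: the function `conditional_put`, the `PreconditionFailed` exception, and the use of a SHA-256 hex digest as the ETag (real ETags depend on the storage backend).

```python
import hashlib


class PreconditionFailed(Exception):
    """Raised when a conditional write's precondition does not hold."""


def etag_of(data):
    # Assumption: the ETag is a content hash; real backends may differ.
    return hashlib.sha256(data).hexdigest()


def conditional_put(store, path, data, if_none_match=None, if_match=None):
    """Write `data` at `path` in `store` (a dict) if the preconditions hold.

    Returns the ETag of the newly written object.
    """
    current = store.get(path)
    if if_none_match == "*" and current is not None:
        # Create-only: refuse to overwrite an existing object.
        raise PreconditionFailed(f"{path} already exists")
    if if_match is not None:
        # Compare-and-swap: the caller must have seen the current version.
        if current is None or etag_of(current) != if_match:
            raise PreconditionFailed(f"{path} does not match {if_match}")
    store[path] = data
    return etag_of(data)


store = {}
tag = conditional_put(store, "cfg/rules.json", b"v1", if_none_match="*")
try:
    conditional_put(store, "cfg/rules.json", b"v2", if_none_match="*")
except PreconditionFailed as exc:
    print("rejected:", exc)
conditional_put(store, "cfg/rules.json", b"v2", if_match=tag)
print(store["cfg/rules.json"])
```

In a concurrent setting, a writer whose `If-Match` value has gone stale is rejected and can re-read the object before retrying, rather than silently overwriting another writer's update.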