
Principle:Treeverse LakeFS S3 Data Writing

From Leeroopedia


Knowledge Sources
Domains: S3_Compatibility, Data_Integration
Last Updated: 2026-02-08 00:00 GMT

Overview

Writing versioned data to lakeFS through its S3-compatible protocol, enabling seamless integration with existing S3 tools.

Description

Writing data through the lakeFS S3 gateway stages changes on the target branch, just like uploading through the lakeFS API. Changes written via S3 operations are uncommitted until a commit operation is performed via the lakeFS REST API. This separation allows batch writes through S3 followed by a single atomic commit.

The S3 gateway supports the full range of S3 write operations:

  • PutObject -- Upload a single object
  • CreateMultipartUpload / UploadPart / CompleteMultipartUpload -- Upload large objects in parts
  • CopyObject -- Server-side copy within or between repositories
  • DeleteObject -- Remove a single object
  • DeleteObjects -- Bulk delete multiple objects

Usage

Use this principle when:

  • Writing data to a lakeFS repository through S3-compatible tools
  • Ingesting data from ETL pipelines that use S3 as their output destination
  • Uploading large files that require multipart upload
  • Copying data between branches or repositories via the S3 protocol
  • Deleting objects from a branch through S3 tools

Theoretical Basis

The write path through the S3 gateway follows a stage-then-commit model:

1. Client writes objects via S3 PutObject/MultipartUpload
   --> Objects are staged (uncommitted) on the target branch

2. Client calls lakeFS REST API to commit
   --> POST /api/v1/repositories/{repo}/branches/{branch}/commits
   --> All staged changes become a single atomic commit

3. Committed data is now visible on the branch
   --> Other branches are unaffected
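The commit in step 2 is a plain REST request against the endpoint shown above. A sketch using requests (the base URL and credentials are placeholders):

```python
import requests


def commit_url(base_url: str, repo: str, branch: str) -> str:
    """Commit endpoint: POST /api/v1/repositories/{repo}/branches/{branch}/commits."""
    return f"{base_url}/api/v1/repositories/{repo}/branches/{branch}/commits"


def commit(base_url: str, repo: str, branch: str, message: str, auth) -> dict:
    """Turn all changes staged on `branch` into a single atomic commit."""
    resp = requests.post(
        commit_url(base_url, repo, branch),
        json={"message": message},
        auth=auth,  # (access_key, secret_key) basic auth, placeholders
    )
    resp.raise_for_status()
    return resp.json()  # commit metadata, including the new commit id


# commit("http://localhost:8000", "my-repo", "main", "Update dataset", ("KEY", "SECRET"))
```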

Supported write operations and their behavior:

S3 Operation | lakeFS Behavior | Notes
PutObject | Stages the object on the branch | Supports content type, user metadata, and conditional writes (If-None-Match: *)
Multipart Upload | Stages a large object assembled from parts | Minimum part size: 5 MiB; supports up to 10,000 parts
CopyObject | Server-side copy; no data re-upload | Same-branch copy shares the physical address; cross-repository copy creates a new physical object
DeleteObject | Marks the object as deleted (tombstone) | The delete is staged until committed
DeleteObjects | Bulk delete of up to 1,000 objects per request | Each object is independently staged for deletion
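The multipart limits in the table (5 MiB minimum part size, at most 10,000 parts) can be checked up front when splitting a payload. A sketch of the three-call multipart flow, assuming a boto3-style client; the part-size default and helper names are illustrative:

```python
MIN_PART = 5 * 1024 * 1024   # 5 MiB minimum part size (the last part may be smaller)
MAX_PARTS = 10_000


def part_ranges(total_size: int, part_size: int = 8 * 1024 * 1024):
    """Split total_size bytes into (start, end) ranges obeying the limits."""
    if part_size < MIN_PART:
        raise ValueError("part size below the 5 MiB minimum")
    ranges = [(start, min(start + part_size, total_size))
              for start in range(0, total_size, part_size)]
    if len(ranges) > MAX_PARTS:
        raise ValueError("more than 10,000 parts; use a larger part size")
    return ranges


def multipart_upload(s3, bucket, key, payload: bytes, part_size=8 * 1024 * 1024):
    """CreateMultipartUpload, then UploadPart per range, then CompleteMultipartUpload."""
    upload = s3.create_multipart_upload(Bucket=bucket, Key=key)
    parts = []
    for n, (start, end) in enumerate(part_ranges(len(payload), part_size), start=1):
        r = s3.upload_part(Bucket=bucket, Key=key, UploadId=upload["UploadId"],
                           PartNumber=n, Body=payload[start:end])
        parts.append({"ETag": r["ETag"], "PartNumber": n})
    return s3.complete_multipart_upload(
        Bucket=bucket, Key=key, UploadId=upload["UploadId"],
        MultipartUpload={"Parts": parts})
```

The assembled object is staged on the branch like any other write, so it still needs a commit to become visible.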

Key behaviors:

  1. All writes are staged (uncommitted) until an explicit commit via the lakeFS API
  2. No separate "add" step is required -- PutObject automatically stages the change
  3. Writes to read-only repositories are rejected with an error
  4. Conditional writes using If-None-Match: * prevent overwriting existing objects (returns 412 Precondition Failed)
  5. User metadata is preserved and can be set via standard S3 metadata headers (x-amz-meta-*)

A runnable version of the write-then-commit workflow (the endpoint, credentials, and data variables are illustrative):

# Stage multiple objects via the S3 gateway
import boto3, requests

ENDPOINT, AUTH = "http://localhost:8000", ("ACCESS_KEY", "SECRET_KEY")
s3 = boto3.client("s3", endpoint_url=ENDPOINT,
                  aws_access_key_id=AUTH[0], aws_secret_access_key=AUTH[1])
s3.put_object(Bucket="my-repo", Key="main/data/file1.csv", Body=data1)
s3.put_object(Bucket="my-repo", Key="main/data/file2.csv", Body=data2)
s3.delete_object(Bucket="my-repo", Key="main/data/old_file.csv")

# Commit all staged changes atomically via the lakeFS REST API
requests.post(f"{ENDPOINT}/api/v1/repositories/my-repo/branches/main/commits",
              json={"message": "Update dataset"}, auth=AUTH)
