Principle:Treeverse LakeFS S3 Commit Management

From Leeroopedia


Knowledge Sources
Domains: S3_Compatibility, Data_Integration
Last Updated: 2026-02-08 00:00 GMT

Overview

Committing S3-staged changes through the lakeFS REST API to create versioned snapshots.

Description

Changes written through the S3 gateway are automatically staged but not committed. To create a version snapshot, users must call the lakeFS REST API commit endpoint. This separation of "write" (S3 protocol) and "version" (lakeFS API) operations is a fundamental design principle of the lakeFS S3 gateway integration.

The commit operation:

  1. Takes all staged changes on a branch (uploads, deletes, copies made via S3 or lakeFS API)
  2. Creates a single atomic commit with a message and optional metadata
  3. Returns a commit object with a unique ID, timestamp, and parent references
  4. Makes the committed state visible as the new head of the branch

The commit call is the bridge between the S3 protocol world, where changes are produced, and the lakeFS versioning world, where they become part of branch history.
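The four steps above can be illustrated with a toy in-memory model. This is only a sketch of the semantics, not how lakeFS is implemented: `ToyBranch`, `ToyCommit`, and the SHA-256 based ID are hypothetical names and choices made for illustration.

```python
import hashlib
import time
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ToyCommit:
    """Hypothetical stand-in for a commit object (not the lakeFS type)."""
    id: str
    parents: List[str]
    message: str
    metadata: Dict[str, str]
    creation_date: int
    snapshot: Dict[str, bytes]  # path -> content as of this commit


@dataclass
class ToyBranch:
    """Hypothetical branch with a staging area and a committed head."""
    head: Optional[ToyCommit] = None
    staged: Dict[str, Optional[bytes]] = field(default_factory=dict)

    def put(self, path: str, content: bytes) -> None:
        # Models an S3 PutObject or CopyObject: the change is staged only.
        self.staged[path] = content

    def delete(self, path: str) -> None:
        # Models an S3 DeleteObject: None marks a staged deletion.
        self.staged[path] = None

    def read_committed(self, path: str) -> Optional[bytes]:
        # Reads from the head commit only; staged changes are ignored.
        return self.head.snapshot.get(path) if self.head else None

    def commit(self, message: str, metadata: Optional[Dict[str, str]] = None) -> ToyCommit:
        if not self.staged:
            raise ValueError("nothing to commit")
        # Step 1: take all staged changes and apply them to a copy of the head.
        snapshot = dict(self.head.snapshot) if self.head else {}
        for path, content in self.staged.items():
            if content is None:
                snapshot.pop(path, None)
            else:
                snapshot[path] = content
        # Steps 2 and 3: build one commit with message, metadata, ID, and parents.
        parents = [self.head.id] if self.head else []
        created = int(time.time())
        raw = repr((sorted(snapshot.items()), parents, message, created))
        commit = ToyCommit(hashlib.sha256(raw.encode()).hexdigest(), parents,
                           message, dict(metadata or {}), created, snapshot)
        # Step 4: the commit becomes the new head and the staging area is cleared.
        self.head = commit
        self.staged = {}
        return commit


branch = ToyBranch()
branch.put("data/a.csv", b"1,2,3")
branch.put("data/b.csv", b"4,5,6")
visible_before = branch.read_committed("data/a.csv")  # staged, not yet committed
first = branch.commit("ingest batch 1", {"job": "demo"})
branch.delete("data/a.csv")
second = branch.commit("drop a.csv")
```

In this model both uploads become visible together at the first commit, and the second commit records the first as its parent while leaving the first snapshot untouched.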

Usage

Use this principle when:

  • Completing an S3-based data ingestion workflow with a commit
  • Understanding the two-phase nature of lakeFS writes (stage via S3, commit via API)
  • Designing ETL pipelines that write via S3 and need version control
  • Building automation that writes data through S3 tools and then commits via the REST API

Theoretical Basis

The commit model enforces a clear separation of concerns:

S3 Protocol Layer          lakeFS Versioning Layer
==================         =======================
PutObject    ----\
CopyObject   -----+--->  Staging Area  ---> Commit (REST API)  ---> Branch History
DeleteObject ----/             |                    |
                               |                    v
                        (uncommitted)        (immutable snapshot)

Why this separation matters:

  1. Atomicity: Multiple S3 writes can be grouped into a single atomic commit, ensuring consumers see a consistent state
  2. Isolation: Uncommitted changes on one branch do not affect other branches or consumers reading committed data
  3. Auditability: Every commit has a message, timestamp, committer, and optional metadata, creating a full audit trail
  4. Tool compatibility: S3-compatible tools do not need to know about commits; they write data using standard S3 operations

The commit workflow:

Step | Protocol   | Operation                                           | Description
-----|------------|-----------------------------------------------------|------------------------------------------------------
1    | S3         | PutObject / CopyObject / DeleteObject               | Write changes; they are automatically staged
2    | REST API   | POST /repositories/{repo}/branches/{branch}/commits | Commit all staged changes atomically
3    | S3 or REST | GetObject or list commits                           | Read the committed data or inspect the commit history
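Step 2 of the workflow can be sketched with the Python standard library alone. The block below only builds the HTTP request and does not send it. The base URL `http://localhost:8000/api/v1`, the use of HTTP Basic authentication with an access key pair, the repository name, and the helper `build_commit_request` are assumptions made for illustration; check them against your own lakeFS installation.

```python
import base64
import json
import urllib.request
from urllib.parse import quote


def build_commit_request(base_url, repo, branch, access_key, secret_key,
                         message, metadata=None):
    """Build (but do not send) a POST to the commit endpoint.

    Hypothetical helper: the path follows the endpoint shown in the
    workflow table; base URL and auth scheme are assumptions.
    """
    url = "{}/repositories/{}/branches/{}/commits".format(
        base_url.rstrip("/"), quote(repo, safe=""), quote(branch, safe=""))
    body = {"message": message}
    if metadata:
        body["metadata"] = metadata
    token = base64.b64encode(f"{access_key}:{secret_key}".encode()).decode()
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
    )


# Step 1 (not shown): any S3 client writes objects through the S3 gateway.
# Step 2: commit whatever is staged on the branch.
req = build_commit_request(
    "http://localhost:8000/api/v1",   # assumed API base URL
    "example-repo",                   # placeholder repository
    "main",
    "EXAMPLE_ACCESS_KEY",             # placeholder credentials
    "EXAMPLE_SECRET_KEY",
    "Load daily batch",
    {"pipeline": "daily-etl"},
)
# urllib.request.urlopen(req) would send it; omitted so the sketch has no side effects.
```

Keeping request construction separate from sending makes the commit step easy to unit-test in a pipeline before it is pointed at a real server.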

Commit request schema (CommitCreation):

CommitCreation {
    message:     string            (required) -- Human-readable commit message
    metadata:    map[string]string (optional) -- Arbitrary key-value pairs for automation
    date:        integer           (optional) -- Override creation date (Unix Epoch in seconds)
    allow_empty: boolean           (optional, default: false) -- Allow commits with no changes
    force:       boolean           (optional, default: false) -- Force commit
}
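A request body matching this schema could be assembled as follows. This is a minimal sketch: the Python class and its `to_payload` method are hypothetical, and omitting optional fields that still have their default value is a design choice of the sketch, not something the schema requires.

```python
import json
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class CommitCreation:
    """Hypothetical Python mirror of the CommitCreation schema above."""
    message: str
    metadata: Dict[str, str] = field(default_factory=dict)
    date: Optional[int] = None        # Unix Epoch in seconds
    allow_empty: bool = False
    force: bool = False

    def to_payload(self) -> dict:
        # message is the only required field.
        if not self.message:
            raise ValueError("message is required")
        payload = {"message": self.message}
        # Optional fields are included only when they differ from the default.
        if self.metadata:
            payload["metadata"] = dict(self.metadata)
        if self.date is not None:
            payload["date"] = int(self.date)
        if self.allow_empty:
            payload["allow_empty"] = True
        if self.force:
            payload["force"] = True
        return payload


body = json.dumps(
    CommitCreation("Nightly load", {"source": "s3-sync"}).to_payload(),
    sort_keys=True,
)
print(body)  # {"message": "Nightly load", "metadata": {"source": "s3-sync"}}
```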

Commit response schema (Commit):

Commit {
    id:             string            -- Unique commit identifier
    parents:        []string          -- Parent commit IDs
    committer:      string            -- Who created the commit
    message:        string            -- Commit message
    creation_date:  integer           -- Unix Epoch in seconds
    meta_range_id:  string            -- Internal reference to committed data
    metadata:       map[string]string -- User-provided metadata
}
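On the client side, a response shaped like this schema might be parsed as shown below. The sample JSON values (IDs, committer, timestamp) are invented for illustration and do not come from a real server; `parse_commit` and the `created_at` property are hypothetical helpers.

```python
import json
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict, List


@dataclass(frozen=True)
class Commit:
    """Hypothetical Python mirror of the Commit response schema above."""
    id: str
    parents: List[str]
    committer: str
    message: str
    creation_date: int                # Unix Epoch in seconds
    meta_range_id: str
    metadata: Dict[str, str] = field(default_factory=dict)

    @property
    def created_at(self) -> datetime:
        # Convert the epoch-seconds field to a timezone-aware UTC datetime.
        return datetime.fromtimestamp(self.creation_date, tz=timezone.utc)


def parse_commit(raw: str) -> Commit:
    doc = json.loads(raw)
    return Commit(
        id=doc["id"],
        parents=list(doc.get("parents") or []),
        committer=doc["committer"],
        message=doc["message"],
        creation_date=int(doc["creation_date"]),
        meta_range_id=doc.get("meta_range_id", ""),
        metadata=dict(doc.get("metadata") or {}),
    )


# Invented sample response, for illustration only.
sample = json.dumps({
    "id": "c0ffee01",
    "parents": ["deadbeef"],
    "committer": "etl-bot",
    "message": "Load daily batch",
    "creation_date": 1700000000,
    "meta_range_id": "abc123",
    "metadata": {"pipeline": "daily-etl"},
})
commit = parse_commit(sample)
print(commit.id, commit.created_at.isoformat())
```

Recording the returned `id` alongside pipeline run metadata gives downstream jobs a stable reference to exactly the data that was committed.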
