Implementation:Treeverse LakeFS Commit Via S3 Workflow

From Leeroopedia


Knowledge Sources
Domains S3_Compatibility, REST_API
Last Updated 2026-02-08 00:00 GMT

Overview

API endpoint for committing changes that were staged via the S3 gateway, bridging the S3 write protocol with lakeFS version control.

Description

This implementation documents the lakeFS REST API commit endpoint as used in the S3 gateway integration workflow. After writing objects through the S3 gateway (PutObject, CopyObject, DeleteObject), the commit endpoint is called to create an atomic version snapshot of all staged changes.

The endpoint is:

POST /api/v1/repositories/{repository}/branches/{branch}/commits

Key behavior: Every write made through the S3 gateway (PutObject, CopyObject, DeleteObject) is recorded on the target branch as a staged, uncommitted change. No separate "add" or "stage" step is needed before committing. The commit operation packages all staged changes on the branch into a single immutable snapshot.
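As a minimal sketch of how the endpoint path above can be assembled in client code, the helper below fills in the repository and branch segments. The function name `commit_url` is hypothetical and not part of lakeFS. Percent-encoding the path segments is a defensive assumption rather than a documented requirement.

```python
from urllib.parse import quote


def commit_url(endpoint: str, repository: str, branch: str) -> str:
    """Build the commit endpoint URL for a repository and branch.

    Hypothetical helper: only the path template comes from the page above.
    """
    # Percent-encode each path segment so unusual characters cannot
    # change the structure of the URL.
    repo_seg = quote(repository, safe="")
    branch_seg = quote(branch, safe="")
    base = endpoint.rstrip("/")
    return f"{base}/api/v1/repositories/{repo_seg}/branches/{branch_seg}/commits"


print(commit_url("http://localhost:8000", "my-repo", "main"))
```

The returned string can be passed directly to `requests.post` in place of a hand-written f-string.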

Usage

Use this implementation when:

  • Completing an S3-based data ingestion pipeline with a version commit
  • Building automation that writes data via S3 tools and commits via REST API
  • Creating atomic snapshots of data that was written through Spark, pandas, or AWS CLI via the S3 gateway

Code Reference

Source Location

  • File: api/swagger.yml
  • Lines: L4252-4292 (commit endpoint definition)
  • Schemas: L651-673 (CommitCreation), L600-630 (Commit)
  • Operation ID: commit

Signature

# api/swagger.yml - commit endpoint
/repositories/{repository}/branches/{branch}/commits:
  parameters:
    - in: path
      name: repository
      required: true
      schema:
        type: string
    - in: path
      name: branch
      required: true
      schema:
        type: string
  post:
    parameters:
      - in: query
        name: source_metarange
        required: false
        description: >
          The source metarange to commit.
          Branch must not have uncommitted changes.
        schema:
          type: string
    tags:
      - commits
    operationId: commit
    summary: create commit
    requestBody:
      required: true
      content:
        application/json:
          schema:
            $ref: "#/components/schemas/CommitCreation"
    responses:
      201:
        description: commit
        content:
          application/json:
            schema:
              $ref: "#/components/schemas/Commit"
      400:
        $ref: "#/components/responses/ValidationError"
      401:
        $ref: "#/components/responses/Unauthorized"
      403:
        $ref: "#/components/responses/Forbidden"
      404:
        $ref: "#/components/responses/NotFound"
      409:
        $ref: "#/components/responses/Conflict"
      412:
        $ref: "#/components/responses/PreconditionFailed"
      429:
        description: too many requests

Import

# Python requests library for calling the lakeFS REST API
import requests
from requests.auth import HTTPBasicAuth

# Placeholder endpoint and credentials; replace with your own values
LAKEFS_ENDPOINT = 'http://localhost:8000'
LAKEFS_ACCESS_KEY = 'AKIAIOSFDNN7EXAMPLEQ'
LAKEFS_SECRET_KEY = 'wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY'

I/O Contract

Inputs

Parameter | Location | Type | Required | Description
repository | path | string | Yes | lakeFS repository name
branch | path | string | Yes | Branch name to commit on
message | body | string | Yes | Human-readable commit message
metadata | body | map[string]string | No | Arbitrary key-value pairs for automation and auditing
date | body | integer (int64) | No | Override creation date (Unix Epoch in seconds)
allow_empty | body | boolean | No | Allow commits with no changes (default: false)
force | body | boolean | No | Force commit (default: false)
source_metarange | query | string | No | Source metarange to commit (branch must have no uncommitted changes)
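The body parameters above can be assembled with a small helper. This is a sketch under stated assumptions: `build_commit_body` is a hypothetical name, and coercing metadata values to strings is a convenience chosen here to match the `map[string]string` type, not behavior defined by lakeFS.

```python
from typing import Dict, Optional


def build_commit_body(message: str,
                      metadata: Optional[Dict[str, object]] = None,
                      allow_empty: bool = False) -> dict:
    """Assemble a CommitCreation-style JSON body (hypothetical helper)."""
    # 'message' is the only required body field.
    if not message or not message.strip():
        raise ValueError("message is required")
    body: dict = {"message": message}
    if metadata:
        # metadata is map[string]string, so stringify keys and values.
        body["metadata"] = {str(k): str(v) for k, v in metadata.items()}
    if allow_empty:
        # Only send the flag when it differs from the default (false).
        body["allow_empty"] = True
    return body


print(build_commit_body("Add 2026 sales data", {"batch_id": 42}))
```

The result can be passed as the `json=` argument of `requests.post`.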

Outputs

Field | Type | Description
id | string | Unique commit identifier (SHA-256 hash)
parents | []string | Parent commit IDs (single parent for normal commits)
committer | string | The user who created the commit
message | string | The commit message
creation_date | integer (int64) | Unix Epoch in seconds
meta_range_id | string | Internal reference to the committed data range
metadata | map[string]string | User-provided metadata key-value pairs
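One way to consume the response fields listed above is to map them onto a typed record and convert `creation_date` from epoch seconds to a timezone-aware datetime. The `CommitInfo` class and `parse_commit` function are illustrative names, not part of any lakeFS client, and the sample IDs are made-up placeholders.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict, List


@dataclass
class CommitInfo:
    id: str
    parents: List[str]
    committer: str
    message: str
    creation_date: datetime
    meta_range_id: str
    metadata: Dict[str, str] = field(default_factory=dict)


def parse_commit(payload: dict) -> CommitInfo:
    """Convert a commit response dict into a CommitInfo (hypothetical helper)."""
    return CommitInfo(
        id=payload["id"],
        parents=list(payload.get("parents", [])),
        committer=payload.get("committer", ""),
        message=payload.get("message", ""),
        # creation_date is Unix epoch seconds; interpret it as UTC.
        creation_date=datetime.fromtimestamp(payload["creation_date"], tz=timezone.utc),
        meta_range_id=payload.get("meta_range_id", ""),
        metadata=dict(payload.get("metadata") or {}),
    )


# Placeholder values for illustration only.
sample = {
    "id": "a1b2c3",
    "parents": ["f6e5d4"],
    "committer": "admin",
    "message": "Daily data ingestion complete",
    "creation_date": 1770508800,
    "meta_range_id": "example-range",
    "metadata": {"pipeline": "daily-etl"},
}
commit = parse_commit(sample)
print(commit.creation_date.isoformat())  # 2026-02-08T00:00:00+00:00
```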

Error responses:

HTTP Status | Description
400 | Validation error (invalid request body)
401 | Unauthorized (invalid or missing credentials)
403 | Forbidden (insufficient permissions)
404 | Repository or branch not found
409 | Conflict (concurrent commit on same branch)
412 | Precondition failed (for example, a pre-commit hook rejected the commit)
429 | Too many requests (rate limited)
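Automation that calls the commit endpoint usually needs to branch on these status codes. The sketch below maps each documented code to a coarse action label. The labels and the function name `classify_commit_status` are assumptions made for illustration, and the suggested reactions (such as backing off on 429) are a reasonable client policy rather than something the API mandates.

```python
def classify_commit_status(status_code: int) -> str:
    """Map an HTTP status from the commit endpoint to an action label.

    The labels are hypothetical; adapt them to your pipeline.
    """
    if status_code == 201:
        return "committed"
    if status_code == 429:
        return "retry_later"            # rate limited: back off, then retry
    if status_code == 409:
        return "conflict"               # e.g. a concurrent commit on the branch
    if status_code in (401, 403):
        return "check_credentials"      # missing, invalid, or insufficient access
    if status_code == 404:
        return "check_repository_or_branch"
    if status_code == 400:
        return "invalid_request"
    if status_code == 412:
        return "precondition_failed"
    return "unexpected"


for code in (201, 409, 429, 500):
    print(code, classify_commit_status(code))
```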

Usage Examples

Python: Full S3 write + commit workflow

import boto3
import requests
from requests.auth import HTTPBasicAuth

LAKEFS_ENDPOINT = 'http://localhost:8000'
LAKEFS_ACCESS_KEY = 'AKIAIOSFDNN7EXAMPLEQ'
LAKEFS_SECRET_KEY = 'wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY'

# Step 1: Write data via S3 gateway
s3 = boto3.client('s3',
    endpoint_url=LAKEFS_ENDPOINT,
    aws_access_key_id=LAKEFS_ACCESS_KEY,
    aws_secret_access_key=LAKEFS_SECRET_KEY,
)

s3.put_object(
    Bucket='my-repo',
    Key='main/data/sales_2026.csv',
    Body=b'date,amount\n2026-01-01,100\n2026-01-02,200\n',
    ContentType='text/csv',
    Metadata={'source': 'etl-pipeline', 'batch_id': '42'}
)

s3.put_object(
    Bucket='my-repo',
    Key='main/data/customers_2026.csv',
    Body=b'id,name\n1,Alice\n2,Bob\n',
    ContentType='text/csv'
)

# Step 2: Commit all staged changes via lakeFS REST API
response = requests.post(
    f'{LAKEFS_ENDPOINT}/api/v1/repositories/my-repo/branches/main/commits',
    json={
        'message': 'Add 2026 sales and customer data',
        'metadata': {
            'pipeline': 'daily-etl',
            'batch_id': '42',
            'source': 's3-gateway'
        }
    },
    auth=HTTPBasicAuth(LAKEFS_ACCESS_KEY, LAKEFS_SECRET_KEY)
)

response.raise_for_status()  # stop here if the commit was rejected (non-2xx status)
commit = response.json()
print(f"Commit ID: {commit['id']}")
print(f"Timestamp: {commit['creation_date']}")
print(f"Message:   {commit['message']}")
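The example above addresses objects with the repository as the S3 bucket and the branch as the first segment of the key. A minimal sketch of that mapping follows, useful for deriving which branch to commit from an S3 address. The helper name `split_gateway_address` is hypothetical.

```python
from typing import Tuple


def split_gateway_address(bucket: str, key: str) -> Tuple[str, str, str]:
    """Split an S3 gateway address into (repository, branch, object path).

    Follows the convention used in the example above:
    Bucket = repository, Key = '<branch>/<path>'.
    """
    branch, _, path = key.partition("/")
    if not branch:
        raise ValueError("key must start with a branch name")
    return bucket, branch, path


repo, branch, path = split_gateway_address("my-repo", "main/data/sales_2026.csv")
print(repo, branch, path)
```

With the repository and branch recovered this way, the commit request can target the same branch the objects were written to.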

Python: Spark write + commit workflow

from pyspark.sql import SparkSession
import requests
from requests.auth import HTTPBasicAuth

LAKEFS_ENDPOINT = 'http://localhost:8000'
LAKEFS_ACCESS_KEY = 'AKIAIOSFDNN7EXAMPLEQ'
LAKEFS_SECRET_KEY = 'wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY'

# Step 1: Write data via Spark using S3A
spark = SparkSession.builder \
    .config("spark.hadoop.fs.s3a.endpoint", LAKEFS_ENDPOINT) \
    .config("spark.hadoop.fs.s3a.access.key", LAKEFS_ACCESS_KEY) \
    .config("spark.hadoop.fs.s3a.secret.key", LAKEFS_SECRET_KEY) \
    .config("spark.hadoop.fs.s3a.path.style.access", "true") \
    .getOrCreate()

df = spark.createDataFrame([
    (1, "Alice", 100.0),
    (2, "Bob", 200.0),
], ["id", "name", "amount"])

df.write.mode("overwrite").parquet("s3a://my-repo/main/data/output/")

# Step 2: Commit via lakeFS REST API
response = requests.post(
    f'{LAKEFS_ENDPOINT}/api/v1/repositories/my-repo/branches/main/commits',
    json={
        'message': 'Spark job: write output dataset',
        'metadata': {'job_name': 'daily_aggregation'}
    },
    auth=HTTPBasicAuth(LAKEFS_ACCESS_KEY, LAKEFS_SECRET_KEY)
)

response.raise_for_status()  # stop here if the commit was rejected (non-2xx status)
print(f"Committed: {response.json()['id']}")

cURL: Commit via REST API

# Commit staged changes on the main branch
curl -X POST \
  'http://localhost:8000/api/v1/repositories/my-repo/branches/main/commits' \
  -H 'Content-Type: application/json' \
  -u 'AKIAIOSFDNN7EXAMPLEQ:wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY' \
  -d '{
    "message": "Daily data ingestion complete",
    "metadata": {
      "pipeline": "daily-etl",
      "run_id": "2026-02-08-001"
    }
  }'

# Example response:
# {
#   "id": "a1b2c3d4e5f6...",
#   "parents": ["f6e5d4c3b2a1..."],
#   "committer": "admin",
#   "message": "Daily data ingestion complete",
#   "creation_date": 1770508800,
#   "meta_range_id": "...",
#   "metadata": {"pipeline": "daily-etl", "run_id": "2026-02-08-001"}
# }

Related Pages

Implements Principle
