Implementation:Treeverse LakeFS Commit Via S3 Workflow
| Knowledge Sources | |
|---|---|
| Domains | S3_Compatibility, REST_API |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
API endpoint for committing changes that were staged via the S3 gateway, bridging the S3 write protocol with lakeFS version control.
Description
This implementation documents the lakeFS REST API commit endpoint as used in the S3 gateway integration workflow. After writing objects through the S3 gateway (PutObject, CopyObject, DeleteObject), the commit endpoint is called to create an atomic version snapshot of all staged changes.
The endpoint is:
POST /api/v1/repositories/{repository}/branches/{branch}/commits
Key behavior: All objects written via S3 PutObject are automatically staged on the target branch. No separate "add" or "stage" step is needed before committing. The commit operation packages all staged changes into a single immutable snapshot.
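As a small illustration of the endpoint path above, the sketch below builds the commit URL for a given repository and branch. The helper name `commit_url` is hypothetical, and percent-encoding the path segments is a defensive assumption rather than something the API definition requires.

```python
from urllib.parse import quote


def commit_url(endpoint: str, repository: str, branch: str) -> str:
    """Build the commit endpoint URL for a repository and branch.

    Hypothetical helper: path segments are percent-encoded defensively.
    """
    base = endpoint.rstrip("/")  # tolerate a trailing slash on the endpoint
    repo = quote(repository, safe="")
    ref = quote(branch, safe="")
    return f"{base}/api/v1/repositories/{repo}/branches/{ref}/commits"
```

Because staging is implicit, this URL plus a JSON body carrying a `message` is all a client needs once the S3 writes have finished.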
Usage
Use this implementation when:
- Completing an S3-based data ingestion pipeline with a version commit
- Building automation that writes data via S3 tools and commits via REST API
- Creating atomic snapshots of data that was written through Spark, pandas, or AWS CLI via the S3 gateway
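In the usage examples on this page, objects are addressed through the gateway with the repository as the S3 bucket and the branch as the first segment of the object key. Automation that writes via S3 and then commits via REST needs to recover the branch from that key. A minimal sketch of the mapping, using a hypothetical helper named `split_gateway_key`, might look like this:

```python
from typing import Tuple


def split_gateway_key(key: str) -> Tuple[str, str]:
    """Split an S3 gateway key of the form '<branch>/<path>' into (branch, path)."""
    branch, sep, path = key.partition("/")
    if not branch or not sep or not path:
        raise ValueError(f"expected '<branch>/<path>', got {key!r}")
    return branch, path
```

For the key `main/data/sales_2026.csv`, the helper yields the branch `main`, which is the branch to name in the commit request.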
Code Reference
Source Location
- File: api/swagger.yml
- Lines: L4252-4292 (commit endpoint definition)
- Schemas: L651-673 (CommitCreation), L600-630 (Commit)
- Operation ID: commit
Signature
# api/swagger.yml - commit endpoint
/repositories/{repository}/branches/{branch}/commits:
  parameters:
    - in: path
      name: repository
      required: true
      schema:
        type: string
    - in: path
      name: branch
      required: true
      schema:
        type: string
  post:
    parameters:
      - in: query
        name: source_metarange
        required: false
        description: >
          The source metarange to commit.
          Branch must not have uncommitted changes.
        schema:
          type: string
    tags:
      - commits
    operationId: commit
    summary: create commit
    requestBody:
      required: true
      content:
        application/json:
          schema:
            $ref: "#/components/schemas/CommitCreation"
    responses:
      201:
        description: commit
        content:
          application/json:
            schema:
              $ref: "#/components/schemas/Commit"
      400:
        $ref: "#/components/responses/ValidationError"
      401:
        $ref: "#/components/responses/Unauthorized"
      403:
        $ref: "#/components/responses/Forbidden"
      404:
        $ref: "#/components/responses/NotFound"
      409:
        $ref: "#/components/responses/Conflict"
      412:
        $ref: "#/components/responses/PreconditionFailed"
      429:
        description: too many requests
Import
import requests
# Python requests library for calling the lakeFS REST API
LAKEFS_ENDPOINT = 'http://localhost:8000'
LAKEFS_ACCESS_KEY = 'AKIAIOSFDNN7EXAMPLEQ'
LAKEFS_SECRET_KEY = 'wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY'
I/O Contract
Inputs
| Parameter | Location | Type | Required | Description |
|---|---|---|---|---|
| repository | path | string | Yes | lakeFS repository name |
| branch | path | string | Yes | Branch name to commit on |
| message | body | string | Yes | Human-readable commit message |
| metadata | body | map[string]string | No | Arbitrary key-value pairs for automation and auditing |
| date | body | integer (int64) | No | Override creation date (Unix Epoch in seconds) |
| allow_empty | body | boolean | No | Allow commits with no changes (default: false) |
| force | body | boolean | No | Force commit (default: false) |
| source_metarange | query | string | No | Source metarange to commit (branch must have no uncommitted changes) |
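To make the split between body fields and the query parameter concrete, here is a hedged sketch that assembles a commit request from the inputs listed above. The function name `build_commit_request` and its choice to omit optional fields left at their defaults are illustrative assumptions, not part of the lakeFS API.

```python
from typing import Any, Dict, Optional, Tuple


def build_commit_request(
    message: str,
    metadata: Optional[Dict[str, str]] = None,
    date: Optional[int] = None,
    allow_empty: bool = False,
    force: bool = False,
    source_metarange: Optional[str] = None,
) -> Tuple[Dict[str, str], Dict[str, Any]]:
    """Return (query_params, json_body) for the commit endpoint."""
    if not message:
        raise ValueError("message is required")
    body: Dict[str, Any] = {"message": message}
    if metadata:
        # metadata is a string-to-string map, so coerce keys and values
        body["metadata"] = {str(k): str(v) for k, v in metadata.items()}
    if date is not None:
        body["date"] = int(date)  # Unix epoch in seconds
    if allow_empty:
        body["allow_empty"] = True
    if force:
        body["force"] = True
    params: Dict[str, str] = {}
    if source_metarange:
        # the only input sent in the query string rather than the body
        params["source_metarange"] = source_metarange
    return params, body
```

The two return values can be passed to an HTTP client such as `requests.post(url, params=params, json=body, auth=...)`.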
Outputs
| Field | Type | Description |
|---|---|---|
| id | string | Unique commit identifier (SHA-256 hash) |
| parents | []string | Parent commit IDs (single parent for normal commits) |
| committer | string | The user who created the commit |
| message | string | The commit message |
| creation_date | integer (int64) | Unix Epoch in seconds |
| meta_range_id | string | Internal reference to the committed data range |
| metadata | map[string]string | User-provided metadata key-value pairs |
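For callers that want typed access to these fields, one possible approach is to load the response JSON into a small dataclass. The `Commit` class, the `parse_commit` function, and the `created_at` convenience property below are hypothetical names for this sketch. The field names follow the table above.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any, Dict, List


@dataclass(frozen=True)
class Commit:
    id: str
    parents: List[str]
    committer: str
    message: str
    creation_date: int  # Unix epoch in seconds
    meta_range_id: str
    metadata: Dict[str, str] = field(default_factory=dict)

    @property
    def created_at(self) -> datetime:
        """creation_date as a timezone-aware UTC datetime."""
        return datetime.fromtimestamp(self.creation_date, tz=timezone.utc)


def parse_commit(payload: Dict[str, Any]) -> Commit:
    """Convert the decoded JSON body of a 201 response into a Commit."""
    return Commit(
        id=payload["id"],
        parents=list(payload.get("parents", [])),
        committer=payload["committer"],
        message=payload["message"],
        creation_date=int(payload["creation_date"]),
        meta_range_id=payload["meta_range_id"],
        metadata=dict(payload.get("metadata") or {}),
    )
```

With `requests`, this would typically be called as `parse_commit(response.json())` after checking the status code.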
Error responses:
| HTTP Status | Description |
|---|---|
| 400 | Validation error (invalid request body) |
| 401 | Unauthorized (invalid or missing credentials) |
| 403 | Forbidden (insufficient permissions) |
| 404 | Repository or branch not found |
| 409 | Conflict (concurrent commit on same branch) |
| 412 | Precondition failed |
| 429 | Too many requests (rate limited) |
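A client can turn these status codes into explicit failures instead of parsing an error body as if it were a commit. The sketch below is one way to do that. `CommitError`, `check_commit_status`, and `is_retryable` are hypothetical names, and treating only 429 as retryable is an assumption of this example rather than documented lakeFS behavior.

```python
STATUS_DESCRIPTIONS = {
    400: "Validation error (invalid request body)",
    401: "Unauthorized (invalid or missing credentials)",
    403: "Forbidden (insufficient permissions)",
    404: "Repository or branch not found",
    409: "Conflict (concurrent commit on same branch)",
    412: "Precondition failed",
    429: "Too many requests (rate limited)",
}

# Assumption: only rate limiting is safe to retry without inspecting state.
RETRYABLE_STATUSES = frozenset({429})


class CommitError(Exception):
    """Raised when the commit endpoint returns a non-201 status."""

    def __init__(self, status: int, description: str) -> None:
        super().__init__(f"commit failed with HTTP {status}: {description}")
        self.status = status
        self.description = description


def is_retryable(status: int) -> bool:
    return status in RETRYABLE_STATUSES


def check_commit_status(status: int) -> None:
    """Return silently on 201, otherwise raise CommitError."""
    if status == 201:
        return
    description = STATUS_DESCRIPTIONS.get(status, "Unexpected status")
    raise CommitError(status, description)
```

With `requests`, a call such as `check_commit_status(response.status_code)` before `response.json()` keeps error responses from being read as commits.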
Usage Examples
Python: Full S3 write + commit workflow
import boto3
import requests
from requests.auth import HTTPBasicAuth

LAKEFS_ENDPOINT = 'http://localhost:8000'
LAKEFS_ACCESS_KEY = 'AKIAIOSFDNN7EXAMPLEQ'
LAKEFS_SECRET_KEY = 'wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY'

# Step 1: Write data via S3 gateway (Bucket = repository, Key = <branch>/<path>)
s3 = boto3.client(
    's3',
    endpoint_url=LAKEFS_ENDPOINT,
    aws_access_key_id=LAKEFS_ACCESS_KEY,
    aws_secret_access_key=LAKEFS_SECRET_KEY,
)
s3.put_object(
    Bucket='my-repo',
    Key='main/data/sales_2026.csv',
    Body=b'date,amount\n2026-01-01,100\n2026-01-02,200\n',
    ContentType='text/csv',
    Metadata={'source': 'etl-pipeline', 'batch_id': '42'},
)
s3.put_object(
    Bucket='my-repo',
    Key='main/data/customers_2026.csv',
    Body=b'id,name\n1,Alice\n2,Bob\n',
    ContentType='text/csv',
)

# Step 2: Commit all staged changes via lakeFS REST API
response = requests.post(
    f'{LAKEFS_ENDPOINT}/api/v1/repositories/my-repo/branches/main/commits',
    json={
        'message': 'Add 2026 sales and customer data',
        'metadata': {
            'pipeline': 'daily-etl',
            'batch_id': '42',
            'source': 's3-gateway',
        },
    },
    auth=HTTPBasicAuth(LAKEFS_ACCESS_KEY, LAKEFS_SECRET_KEY),
)
response.raise_for_status()  # stop on 4xx/5xx instead of parsing an error body
commit = response.json()
print(f"Commit ID: {commit['id']}")
print(f"Timestamp: {commit['creation_date']}")
print(f"Message: {commit['message']}")
Python: Spark write + commit workflow
from pyspark.sql import SparkSession
import requests
from requests.auth import HTTPBasicAuth

LAKEFS_ENDPOINT = 'http://localhost:8000'
LAKEFS_ACCESS_KEY = 'AKIAIOSFDNN7EXAMPLEQ'
LAKEFS_SECRET_KEY = 'wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY'

# Step 1: Write data via Spark using S3A
spark = SparkSession.builder \
    .config("spark.hadoop.fs.s3a.endpoint", LAKEFS_ENDPOINT) \
    .config("spark.hadoop.fs.s3a.access.key", LAKEFS_ACCESS_KEY) \
    .config("spark.hadoop.fs.s3a.secret.key", LAKEFS_SECRET_KEY) \
    .config("spark.hadoop.fs.s3a.path.style.access", "true") \
    .getOrCreate()
df = spark.createDataFrame([
    (1, "Alice", 100.0),
    (2, "Bob", 200.0),
], ["id", "name", "amount"])
# Path layout: s3a://<repository>/<branch>/<path>
df.write.mode("overwrite").parquet("s3a://my-repo/main/data/output/")

# Step 2: Commit via lakeFS REST API
response = requests.post(
    f'{LAKEFS_ENDPOINT}/api/v1/repositories/my-repo/branches/main/commits',
    json={
        'message': 'Spark job: write output dataset',
        'metadata': {'job_name': 'daily_aggregation'},
    },
    auth=HTTPBasicAuth(LAKEFS_ACCESS_KEY, LAKEFS_SECRET_KEY),
)
response.raise_for_status()  # stop on 4xx/5xx instead of parsing an error body
print(f"Committed: {response.json()['id']}")
cURL: Commit via REST API
# Commit staged changes on the main branch
curl -X POST \
  'http://localhost:8000/api/v1/repositories/my-repo/branches/main/commits' \
  -H 'Content-Type: application/json' \
  -u 'AKIAIOSFDNN7EXAMPLEQ:wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY' \
  -d '{
    "message": "Daily data ingestion complete",
    "metadata": {
      "pipeline": "daily-etl",
      "run_id": "2026-02-08-001"
    }
  }'
# Example response:
# {
#   "id": "a1b2c3d4e5f6...",
#   "parents": ["f6e5d4c3b2a1..."],
#   "committer": "admin",
#   "message": "Daily data ingestion complete",
#   "creation_date": 1770508800,
#   "meta_range_id": "...",
#   "metadata": {"pipeline": "daily-etl", "run_id": "2026-02-08-001"}
# }