Principle: Treeverse lakeFS S3 Gateway Configuration
| Knowledge Sources | |
|---|---|
| Domains | S3_Compatibility, Data_Integration |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
Configuring S3-compatible access to versioned data lakes through the lakeFS S3 gateway.
Description
lakeFS provides an S3-compatible gateway that translates S3 protocol requests into lakeFS versioning operations. This gateway runs on the same server as the lakeFS API (typically on port 8000) and enables existing S3-compatible tools to work with versioned data without any code changes.
The S3 gateway supports the core subset of the S3 API that data tools rely on, including:
- GetObject and HeadObject for reading data
- PutObject and CreateMultipartUpload for writing data
- ListObjectsV2 for listing objects
- CopyObject for server-side copies
- DeleteObject and DeleteObjects for removing data
- Presigned URLs for delegated access
The key insight is that any tool capable of speaking the S3 protocol (Spark, Hive, Athena, DuckDB, Presto, pandas, AWS CLI, Minio client) can be pointed at the lakeFS S3 gateway by simply changing the endpoint URL and providing lakeFS credentials.
Usage
Use this principle when:
- Integrating existing S3-compatible data tools with a lakeFS data lake
- Configuring Spark, Hive, Presto, or other analytics engines to read/write versioned data
- Setting up ETL pipelines that need to operate on branched or versioned datasets
- Enabling data scientists to use familiar S3-based workflows (boto3, pandas, DuckDB) with version control
Theoretical Basis
The S3 gateway relies on a path-style addressing convention:
s3://{repository}/{branch_or_commit}/{path/to/object}
     └─ bucket ─┘ └── key prefix ──┘ └ object path ─┘
In this scheme:
- The S3 bucket name maps to the lakeFS repository name
- The first segment of the S3 object key maps to the lakeFS branch (or commit ID)
- The remainder of the S3 object key maps to the object path within lakeFS
This mapping is transparent to S3 clients. They see a standard S3 bucket containing objects organized in a directory-like hierarchy. The versioning semantics are encoded entirely within the path convention.
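The mapping above can be expressed as a small helper. This is a hypothetical illustration (`parse_lakefs_s3_uri` is not part of any lakeFS client library):

```python
def parse_lakefs_s3_uri(uri: str) -> tuple[str, str, str]:
    """Split a gateway-addressed S3 URI into (repository, ref, path).

    The bucket is the repository, the first key segment is the branch
    or commit ID, and the remainder is the object path within lakeFS.
    """
    repository, _, key = uri.removeprefix("s3://").partition("/")
    ref, _, path = key.partition("/")
    return repository, ref, path


# E.g. s3://my-repo/main/events/2024/01.parquet
# → ("my-repo", "main", "events/2024/01.parquet")
```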
Configuration requirements:
- Endpoint URL: must point to the lakeFS server (e.g., http://localhost:8000)
- Force path-style addressing: must be enabled (lakeFS does not support virtual-hosted-style addressing)
- Credentials: lakeFS access key ID and secret access key (created via the lakeFS API or UI)
- Region: can be set to any value (e.g., us-east-1); lakeFS ignores the region
Pseudocode for client initialization:
s3_client = create_s3_client(
    endpoint = "http://<lakefs-host>:8000",
    credentials = (lakefs_access_key_id, lakefs_secret_access_key),
    path_style = true,
    region = "us-east-1"  // any valid region string
)