Workflow:Treeverse LakeFS S3 Gateway Integration
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, Data_Lake_Management, Cloud_Storage |
| Last Updated | 2026-02-08 10:00 GMT |
Overview
End-to-end process for accessing lakeFS repositories through the S3-compatible gateway, so that existing data tools such as Spark, Hive, AWS Athena, DuckDB, and Presto can read and write lakeFS-managed data.
Description
This workflow describes how to use the lakeFS S3 gateway to interact with versioned data using the standard S3 protocol and tools. The S3 gateway translates S3 API calls into lakeFS operations, so S3-compatible applications can read and write versioned data without code changes; only the endpoint, credentials, and paths differ. The gateway maps the S3 bucket name to a lakeFS repository and the first segment of the object key to a branch or other reference. It supports multipart uploads, pre-signed URLs, and object tagging. Existing data infrastructure can therefore work with versioned data while continuing to use S3 clients.
Usage
Execute this workflow when you need to integrate lakeFS with existing S3-compatible data tools and frameworks without modifying application code. Common triggers include: connecting Spark jobs to lakeFS-managed data, reading lakeFS data from AWS Athena or Presto, using DuckDB to query versioned Parquet files, integrating with ETL tools that support S3 protocol, or enabling data scientists to access versioned datasets using familiar S3 URIs.
Execution Steps
Step 1: Configure S3 Gateway Endpoint
Configure data tools to use the lakeFS server as an S3 endpoint. The lakeFS server exposes an S3-compatible API on the same port as the main API. Applications connect using the lakeFS endpoint URL and lakeFS access credentials (access key ID and secret access key) in place of AWS credentials.
Key considerations:
- The S3 gateway runs on the same lakeFS server (typically port 8000)
- Applications use lakeFS access key credentials for authentication
- The endpoint URL replaces the standard AWS S3 endpoint in client configuration
- Path-style addressing is the simplest choice; virtual-hosted-style addressing depends on additional DNS and server configuration
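The considerations above can be sketched as a small client factory. This is a minimal sketch assuming boto3 as the S3 client; the function names, the `http://localhost:8000` endpoint, the credentials, and the `us-east-1` region value are all illustrative placeholders, not values taken from a real deployment.

```python
def gateway_client_kwargs(endpoint_url, access_key_id, secret_access_key):
    """Build keyword arguments for boto3.client("s3", ...) aimed at the gateway.

    All three values come from the lakeFS installation, not from AWS.
    """
    return {
        "endpoint_url": endpoint_url,              # lakeFS server, e.g. port 8000
        "aws_access_key_id": access_key_id,        # lakeFS access key ID
        "aws_secret_access_key": secret_access_key,  # lakeFS secret access key
        # Assumption: boto3 wants some region for request signing;
        # the value here is a placeholder.
        "region_name": "us-east-1",
    }


def make_gateway_client(endpoint_url, access_key_id, secret_access_key):
    """Create a boto3 S3 client that uses path-style addressing."""
    # Imported lazily so gateway_client_kwargs works without boto3 installed.
    import boto3
    from botocore.config import Config

    return boto3.client(
        "s3",
        config=Config(s3={"addressing_style": "path"}),
        **gateway_client_kwargs(endpoint_url, access_key_id, secret_access_key),
    )
```

A caller would then use something like `make_gateway_client("http://localhost:8000", key_id, secret)` wherever it previously created a plain AWS S3 client.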
Step 2: Map Repository and Branch to S3 Path
Construct S3-compatible paths that reference lakeFS repositories and branches. The S3 gateway uses a convention where the S3 "bucket" maps to a lakeFS repository and the object key prefix encodes the branch or reference. The path format is s3://repository/branch/path/to/object.
Key considerations:
- The S3 bucket name corresponds to the lakeFS repository name
- The first path segment after the bucket corresponds to the branch or reference
- Objects within a branch are accessed using their full path after the branch prefix
- Branch names, tag names, and commit IDs can all be used as references
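To make the path convention concrete, the following sketch builds and splits gateway URIs using only the standard library. The helper names and the `example-repo` repository are hypothetical; the only convention assumed is the `s3://repository/branch/path/to/object` format described above.

```python
from urllib.parse import urlparse


def gateway_uri(repository, ref, path):
    """Compose s3://repository/ref/path for use with the gateway."""
    return f"s3://{repository}/{ref}/{path.lstrip('/')}"


def split_gateway_uri(uri):
    """Split a gateway URI into the S3 bucket/key and the lakeFS ref/path."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an s3:// URI: {uri!r}")
    key = parsed.path.lstrip("/")        # what an S3 client sends as the key
    ref, _, path = key.partition("/")    # first key segment is the ref
    return {"bucket": parsed.netloc, "key": key, "ref": ref, "path": path}
```

For example, `split_gateway_uri("s3://example-repo/main/data/a.parquet")` yields the bucket `example-repo`, the key `main/data/a.parquet`, the ref `main`, and the path `data/a.parquet`.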
Step 3: Read Data via S3 Protocol
Use standard S3 read operations (GetObject, ListObjects, HeadObject) through the gateway to access versioned data. Data tools issue standard S3 API calls that the gateway translates to lakeFS object reads. This enables SQL engines, notebooks, and other tools to query versioned datasets transparently.
Key considerations:
- All standard S3 read operations are supported
- Pagination and prefix filtering work as expected
- Pre-signed URLs can be generated for temporary access
- Read operations can target any branch, tag, or commit reference
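The read path can be sketched as two helpers that take an already-configured S3 client (for instance a boto3 client pointed at the gateway). The helper names are hypothetical; `list_objects_v2` and `get_object` are the standard boto3 S3 client calls, and the ref is simply prepended to the key.

```python
def list_keys(client, repository, ref, prefix=""):
    """Yield every object key under ref/prefix, following pagination."""
    kwargs = {"Bucket": repository, "Prefix": f"{ref}/{prefix}"}
    while True:
        page = client.list_objects_v2(**kwargs)
        for item in page.get("Contents", []):
            yield item["Key"]
        if not page.get("IsTruncated"):
            break
        # Ask for the next page using the token from the current one.
        kwargs["ContinuationToken"] = page["NextContinuationToken"]


def read_object(client, repository, ref, path):
    """Return the bytes of one object at the given branch, tag, or commit."""
    response = client.get_object(Bucket=repository, Key=f"{ref}/{path}")
    return response["Body"].read()
```

Because the ref is part of the key, reading the same path from a different branch or commit only changes the `ref` argument. The same Bucket/Key pair would be passed to boto3's `generate_presigned_url` when temporary access is needed.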
Step 4: Write Data via S3 Protocol
Use standard S3 write operations (PutObject, multipart upload, DeleteObject, CopyObject) through the gateway to modify data on branches. Writes through the S3 gateway behave the same as writes through the lakeFS API: changes are staged on the branch until committed.
Key considerations:
- Writes are permitted only on branches; tags and commit IDs are read-only references
- Multipart uploads are supported for large objects
- Object deletion is supported via the S3 DeleteObject API
- Writes through the gateway still require an explicit lakeFS commit to finalize
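A minimal sketch of the write path, again assuming an S3 client already configured for the gateway. The helper names are hypothetical; `put_object` and `delete_object` are the standard boto3 calls. Neither helper commits anything: the changes stay staged on the branch.

```python
def branch_key(branch, path):
    """Prefix an object path with the branch it should be written to."""
    return f"{branch}/{path.lstrip('/')}"


def write_object(client, repository, branch, path, data):
    """Stage one object on a branch (uncommitted until a lakeFS commit)."""
    return client.put_object(
        Bucket=repository, Key=branch_key(branch, path), Body=data
    )


def delete_object(client, repository, branch, path):
    """Stage the deletion of one object on a branch."""
    return client.delete_object(Bucket=repository, Key=branch_key(branch, path))
```

For large files, boto3's `upload_file` switches to multipart upload on its own once the file exceeds its configured threshold, so callers normally do not need to drive the multipart API by hand.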
Step 5: Commit and Manage via lakeFS API
After writing data through the S3 gateway, use the lakeFS API or CLI to commit changes, create branches, merge, and perform other version control operations. The S3 gateway handles data I/O, while lifecycle management (commits, merges, tags) is performed through the lakeFS-native interface.
Key considerations:
- The S3 gateway does not expose commit/merge/branch operations
- Use the lakeFS REST API, Python SDK, or lakectl CLI for version control operations
- Changes written via S3 gateway appear as uncommitted until explicitly committed
- This hybrid approach (S3 for data, lakeFS API for versioning) is the standard pattern
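The commit step can be sketched with the standard library alone. This sketch assumes the lakeFS REST API is served under `/api/v1`, that a commit is created with `POST /repositories/{repository}/branches/{branch}/commits` and a JSON `message` field, and that HTTP Basic authentication with the access key pair is accepted. Verify these against the API reference for the deployed lakeFS version; the function names are hypothetical.

```python
import base64
import json
import urllib.request


def build_commit_request(api_base, repository, branch, message,
                         access_key_id, secret_access_key):
    """Build (but do not send) the HTTP request that commits staged changes."""
    url = f"{api_base.rstrip('/')}/repositories/{repository}/branches/{branch}/commits"
    credentials = f"{access_key_id}:{secret_access_key}".encode()
    token = base64.b64encode(credentials).decode()
    return urllib.request.Request(
        url,
        data=json.dumps({"message": message}).encode(),
        headers={
            "Authorization": f"Basic {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def commit(api_base, repository, branch, message,
           access_key_id, secret_access_key):
    """Send the commit request and return the decoded JSON response."""
    request = build_commit_request(api_base, repository, branch, message,
                                   access_key_id, secret_access_key)
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

A typical job would write its outputs through the S3 gateway and then call something like `commit("http://localhost:8000/api/v1", "example-repo", "feature-1", "Load daily batch", key_id, secret)`. The Python SDK or `lakectl` would serve equally well for this step.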