Workflow:Treeverse LakeFS S3 Gateway Integration

Knowledge Sources	lakeFS, lakeFS Documentation, S3 Gateway
Domains	Data_Engineering, Data_Lake_Management, Cloud_Storage
Last Updated	2026-02-08 10:00 GMT

Overview

End-to-end process for accessing lakeFS repositories through the S3-compatible gateway, so that existing data tools such as Spark, Hive, AWS Athena, DuckDB, and Presto can read and write versioned data over the S3 protocol.

Description

This workflow describes how to use the lakeFS S3 gateway to interact with versioned data using the standard S3 protocol and tools. The S3 gateway translates S3 API calls into lakeFS operations, allowing S3-compatible applications to read and write versioned data without code changes. The gateway maps the S3 bucket name to a lakeFS repository and the first segment of the object key to a branch or other reference, and it supports multipart uploads, pre-signed URLs, and object tagging. Existing data infrastructure gains data versioning by changing only the endpoint, credentials, and object paths.

Usage

Execute this workflow when you need to integrate lakeFS with existing S3-compatible data tools and frameworks without modifying application code. Common triggers include: connecting Spark jobs to lakeFS-managed data, reading lakeFS data from AWS Athena or Presto, using DuckDB to query versioned Parquet files, integrating with ETL tools that support the S3 protocol, or enabling data scientists to access versioned datasets using familiar S3 URIs.

Execution Steps

Step 1: Configure S3 Gateway Endpoint

Configure data tools to use the lakeFS server as an S3 endpoint. The lakeFS server exposes an S3-compatible API on the same port as the main API. Applications connect using the lakeFS endpoint URL and lakeFS access credentials (access key ID and secret access key) in place of AWS credentials.

Key considerations:

The S3 gateway runs on the same lakeFS server (typically port 8000)
Applications use lakeFS access key credentials for authentication
The endpoint URL replaces the standard AWS S3 endpoint in client configuration
Path-style addressing is the simplest option; virtual-hosted-style requests require additional DNS and gateway domain configuration
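
The configuration above can be sketched in Python, assuming boto3 as the S3 client library (the workflow itself does not prescribe one). The endpoint URL, credentials, and region below are hypothetical placeholders, not values from this workflow; the region is included only because some client setups expect one.

```python
from typing import Any, Dict

# Hypothetical placeholder: replace with the address of your lakeFS server.
LAKEFS_ENDPOINT = "http://localhost:8000"


def gateway_client_kwargs(
    endpoint: str, access_key_id: str, secret_access_key: str
) -> Dict[str, Any]:
    """Build keyword arguments for boto3.client("s3", ...) aimed at lakeFS.

    The lakeFS endpoint replaces the AWS S3 endpoint, and lakeFS access
    credentials replace AWS credentials.
    """
    return {
        "endpoint_url": endpoint,
        "aws_access_key_id": access_key_id,
        "aws_secret_access_key": secret_access_key,
        # Assumption: a placeholder region to satisfy the client.
        "region_name": "us-east-1",
    }


def make_gateway_client(endpoint: str, access_key_id: str, secret_access_key: str):
    """Create a boto3 S3 client that uses path-style addressing."""
    # Imported lazily so the helper above works without boto3 installed.
    import boto3
    from botocore.config import Config

    return boto3.client(
        "s3",
        config=Config(s3={"addressing_style": "path"}),
        **gateway_client_kwargs(endpoint, access_key_id, secret_access_key),
    )
```

A client created this way could then be passed to any code that already expects a boto3 S3 client.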

Step 2: Map Repository and Branch to S3 Path

Construct S3-compatible paths that reference lakeFS repositories and branches. The S3 gateway uses a convention where the S3 "bucket" maps to a lakeFS repository and the object key prefix encodes the branch or reference. The path format is s3://repository/branch/path/to/object.

Key considerations:

The S3 bucket name corresponds to the lakeFS repository name
The first path segment after the bucket corresponds to the branch or reference
Objects within a branch are accessed using their full path after the branch prefix
Branch names, tag names, and commit IDs can all be used as references
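
The path convention can be illustrated with a few small helpers. This is a sketch only: the function and type names are hypothetical, and it assumes the reference itself contains no "/" so that the first key segment identifies it unambiguously.

```python
from typing import NamedTuple, Tuple


class LakeFSLocation(NamedTuple):
    repository: str  # maps to the S3 bucket
    ref: str         # branch, tag, or commit ID (first key segment)
    path: str        # object path inside the reference


def to_s3_uri(repository: str, ref: str, path: str) -> str:
    """Build s3://repository/ref/path/to/object."""
    return f"s3://{repository}/{ref}/{path.lstrip('/')}"


def to_bucket_and_key(repository: str, ref: str, path: str) -> Tuple[str, str]:
    """Return the (Bucket, Key) pair an S3 client would send to the gateway."""
    return repository, f"{ref}/{path.lstrip('/')}"


def parse_s3_uri(uri: str) -> LakeFSLocation:
    """Split an S3 URI into repository, reference, and object path."""
    prefix = "s3://"
    if not uri.startswith(prefix):
        raise ValueError(f"not an S3 URI: {uri!r}")
    parts = uri[len(prefix):].split("/", 2)
    if len(parts) < 2 or not parts[0] or not parts[1]:
        raise ValueError(f"URI must include a repository and a reference: {uri!r}")
    path = parts[2] if len(parts) == 3 else ""
    return LakeFSLocation(parts[0], parts[1], path)
```

For example, a repository named example-repo with a branch named main (both hypothetical) would expose an object as s3://example-repo/main/data/file.parquet.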

Step 3: Read Data via S3 Protocol

Use standard S3 read operations (GetObject, ListObjects, HeadObject) through the gateway to access versioned data. Data tools issue standard S3 API calls that the gateway translates to lakeFS object reads. This enables SQL engines, notebooks, and other tools to query versioned datasets transparently.

Key considerations:

All standard S3 read operations are supported
Pagination and prefix filtering follow standard S3 listing semantics
Pre-signed URLs can be generated for temporary access
Read operations can target any branch, tag, or commit reference
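
The read operations above might look like the following when issued through a boto3-style S3 client. This is a sketch under the assumption that `s3` is a client already pointed at the lakeFS endpoint; the helper names are hypothetical.

```python
from typing import Any, Iterator


def iter_keys(s3: Any, repository: str, ref: str, prefix: str = "") -> Iterator[str]:
    """List object keys under a reference, following pagination."""
    paginator = s3.get_paginator("list_objects_v2")
    # The reference is the first segment of the key prefix.
    for page in paginator.paginate(Bucket=repository, Prefix=f"{ref}/{prefix}"):
        for obj in page.get("Contents", []):
            yield obj["Key"]


def read_object(s3: Any, repository: str, ref: str, path: str) -> bytes:
    """Fetch one object's bytes from a branch, tag, or commit."""
    response = s3.get_object(Bucket=repository, Key=f"{ref}/{path}")
    return response["Body"].read()


def presign_read(
    s3: Any, repository: str, ref: str, path: str, expires_seconds: int = 3600
) -> str:
    """Generate a pre-signed GET URL for temporary access."""
    return s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": repository, "Key": f"{ref}/{path}"},
        ExpiresIn=expires_seconds,
    )
```

Because the reference is only a key prefix, the same helpers can target a branch, a tag, or a commit ID by changing the `ref` argument.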

Step 4: Write Data via S3 Protocol

Use standard S3 write operations (PutObject, multipart upload, DeleteObject, CopyObject) through the gateway to modify data on branches. Writes through the S3 gateway behave the same as writes through the lakeFS API: changes remain uncommitted on the branch until they are explicitly committed.

Key considerations:

Writes are only permitted on mutable branches (not tags or commits)
Multipart uploads are supported for large objects
Object deletion is supported via the S3 DeleteObject API
Writes through the gateway still require an explicit lakeFS commit to finalize
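
A sketch of the write operations, again assuming a boto3-style client named `s3` that is configured for the lakeFS endpoint. The helper names are hypothetical. With boto3, `upload_file` switches to a multipart upload on its own once a file exceeds the client's size threshold, which is one way to exercise the gateway's multipart support.

```python
from typing import Any


def write_object(s3: Any, repository: str, branch: str, path: str, data: bytes) -> None:
    """Write a small object to a branch with a single PutObject call."""
    s3.put_object(Bucket=repository, Key=f"{branch}/{path}", Body=data)


def upload_large_file(
    s3: Any, local_path: str, repository: str, branch: str, path: str
) -> None:
    """Upload a local file; boto3 uses multipart upload for large files."""
    s3.upload_file(local_path, repository, f"{branch}/{path}")


def delete_object(s3: Any, repository: str, branch: str, path: str) -> None:
    """Delete an object from a branch via DeleteObject."""
    s3.delete_object(Bucket=repository, Key=f"{branch}/{path}")
```

None of these calls creates a commit; the changes stay uncommitted on the branch until a commit is made through a lakeFS-native interface.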

Step 5: Commit and Manage via lakeFS API

After writing data through the S3 gateway, use the lakeFS API or CLI to commit changes, create branches, merge, and perform other version control operations. The S3 gateway handles data I/O, while lifecycle management (commits, merges, tags) is performed through the lakeFS-native interface.

Key considerations:

The S3 gateway does not expose commit/merge/branch operations
Use the lakeFS REST API, Python SDK, or lakectl CLI for version control operations
Changes written via S3 gateway appear as uncommitted until explicitly committed
This hybrid approach (S3 for data, lakeFS API for versioning) is the standard pattern
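
The commit step can be sketched with the Python standard library alone. The example assumes the lakeFS REST API's commit endpoint at /api/v1/repositories/{repository}/branches/{branch}/commits and HTTP basic authentication with the lakeFS access key pair; the helper names, endpoint address, and repository name are hypothetical. The Python SDK or lakectl would serve equally well.

```python
import base64
import json
import urllib.request


def build_commit_request(
    endpoint: str,
    repository: str,
    branch: str,
    message: str,
    access_key_id: str,
    secret_access_key: str,
) -> urllib.request.Request:
    """Build a POST request that commits a branch's uncommitted changes."""
    url = (
        f"{endpoint.rstrip('/')}/api/v1/repositories/"
        f"{repository}/branches/{branch}/commits"
    )
    # Basic auth: access key ID as the user, secret access key as the password.
    token = base64.b64encode(
        f"{access_key_id}:{secret_access_key}".encode()
    ).decode()
    return urllib.request.Request(
        url,
        data=json.dumps({"message": message}).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
        method="POST",
    )


def commit(request: urllib.request.Request) -> dict:
    """Send the commit request and return the parsed JSON response."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Separating request construction from sending keeps the data path (S3 gateway) and the versioning path (lakeFS API) visibly distinct, which mirrors the hybrid pattern described above.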
