Workflow:Treeverse LakeFS S3 Gateway Integration

From Leeroopedia


Knowledge Sources
Domains Data_Engineering, Data_Lake_Management, Cloud_Storage
Last Updated 2026-02-08 10:00 GMT

Overview

End-to-end process for accessing lakeFS repositories through the S3-compatible gateway, enabling seamless integration with existing data tools such as Spark, Hive, AWS Athena, DuckDB, and Presto.

Description

This workflow describes how to use the lakeFS S3 gateway to work with versioned data through the standard S3 protocol and tools. The gateway translates S3 API calls into lakeFS operations, so any S3-compatible application can read and write versioned data without modification. It maps the S3 bucket name to a lakeFS repository and the first segment of the object key to a branch or other reference, and it supports multipart uploads, pre-signed URLs, and object tagging. Existing data infrastructure gains data versioning without application changes.

Usage

Execute this workflow when you need to integrate lakeFS with existing S3-compatible data tools and frameworks without modifying application code. Common triggers include: connecting Spark jobs to lakeFS-managed data, reading lakeFS data from AWS Athena or Presto, using DuckDB to query versioned Parquet files, integrating with ETL tools that support S3 protocol, or enabling data scientists to access versioned datasets using familiar S3 URIs.
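
For the Spark trigger mentioned above, the connection usually comes down to a handful of Hadoop S3A properties. The sketch below collects them in a plain dictionary; the helper name `lakefs_s3a_conf`, the endpoint `http://localhost:8000`, and the credential values are placeholders chosen for illustration, not values from this workflow.

```python
def lakefs_s3a_conf(endpoint, access_key, secret_key):
    """Return Spark properties that point the S3A connector at a lakeFS gateway.

    The property names are standard Hadoop S3A settings; the values passed in
    are assumptions supplied by the caller.
    """
    return {
        "spark.hadoop.fs.s3a.endpoint": endpoint,
        "spark.hadoop.fs.s3a.access.key": access_key,
        "spark.hadoop.fs.s3a.secret.key": secret_key,
        # Ask S3A for path-style requests (endpoint/bucket/key).
        "spark.hadoop.fs.s3a.path.style.access": "true",
    }


# Placeholder endpoint and credentials, for illustration only.
conf = lakefs_s3a_conf("http://localhost:8000", "AKIAEXAMPLE", "example-secret")
```

With PySpark installed, each pair can be passed to `SparkSession.builder.config(key, value)`, after which a path such as `s3a://example-repo/main/data/` is resolved by the gateway rather than by AWS.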

Execution Steps

Step 1: Configure S3 Gateway Endpoint

Configure data tools to use the lakeFS server as an S3 endpoint. The lakeFS server exposes an S3-compatible API on the same port as the main API. Applications connect using the lakeFS endpoint URL and lakeFS access credentials (access key ID and secret access key) in place of AWS credentials.

Key considerations:

  • The S3 gateway runs on the same lakeFS server (typically port 8000)
  • Applications use lakeFS access key credentials for authentication
  • The endpoint URL replaces the standard AWS S3 endpoint in client configuration
  • Path-style addressing is the simplest choice; virtual-hosted-style addressing requires additional DNS and gateway domain configuration
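
As a minimal sketch of the client configuration described in this step, the following uses boto3 with a custom endpoint and path-style addressing. It assumes boto3 is installed; the function names and any endpoint or credential values supplied by the caller are hypothetical.

```python
def gateway_client_kwargs(endpoint_url, access_key_id, secret_access_key):
    """Keyword arguments for boto3.client("s3", ...) aimed at the lakeFS gateway."""
    return {
        # The lakeFS server URL replaces the default AWS S3 endpoint.
        "endpoint_url": endpoint_url,
        # lakeFS credentials are passed in the slots normally used for AWS keys.
        "aws_access_key_id": access_key_id,
        "aws_secret_access_key": secret_access_key,
    }


def make_gateway_client(endpoint_url, access_key_id, secret_access_key):
    """Create an S3 client that talks to lakeFS using path-style addressing."""
    # Imported lazily so the helper above stays usable without boto3 installed.
    import boto3
    from botocore.config import Config

    return boto3.client(
        "s3",
        config=Config(s3={"addressing_style": "path"}),
        **gateway_client_kwargs(endpoint_url, access_key_id, secret_access_key),
    )
```

A call such as `make_gateway_client("http://localhost:8000", key_id, secret)` would then return a client whose requests go to the lakeFS server instead of AWS.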

Step 2: Map Repository and Branch to S3 Path

Construct S3-compatible paths that reference lakeFS repositories and branches. The S3 gateway uses a convention where the S3 "bucket" maps to a lakeFS repository and the object key prefix encodes the branch or reference. The path format is s3://repository/branch/path/to/object.

Key considerations:

  • The S3 bucket name corresponds to the lakeFS repository name
  • The first path segment after the bucket corresponds to the branch or reference
  • Objects within a branch are accessed using their full path after the branch prefix
  • Both branch names and commit IDs can be used as references
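
The path convention can be sketched as a pair of helpers that build and split `s3://repository/ref/path` URIs. The type `LakeFSLocation` and both function names are illustrative, not part of lakeFS.

```python
from typing import NamedTuple


class LakeFSLocation(NamedTuple):
    repository: str  # appears to S3 clients as the bucket
    ref: str         # branch name or commit ID: the first key segment
    path: str        # object path inside the ref


def to_s3_uri(loc):
    """Build the S3-style URI for a lakeFS location."""
    return f"s3://{loc.repository}/{loc.ref}/{loc.path}"


def parse_s3_uri(uri):
    """Split an S3-style URI into repository, ref, and object path."""
    scheme = "s3://"
    if not uri.startswith(scheme):
        raise ValueError(f"not an s3:// URI: {uri!r}")
    parts = uri[len(scheme):].split("/", 2)
    if len(parts) < 2 or not parts[0] or not parts[1]:
        raise ValueError(f"expected s3://repository/ref/path, got {uri!r}")
    path = parts[2] if len(parts) == 3 else ""
    return LakeFSLocation(parts[0], parts[1], path)
```

For example, `parse_s3_uri("s3://example-repo/main/data/events.parquet")` separates the repository `example-repo`, the ref `main`, and the path `data/events.parquet`.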

Step 3: Read Data via S3 Protocol

Use standard S3 read operations (GetObject, ListObjects, HeadObject) through the gateway to access versioned data. Data tools issue standard S3 API calls that the gateway translates to lakeFS object reads. This enables SQL engines, notebooks, and other tools to query versioned datasets transparently.

Key considerations:

  • All standard S3 read operations are supported
  • Pagination and prefix filtering work as expected
  • Pre-signed URLs can be generated for temporary access
  • Read operations can target any branch, tag, or commit reference
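
A read through the gateway might look like the sketch below. It takes any boto3-style S3 client as `client`; the function names are hypothetical, and the code assumes that listed keys come back with the ref as their first segment, mirroring how they are addressed.

```python
def ref_prefix(ref, prefix=""):
    """Key prefix for a listing: the ref comes first, then the path prefix."""
    return f"{ref}/{prefix}"


def list_paths(client, repository, ref, prefix=""):
    """Yield object paths under `prefix` on `ref`, with the leading ref removed."""
    paginator = client.get_paginator("list_objects_v2")
    pages = paginator.paginate(Bucket=repository, Prefix=ref_prefix(ref, prefix))
    for page in pages:
        # A page with no matches has no "Contents" entry.
        for obj in page.get("Contents", []):
            yield obj["Key"][len(ref) + 1:]


def read_object(client, repository, ref, path):
    """Fetch one object from a branch, tag, or commit and return its bytes."""
    response = client.get_object(Bucket=repository, Key=f"{ref}/{path}")
    return response["Body"].read()
```

Because the ref is just part of the key, reading the same path from a different branch or commit only changes the `ref` argument.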

Step 4: Write Data via S3 Protocol

Use standard S3 write operations (PutObject, multipart upload, DeleteObject, CopyObject) through the gateway to modify data on branches. Writes through the S3 gateway behave the same as writes through the lakeFS API: changes are staged on the branch and remain uncommitted until a commit is made.

Key considerations:

  • Writes are only permitted on mutable branches (not tags or commits)
  • Multipart uploads are supported for large objects
  • Object deletion is supported via the S3 DeleteObject API
  • Writes through the gateway still require an explicit lakeFS commit to finalize
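
The write operations in this step can be sketched with a boto3-style client passed in as `client`. The helper names are illustrative. The large-file helper relies on boto3's managed `upload_file` transfer, which switches to multipart upload for files above its size threshold.

```python
def branch_key(branch, path):
    """Object key for a write: the branch name followed by the object path."""
    return f"{branch}/{path.lstrip('/')}"


def put_bytes(client, repository, branch, path, body):
    """Stage a small object on a branch with a single PutObject call."""
    client.put_object(Bucket=repository, Key=branch_key(branch, path), Body=body)


def upload_large_file(client, repository, branch, path, filename):
    """Stage a local file; boto3 decides whether to use multipart upload."""
    client.upload_file(filename, repository, branch_key(branch, path))


def delete_path(client, repository, branch, path):
    """Stage the deletion of one object on a branch."""
    client.delete_object(Bucket=repository, Key=branch_key(branch, path))
```

None of these calls creates a commit; the objects they touch stay in the branch's uncommitted changes.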

Step 5: Commit and Manage via lakeFS API

After writing data through the S3 gateway, use the lakeFS API or CLI to commit changes, create branches, merge, and perform other version control operations. The S3 gateway handles data I/O, while lifecycle management (commits, merges, tags) is performed through the lakeFS-native interface.

Key considerations:

  • The S3 gateway does not expose commit/merge/branch operations
  • Use the lakeFS REST API, Python SDK, or lakectl CLI for version control operations
  • Changes written via S3 gateway appear as uncommitted until explicitly committed
  • This hybrid approach (S3 for data, lakeFS API for versioning) is the standard pattern
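
To illustrate the versioning half of the pattern, the sketch below prepares a commit request for the lakeFS REST API using only the standard library. It assumes the API is served under `/api/v1` on the same endpoint as the gateway and accepts HTTP basic authentication with the lakeFS access key pair; the function name and all argument values are hypothetical.

```python
import base64
import json
import urllib.request


def build_commit_request(endpoint, repository, branch, message,
                         access_key_id, secret_access_key):
    """Prepare (but do not send) a POST that commits a branch's staged changes."""
    url = (f"{endpoint.rstrip('/')}/api/v1/repositories/"
           f"{repository}/branches/{branch}/commits")
    # Basic auth: base64 of "access_key_id:secret_access_key".
    token = base64.b64encode(
        f"{access_key_id}:{secret_access_key}".encode()
    ).decode()
    return urllib.request.Request(
        url,
        data=json.dumps({"message": message}).encode(),
        headers={
            "Authorization": f"Basic {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the returned request to `urllib.request.urlopen` would send it. The equivalent lakectl invocation has the form `lakectl commit lakefs://example-repo/main -m "message"`.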
