Principle:Treeverse LakeFS S3 Data Reading

Knowledge Sources	lakeFS, lakeFS Documentation
Domains	S3_Compatibility, Data_Integration
Last Updated	2026-02-08 00:00 GMT

Overview

Versioned data is read through standard S3-compatible protocol operations, giving clients transparent access to any version of the data.

Description

Reading data through the lakeFS S3 gateway provides transparent access to any version of data stored in a lakeFS repository. Clients can read from:

Specific branches to get the latest committed state plus any uncommitted changes
Commit IDs in the key prefix for point-in-time access to historical snapshots
Tags for named versions of the data

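This enables tools like Spark, pandas, DuckDB, and Presto to read versioned data without any lakeFS-specific code: the repository is addressed as the bucket, and the ref (branch, commit ID, or tag) is the first component of the object key. The S3 gateway translates each standard S3 read operation into the corresponding lakeFS object lookup.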
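The addressing scheme described above can be sketched as a pair of small helpers. This is a minimal illustration, not part of lakeFS: the function names `versioned_key` and `s3_uri` are hypothetical, the tag name `v1.0` is made up, and the sketch assumes ref names never contain a `/`.

```python
def versioned_key(ref: str, path: str) -> str:
    """Join a ref (branch, commit ID, or tag) and an object path into an S3 key."""
    # Assumption for this sketch: a ref is a single key component without '/'.
    if not ref or "/" in ref:
        raise ValueError("ref must be a non-empty name without '/'")
    return f"{ref}/{path.lstrip('/')}"


def s3_uri(repository: str, ref: str, path: str, scheme: str = "s3") -> str:
    """Build the URI handed to an S3-compatible tool; the repository acts as the bucket."""
    return f"{scheme}://{repository}/{versioned_key(ref, path)}"


# The same logical path on a branch, at a commit, and at a tag.
branch_uri = s3_uri("my-repo", "main", "data/file.csv")
commit_uri = s3_uri("my-repo", "a1b2c3d4", "data/file.csv")
tag_uri = s3_uri("my-repo", "v1.0", "data/file.csv", scheme="s3a")
```

Only the ref component changes between the three URIs, which is why a tool that already reads from S3 needs no version-aware code of its own.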

Usage

Use this principle when:

Reading data from a lakeFS repository using S3-compatible tools
Accessing specific branches or historical commits through the S3 protocol
Building data pipelines that need to read from versioned datasets
Enabling data exploration tools (Jupyter, DuckDB) to query versioned data

Theoretical Basis

The S3 gateway supports the following read operations:

S3 Operation	Description	lakeFS Translation
GetObject	Retrieve the content of an object	Read object bytes from the specified ref and path
HeadObject	Retrieve object metadata without body	Stat the object to get size, ETag, content type, user metadata
ListObjectsV2	List objects with a prefix	List entries under the specified ref and path prefix
Presigned URL (GET)	Generate a time-limited download URL	Create a signed URL that grants temporary read access
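To make the "lakeFS Translation" column concrete, the sketch below shows one way an incoming S3 request's bucket and key could be resolved into a repository, ref, and path. It is an illustration of the mapping only, not the gateway's actual implementation; `ObjectAddress` and `resolve_address` are hypothetical names.

```python
from typing import NamedTuple


class ObjectAddress(NamedTuple):
    repository: str  # taken from the S3 bucket name
    ref: str         # branch, commit ID, or tag: first key component
    path: str        # remainder of the key, relative to the ref


def resolve_address(bucket: str, key: str) -> ObjectAddress:
    """Split an S3 (bucket, key) pair into repository, ref, and path."""
    ref, _, path = key.partition("/")
    if not ref:
        raise ValueError("key must start with a ref")
    return ObjectAddress(repository=bucket, ref=ref, path=path)


# A GetObject key and a ListObjectsV2 prefix resolve the same way.
get_target = resolve_address("my-repo", "a1b2c3d4/data/file.csv")
list_target = resolve_address("my-repo", "main/data/")
```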

Key behaviors:

GetObject returns the full object content along with metadata headers (ETag, Content-Type, Content-Length, user-defined metadata)
HeadObject (StatObject) returns metadata without transferring the object body, useful for checking existence and size
ListObjectsV2 supports prefix-based filtering, delimiter-based hierarchy, and pagination via continuation tokens
Presigned URLs enable delegated read access without sharing credentials
Conditional requests based on the ETag (If-Match, If-None-Match) are supported, as are range requests for partial reads
A 404 Not Found is returned when the requested object does not exist on the specified ref (branch, commit, or tag)
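Two of the behaviors above, pagination via continuation tokens and range requests, can be sketched as follows. The helpers are hypothetical and take any client object that exposes boto3-style `list_objects_v2` and `get_object` methods; nothing here is specific to lakeFS beyond the key layout.

```python
from typing import Any, Iterator


def iter_keys(client: Any, bucket: str, prefix: str) -> Iterator[str]:
    """Yield every key under a prefix, following continuation tokens page by page."""
    kwargs = {"Bucket": bucket, "Prefix": prefix}
    while True:
        page = client.list_objects_v2(**kwargs)
        for entry in page.get("Contents", []):
            yield entry["Key"]
        if not page.get("IsTruncated"):
            break
        # Ask for the next page, starting where this one ended.
        kwargs["ContinuationToken"] = page["NextContinuationToken"]


def read_range(client: Any, bucket: str, key: str, start: int, end: int) -> bytes:
    """Read bytes start..end (inclusive) of an object using an HTTP Range request."""
    response = client.get_object(Bucket=bucket, Key=key, Range=f"bytes={start}-{end}")
    return response["Body"].read()
```

With a boto3 client pointed at the gateway endpoint, `list(iter_keys(client, "my-repo", "main/data/"))` would collect all keys under the prefix on the main branch, and `read_range(client, "my-repo", "main/data/file.csv", 0, 1023)` would fetch only the first kilobyte of the file.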

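Example of reading versioned data (Python with boto3; the endpoint URL is a placeholder, and credentials are assumed to come from the environment):

import boto3

# Point the client at the lakeFS S3 gateway (placeholder endpoint, an assumption)
s3_client = boto3.client("s3", endpoint_url="https://lakefs.example.com")

# Read the latest data on the main branch
obj = s3_client.get_object(Bucket="my-repo", Key="main/data/file.csv")
content = obj["Body"].read()

# Read data from a specific commit (point-in-time)
obj = s3_client.get_object(Bucket="my-repo", Key="a1b2c3d4/data/file.csv")

# Check whether an object exists without downloading it
head = s3_client.head_object(Bucket="my-repo", Key="main/data/file.csv")
size = head["ContentLength"]

# List objects under a branch prefix (returns the first page of results only)
objects = s3_client.list_objects_v2(Bucket="my-repo", Prefix="main/data/")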

Related Pages

Implemented By

Implementation:Treeverse_LakeFS_S3_GetObject
