Principle:Treeverse LakeFS S3 Data Reading
| Knowledge Sources | |
|---|---|
| Domains | S3_Compatibility, Data_Integration |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
Versioned data in lakeFS can be read through standard S3-compatible operations, giving clients transparent access to any version of the data.
Description
Reading data through the lakeFS S3 gateway provides transparent access to any version of data stored in a lakeFS repository. Clients can read from:
- Specific branches to get the latest committed state plus any uncommitted changes
- Commit IDs in the key prefix for point-in-time access to historical snapshots
- Tags for named versions of the data
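This enables tools like Spark, pandas, DuckDB, and Presto to read versioned data without any lakeFS-specific code. The repository name serves as the bucket, and the ref (branch, commit ID, or tag) is the first segment of the object key. The S3 gateway translates each standard S3 read operation into the corresponding lakeFS object lookup.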
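As a minimal sketch of this addressing scheme, the helpers below build the object key and URI for a given ref. The function names `versioned_key` and `versioned_uri` are hypothetical and not part of lakeFS, and the tag name `v1` is illustrative. The only convention assumed is the one used in the examples on this page: the repository is the bucket and the ref is the first key segment.

```python
def versioned_key(ref: str, path: str) -> str:
    """Join a ref (branch, commit ID, or tag) and an object path into an S3 key."""
    if not ref or "/" in ref:
        raise ValueError("ref must be a single, non-empty key segment")
    return f"{ref}/{path.lstrip('/')}"


def versioned_uri(repo: str, ref: str, path: str, scheme: str = "s3") -> str:
    """Build a URI such as s3://my-repo/main/data/file.csv."""
    return f"{scheme}://{repo}/{versioned_key(ref, path)}"


# The same path addressed through a branch, a commit ID, and a tag.
print(versioned_uri("my-repo", "main", "data/file.csv"))
print(versioned_uri("my-repo", "a1b2c3d4", "data/file.csv"))
print(versioned_uri("my-repo", "v1", "data/file.csv"))
```

Tools that go through the Hadoop S3A connector, such as Spark, would typically pass `scheme="s3a"`; switching versions then only means changing the `ref` argument.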
Usage
Use this principle when:
- Reading data from a lakeFS repository using S3-compatible tools
- Accessing specific branches or historical commits through the S3 protocol
- Building data pipelines that need to read from versioned datasets
- Enabling data exploration tools (Jupyter, DuckDB) to query versioned data
Theoretical Basis
The S3 gateway supports the following read operations:
| S3 Operation | Description | lakeFS Translation |
|---|---|---|
| GetObject | Retrieve the content of an object | Read object bytes from the specified ref and path |
| HeadObject | Retrieve object metadata without body | Stat the object to get size, ETag, content type, user metadata |
| ListObjectsV2 | List objects with a prefix | List entries under the specified ref and path prefix |
| Presigned URL (GET) | Generate a time-limited download URL | Create a signed URL that grants temporary read access |
Key behaviors:
- GetObject returns the full object content along with metadata headers (ETag, Content-Type, Content-Length, user-defined metadata)
- HeadObject (StatObject) returns metadata without transferring the object body, useful for checking existence and size
- ListObjectsV2 supports prefix-based filtering, delimiter-based hierarchy, and pagination via continuation tokens
- Presigned URLs enable delegated read access without sharing credentials
- Conditional requests (If-Match and If-None-Match, both evaluated against the ETag) and range requests for partial reads are supported
- A 404 Not Found is returned when the requested object does not exist on the specified ref
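The continuation-token pagination mentioned above can be sketched as follows. This is an illustration under stated assumptions, not lakeFS code: `iter_keys` is a hypothetical helper, and `client` is assumed to be any object exposing a boto3-style `list_objects_v2` method that returns the standard `Contents`, `IsTruncated`, and `NextContinuationToken` fields.

```python
from typing import Any, Iterator


def iter_keys(client: Any, bucket: str, prefix: str) -> Iterator[str]:
    """Yield every key under `prefix`, following continuation tokens across pages."""
    kwargs = {"Bucket": bucket, "Prefix": prefix}
    while True:
        page = client.list_objects_v2(**kwargs)
        for entry in page.get("Contents", []):
            yield entry["Key"]
        if not page.get("IsTruncated"):
            break
        # Ask for the next page, starting where this one ended.
        kwargs["ContinuationToken"] = page["NextContinuationToken"]
```

With boto3, the client would be created with an `endpoint_url` pointing at the lakeFS server and called as `iter_keys(client, "my-repo", "main/data/")`. boto3's built-in `get_paginator("list_objects_v2")` performs the same loop and is usually the more convenient choice; the explicit version is shown only to make the token handling visible.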
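Example of reading versioned data with boto3 (the endpoint URL is a placeholder for your lakeFS server; credentials are assumed to come from the environment):
import boto3
# S3 client pointed at the lakeFS S3 gateway instead of AWS
s3_client = boto3.client("s3", endpoint_url="https://lakefs.example.com")
# Read the latest data on the main branch
obj = s3_client.get_object(Bucket="my-repo", Key="main/data/file.csv")
# Read data from a specific commit (point-in-time)
obj = s3_client.get_object(Bucket="my-repo", Key="a1b2c3d4/data/file.csv")
# Check that an object exists without downloading it
head = s3_client.head_object(Bucket="my-repo", Key="main/data/file.csv")
size = head["ContentLength"]
# List objects under a branch prefix (returns the first page only)
objects = s3_client.list_objects_v2(Bucket="my-repo", Prefix="main/data/")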
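Range and conditional reads follow the same pattern. The sketch below assumes a boto3-style client whose `get_object` accepts the `Range` and `IfMatch` parameters; `read_range` is a hypothetical helper name, and the 412 behavior noted in the comment is standard S3 semantics rather than something this page documents for lakeFS specifically.

```python
from typing import Any, Optional


def read_range(client: Any, bucket: str, key: str, start: int, end: int,
               if_match: Optional[str] = None) -> bytes:
    """Read bytes start..end (inclusive) of an object, optionally pinned to an ETag."""
    if start < 0 or end < start:
        raise ValueError("expected 0 <= start <= end")
    kwargs = {"Bucket": bucket, "Key": key, "Range": f"bytes={start}-{end}"}
    if if_match is not None:
        # Under standard S3 semantics the request fails with
        # 412 Precondition Failed if the object's ETag no longer matches.
        kwargs["IfMatch"] = if_match
    response = client.get_object(**kwargs)
    return response["Body"].read()
```

For example, `read_range(s3_client, "my-repo", "main/data/file.csv", 0, 1023)` would fetch the first 1,024 bytes of the file, which is often enough to inspect a header row without downloading the whole object.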