
Principle:Treeverse LakeFS S3 Data Reading



Knowledge Sources
Domains S3_Compatibility, Data_Integration
Last Updated 2026-02-08 00:00 GMT

Overview

Reading versioned data through standard S3-compatible operations gives clients transparent access to any version of the data stored in a lakeFS repository.

Description

Reading data through the lakeFS S3 gateway provides transparent access to any version of data stored in a lakeFS repository. Clients can read from:

  • Specific branches to get the latest committed state plus any uncommitted changes
  • Commit IDs in the key prefix for point-in-time access to historical snapshots
  • Tags for named versions of the data

This enables tools like Spark, pandas, DuckDB, and Presto to read versioned data without any lakeFS-specific code. The S3 gateway translates standard S3 read operations into the appropriate lakeFS object retrieval.
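The addressing scheme described above can be sketched in a few lines of Python: the repository takes the place of the S3 bucket, and the ref (branch, commit ID, or tag) is the first segment of the object key. The helper names `lakefs_key` and `lakefs_uri` are hypothetical and not part of any lakeFS library; the repository, ref, and path values are placeholders.

```python
def lakefs_key(ref: str, path: str) -> str:
    """Build the S3 object key for a path at a given ref (branch, commit ID, or tag)."""
    if not ref:
        raise ValueError("ref must be a non-empty string")
    # The ref is the first key segment; the rest is the path inside the repository.
    return f"{ref}/{path.lstrip('/')}"


def lakefs_uri(repo: str, ref: str, path: str) -> str:
    """Build an s3:// URI; the repository plays the role of the bucket."""
    return f"s3://{repo}/{lakefs_key(ref, path)}"


# The same path addressed on a branch, at a commit, and at a tag (placeholder names).
for ref in ("main", "a1b2c3d4", "v1.0"):
    print(lakefs_uri("my-repo", ref, "data/file.csv"))
```

A URI built this way can be handed to any S3-aware reader that is configured to use the gateway endpoint; switching versions only changes the ref segment.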

Usage

Use this principle when:

  • Reading data from a lakeFS repository using S3-compatible tools
  • Accessing specific branches or historical commits through the S3 protocol
  • Building data pipelines that need to read from versioned datasets
  • Enabling data exploration tools (Jupyter, DuckDB) to query versioned data

Theoretical Basis

The S3 gateway supports the following read operations:

S3 Operation        | Description                           | lakeFS Translation
--------------------|---------------------------------------|-----------------------------------------------------------------
GetObject           | Retrieve the content of an object     | Read object bytes from the specified ref and path
HeadObject          | Retrieve object metadata without body | Stat the object to get size, ETag, content type, user metadata
ListObjectsV2       | List objects with a prefix            | List entries under the specified ref and path prefix
Presigned URL (GET) | Generate a time-limited download URL  | Create a signed URL that grants temporary read access
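As a minimal sketch of two rows of the table, the functions below wrap HeadObject and presigned GET URLs. They assume `s3_client` is a boto3 S3 client (or any object with the same `head_object` and `generate_presigned_url` methods) that has already been pointed at the lakeFS S3 gateway; the function names and the 900-second default expiry are illustrative choices, not lakeFS defaults.

```python
def object_size(s3_client, repo: str, ref: str, path: str) -> int:
    """Return the object's size in bytes via HeadObject (no body is transferred)."""
    head = s3_client.head_object(Bucket=repo, Key=f"{ref}/{path}")
    return head["ContentLength"]


def presigned_read_url(s3_client, repo: str, ref: str, path: str,
                       expires_s: int = 900) -> str:
    """Create a time-limited GET URL that can be shared without credentials."""
    return s3_client.generate_presigned_url(
        "get_object",
        Params={"Bucket": repo, "Key": f"{ref}/{path}"},
        ExpiresIn=expires_s,  # lifetime of the URL in seconds
    )
```

Passing the client in as a parameter keeps the helpers independent of how the endpoint and credentials are configured.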

Key behaviors:

  1. GetObject returns the full object content along with metadata headers (ETag, Content-Type, Content-Length, user-defined metadata)
  2. HeadObject (StatObject) returns metadata without transferring the object body, useful for checking existence and size
  3. ListObjectsV2 supports prefix-based filtering, delimiter-based hierarchy, and pagination via continuation tokens
  4. Presigned URLs enable delegated read access without sharing credentials
  5. Conditional requests (If-Match and If-None-Match, both evaluated against the ETag) are supported, as are range requests for partial reads
  6. A 404 Not Found is returned when the requested object does not exist at the specified ref (branch, commit, or tag)
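The pagination, range-read, and 404 behaviors listed above can be sketched as follows. This is a sketch under assumptions rather than a reference implementation: `s3_client` is assumed to be a boto3 S3 client configured for the lakeFS S3 gateway, the function names are hypothetical, and the set of error codes treated as "not found" is a defensive guess that covers how boto3 commonly reports a missing object.

```python
def list_keys(s3_client, repo: str, ref: str, prefix: str = ""):
    """Yield every key under ref/prefix, following continuation tokens."""
    kwargs = {"Bucket": repo, "Prefix": f"{ref}/{prefix}"}
    while True:
        page = s3_client.list_objects_v2(**kwargs)
        for entry in page.get("Contents", []):
            yield entry["Key"]
        if not page.get("IsTruncated"):
            return
        # Ask for the next page starting where this one stopped.
        kwargs["ContinuationToken"] = page["NextContinuationToken"]


def read_range(s3_client, repo: str, ref: str, path: str,
               start: int, end: int) -> bytes:
    """Read bytes start..end (inclusive) using an HTTP Range request."""
    resp = s3_client.get_object(
        Bucket=repo, Key=f"{ref}/{path}", Range=f"bytes={start}-{end}"
    )
    return resp["Body"].read()


def object_exists(s3_client, repo: str, ref: str, path: str) -> bool:
    """Use HeadObject to test existence; a 404 means the object is absent at this ref."""
    try:
        s3_client.head_object(Bucket=repo, Key=f"{ref}/{path}")
        return True
    except s3_client.exceptions.ClientError as err:
        if err.response["Error"]["Code"] in ("404", "NoSuchKey", "NotFound"):
            return False
        raise  # permission or connectivity problems should not be hidden
```

For example, `list(list_keys(s3_client, "my-repo", "main", "data/"))` collects all keys under `main/data/` regardless of how many pages the gateway returns.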

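Example of reading versioned data, written in Python with boto3 (the endpoint URL is a placeholder assumption; credentials are expected to come from the environment):

import boto3

# Assumption: placeholder address of the lakeFS S3 gateway
s3_client = boto3.client("s3", endpoint_url="https://lakefs.example.com")

# Read latest data on main branch
obj = s3_client.get_object(Bucket="my-repo", Key="main/data/file.csv")
content = obj["Body"].read()

# Read data from a specific commit (point-in-time)
obj = s3_client.get_object(Bucket="my-repo", Key="a1b2c3d4/data/file.csv")

# Check if object exists without downloading
head = s3_client.head_object(Bucket="my-repo", Key="main/data/file.csv")
size = head["ContentLength"]

# List all objects under a branch prefix
objects = s3_client.list_objects_v2(Bucket="my-repo", Prefix="main/data/")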

Related Pages

Implemented By
