Principle:Treeverse LakeFS Import Source Preparation

From Leeroopedia


Knowledge Sources
Domains: Data_Import, Data_Engineering
Last Updated: 2026-02-08 00:00 GMT

Overview

Import source preparation is the process of defining external object storage locations and mapping them to destination paths within a lakeFS repository for zero-copy data ingestion.

Description

When working with large-scale data lakes, organizations frequently need to bring existing data stored in external object storage (Amazon S3, Google Cloud Storage, or Azure Blob Storage) under version control. lakeFS supports this with a zero-copy import mechanism: instead of physically copying data into lakeFS-managed storage, it imports only the metadata (object paths, sizes, checksums), and the actual data remains in its original location.

Import source preparation is the critical first step in this workflow. It involves:

  • Identifying the external storage locations that contain the data to import
  • Choosing the appropriate source type for each location:
    • common_prefix -- imports an entire directory tree from a given path prefix (e.g., s3://bucket/data/ imports all objects under that prefix)
    • object -- imports a single, specific file (e.g., s3://bucket/data/file.parquet)
  • Mapping each source to a destination path within the lakeFS branch, which determines where the imported objects will appear in the repository's namespace

The scheme of each source path must match the blockstore type configured for the lakeFS installation. For example, if lakeFS is configured to use S3 as its backing store, import paths must use the s3:// scheme.
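
The scheme rule above can be checked before any import is attempted. The following is a minimal sketch, not part of lakeFS itself: the function name matches_blockstore is hypothetical, and the blockstore type labels ("s3", "gs", "azure") are assumed to correspond to the configured backing store.

```python
from urllib.parse import urlparse


def matches_blockstore(path: str, blockstore_type: str) -> bool:
    """Return True if the URI form of `path` fits the given blockstore type.

    Hypothetical helper: the type labels are assumptions for illustration.
    """
    parsed = urlparse(path)
    if blockstore_type == "s3":
        return parsed.scheme == "s3" and bool(parsed.netloc)
    if blockstore_type == "gs":
        return parsed.scheme == "gs" and bool(parsed.netloc)
    if blockstore_type == "azure":
        # Azure Blob Storage paths use https with a *.blob.core.windows.net host.
        return parsed.scheme == "https" and parsed.netloc.endswith(
            ".blob.core.windows.net"
        )
    raise ValueError(f"unsupported blockstore type: {blockstore_type!r}")
```

Running a check like this during preparation surfaces a mismatched path (for example, a gs:// source against an S3-backed installation) before the import request is sent.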

Usage

Use import source preparation when:

  • Onboarding existing data -- Bringing legacy datasets stored in cloud object storage under lakeFS version control for the first time
  • Incremental ingestion -- Periodically importing new data partitions (e.g., daily data drops) from an external landing zone into a versioned branch
  • Multi-source aggregation -- Combining data from multiple storage prefixes or individual files into a single lakeFS namespace for unified querying
  • Data reorganization -- Remapping the directory structure of external data by specifying different destination prefixes than the source paths
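
As an illustration of the incremental ingestion case, the sketch below derives one common_prefix source per daily drop. The dt=YYYY-MM-DD partition layout, the function name daily_location, and the example prefixes are all assumptions made for this example; the actual layout of a landing zone will differ.

```python
from datetime import date


def daily_location(landing_prefix: str, destination_prefix: str, day: date) -> dict:
    """Describe one day's partition as a common_prefix import source.

    Assumes a hypothetical landing-zone layout of <prefix>dt=YYYY-MM-DD/.
    """
    partition = f"dt={day.isoformat()}/"
    return {
        "type": "common_prefix",
        "path": landing_prefix + partition,
        "destination": destination_prefix + partition,
    }


location = daily_location("s3://bucket/landing/", "events/", date(2024, 1, 15))
```

Because source and destination are set independently, the same helper also covers the data reorganization case: only destination_prefix needs to change.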

Theoretical Basis

The import source preparation pattern is grounded in the separation of data plane and metadata plane operations. In traditional data lake architectures, data ingestion requires physically moving or copying bytes. Zero-copy import decouples the two, so registering an object in lakeFS no longer requires moving its bytes:

TRADITIONAL IMPORT:
  external_storage --> [copy bytes] --> lakefs_storage --> [record metadata]

ZERO-COPY IMPORT:
  external_storage --> [record metadata only] --> lakefs_metadata_layer
                   (data stays in place)

The source preparation step defines a set of import locations, each described by a triple:

ImportLocation = (type, path, destination)

where:
  type        := common_prefix | object
  path        := URI pointing to external storage (s3://, gs://, https://*.blob.core.windows.net/)
  destination := relative path within the lakeFS branch
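
The triple above can be modeled as a small immutable record. This is a sketch for illustration only: the class below is not the lakeFS SDK type, and the validation rules are assumptions derived from the definitions above (a known type, and a destination that is relative to the branch root).

```python
from dataclasses import dataclass

VALID_TYPES = ("common_prefix", "object")


@dataclass(frozen=True)
class ImportLocation:
    """One (type, path, destination) triple, as described in the text."""

    type: str
    path: str
    destination: str

    def __post_init__(self) -> None:
        if self.type not in VALID_TYPES:
            raise ValueError(f"type must be one of {VALID_TYPES}, got {self.type!r}")
        # The destination is relative to the branch, so it has no leading slash.
        if self.destination.startswith("/"):
            raise ValueError("destination must be a relative path")


prefix_source = ImportLocation("common_prefix", "s3://bucket/raw/2024/", "imported/2024/")
```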

For a common_prefix source, the import process enumerates all objects under the given path prefix and maps them to the destination by replacing the source prefix with the destination prefix:

Given:
  source path  = "s3://bucket/raw/2024/"
  destination  = "imported/2024/"

Then:
  "s3://bucket/raw/2024/file_a.parquet"  -->  "imported/2024/file_a.parquet"
  "s3://bucket/raw/2024/sub/file_b.csv"  -->  "imported/2024/sub/file_b.csv"

For an object source, the mapping is one-to-one: the single external object is placed at the exact destination path specified.
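
Both mapping rules can be expressed as one small function. The sketch below only reproduces the path arithmetic described above; the function name resolve_destination is hypothetical, and the real import process performs this mapping server-side while enumerating objects.

```python
def resolve_destination(
    source_type: str, source_path: str, destination: str, object_uri: str
) -> str:
    """Compute where an external object lands inside the lakeFS branch."""
    if source_type == "object":
        # One-to-one: the object is placed at exactly the destination path.
        if object_uri != source_path:
            raise ValueError("an object source maps only its own path")
        return destination
    if source_type == "common_prefix":
        if not object_uri.startswith(source_path):
            raise ValueError(f"{object_uri!r} is not under {source_path!r}")
        # Replace the source prefix with the destination prefix.
        return destination + object_uri[len(source_path):]
    raise ValueError(f"unknown source type: {source_type!r}")


mapped = resolve_destination(
    "common_prefix",
    "s3://bucket/raw/2024/",
    "imported/2024/",
    "s3://bucket/raw/2024/sub/file_b.csv",
)
```

Here mapped evaluates to "imported/2024/sub/file_b.csv", matching the second line of the worked example.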

Multiple import locations can be batched into a single import operation, allowing complex data layouts to be constructed atomically within a single commit.
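
Batching can be sketched as assembling every location into one request body. The shape below (a paths list plus a commit message) is modeled loosely on the lakeFS import API and should be checked against the API reference for the installed version; the function name build_import_request and the duplicate-destination check are assumptions added for illustration.

```python
import json


def build_import_request(locations: list[dict], commit_message: str) -> dict:
    """Combine several import locations into a single request body."""
    seen = set()
    paths = []
    for loc in locations:
        # Assumed safeguard: two sources writing to the same destination
        # would make the resulting layout ambiguous.
        if loc["destination"] in seen:
            raise ValueError(f"duplicate destination: {loc['destination']!r}")
        seen.add(loc["destination"])
        paths.append(
            {
                "type": loc["type"],
                "path": loc["path"],
                "destination": loc["destination"],
            }
        )
    return {"paths": paths, "commit": {"message": commit_message}}


request = build_import_request(
    [
        {"type": "common_prefix", "path": "s3://bucket/raw/2024/", "destination": "imported/2024/"},
        {"type": "object", "path": "s3://bucket/data/file.parquet", "destination": "lookup/file.parquet"},
    ],
    "Import 2024 raw data and lookup table",
)
body = json.dumps(request, indent=2)
```

Since all locations travel in one request, the resulting objects become visible together in one commit rather than one commit per source.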
