Principle:Treeverse LakeFS Import Source Preparation
| Knowledge Sources | |
|---|---|
| Domains | Data_Import, Data_Engineering |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
Import source preparation is the process of defining external object storage locations and mapping them to destination paths within a lakeFS repository for zero-copy data ingestion.
Description
When working with large-scale data lakes, organizations frequently need to bring existing data stored in external object storage (Amazon S3, Google Cloud Storage, or Azure Blob Storage) under version control. Rather than physically copying data into lakeFS-managed storage, lakeFS supports a zero-copy import mechanism: only the metadata (object paths, sizes, checksums) is imported, while the actual data remains in its original location.
Import source preparation is the critical first step in this workflow. It involves:
- Identifying the external storage locations that contain the data to import
- Choosing the appropriate source type for each location:
  - common_prefix -- imports an entire directory tree from a given path prefix (e.g., s3://bucket/data/ imports all objects under that prefix)
  - object -- imports a single, specific file (e.g., s3://bucket/data/file.parquet)
- Mapping each source to a destination path within the lakeFS branch, which determines where the imported objects will appear in the repository's namespace
The source path must match the blockstore type configured for the lakeFS installation. For example, if lakeFS is configured to use S3 as its backing store, import paths must use the s3:// scheme.
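The scheme requirement can be checked before an import is submitted. The following is a minimal sketch, not lakeFS code: the function name `path_matches_blockstore` and the blockstore labels `"s3"`, `"gs"`, and `"azure"` are assumptions chosen for illustration.

```python
from urllib.parse import urlparse

# Assumed mapping from blockstore type labels to the URI scheme each expects.
# The labels are illustrative; check the names used by your installation.
EXPECTED_SCHEME = {"s3": "s3", "gs": "gs", "azure": "https"}


def path_matches_blockstore(path: str, blockstore_type: str) -> bool:
    """Return True if the import path uses the scheme the blockstore expects."""
    expected = EXPECTED_SCHEME.get(blockstore_type)
    if expected is None:
        return False  # unknown blockstore type: reject rather than guess
    parsed = urlparse(path)
    if parsed.scheme != expected:
        return False
    if blockstore_type == "azure":
        # Azure Blob Storage paths are HTTPS URLs on a blob.core.windows.net host.
        return (parsed.hostname or "").endswith(".blob.core.windows.net")
    return bool(parsed.netloc)  # a bucket name must be present
```

Running such a check on every source path up front surfaces a scheme mismatch before any import request is sent.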
Usage
Use import source preparation when:
- Onboarding existing data -- Bringing legacy datasets stored in cloud object storage under lakeFS version control for the first time
- Incremental ingestion -- Periodically importing new data partitions (e.g., daily data drops) from an external landing zone into a versioned branch
- Multi-source aggregation -- Combining data from multiple storage prefixes or individual files into a single lakeFS namespace for unified querying
- Data reorganization -- Remapping the directory structure of external data by specifying different destination prefixes than the source paths
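The incremental ingestion and data reorganization cases can be combined in a small helper that produces one common_prefix source per daily partition. This is a sketch under stated assumptions: the helper name `daily_partition_locations`, the `YYYY/MM/DD/` partition layout, and the example prefixes are hypothetical and not part of lakeFS.

```python
from datetime import date, timedelta


def daily_partition_locations(
    landing_prefix: str, dest_prefix: str, start: date, days: int
) -> list[dict]:
    """Build one common_prefix source per day, remapped under dest_prefix."""
    locations = []
    for offset in range(days):
        day = start + timedelta(days=offset)
        partition = day.strftime("%Y/%m/%d/")  # assumed partition layout
        locations.append(
            {
                "type": "common_prefix",
                "path": landing_prefix + partition,
                "destination": dest_prefix + partition,
            }
        )
    return locations


# Two daily drops from a landing zone, remapped under "ingested/".
locations = daily_partition_locations(
    "s3://bucket/landing/", "ingested/", date(2024, 1, 31), 2
)
```

Because the destination prefix is chosen independently of the source prefix, the same helper covers remapping the external layout into a different directory structure.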
Theoretical Basis
The import source preparation pattern is grounded in the separation of data plane and metadata plane operations. In traditional data lake architectures, data ingestion requires physically moving or copying bytes. Zero-copy import decouples these concerns:
TRADITIONAL IMPORT:
external_storage --> [copy bytes] --> lakefs_storage --> [record metadata]
ZERO-COPY IMPORT:
external_storage --> [record metadata only] --> lakefs_metadata_layer
(data stays in place)
The source preparation step defines a set of import locations, each described by a triple:
ImportLocation = (type, path, destination)
where:
type := common_prefix | object
path := URI pointing to external storage (s3://, gs://, https://*.blob.core.windows.net/)
destination := relative path within the lakeFS branch
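The triple can be modeled as a small validated record. The sketch below is illustrative only: the class mirrors the definition above rather than any lakeFS client type, and treating a leading slash as a non-relative destination is an assumption of this example.

```python
from dataclasses import dataclass

VALID_TYPES = ("common_prefix", "object")


@dataclass(frozen=True)
class ImportLocation:
    """One import source: (type, path, destination)."""

    type: str         # "common_prefix" or "object"
    path: str         # URI pointing to external storage
    destination: str  # relative path within the lakeFS branch

    def __post_init__(self) -> None:
        if self.type not in VALID_TYPES:
            raise ValueError(f"unknown source type: {self.type!r}")
        if self.destination.startswith("/"):
            # Assumption: a relative destination has no leading slash.
            raise ValueError("destination must be relative to the branch root")


location = ImportLocation("common_prefix", "s3://bucket/raw/2024/", "imported/2024/")
```

Validating at construction time keeps malformed sources out of the batch before any request is built.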
For a common_prefix source, the import process enumerates all objects under the given path prefix and maps them to the destination by replacing the source prefix with the destination prefix:
Given:
source path = "s3://bucket/raw/2024/"
destination = "imported/2024/"
Then:
"s3://bucket/raw/2024/file_a.parquet" --> "imported/2024/file_a.parquet"
"s3://bucket/raw/2024/sub/file_b.csv" --> "imported/2024/sub/file_b.csv"
For an object source, the mapping is one-to-one: the single external object is placed at the exact destination path specified.
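Both mapping rules can be expressed as a single function. This is a minimal sketch of the rules described above, not the lakeFS implementation; the function name `resolve_destination` and its parameters are hypothetical.

```python
def resolve_destination(
    source_type: str, object_uri: str, source_path: str, destination: str
) -> str:
    """Return the path inside the branch for one external object."""
    if source_type == "object":
        # One-to-one: the object lands exactly at the destination path.
        return destination
    if source_type == "common_prefix":
        if not object_uri.startswith(source_path):
            raise ValueError(f"{object_uri!r} is not under {source_path!r}")
        # Replace the source prefix with the destination prefix.
        return destination + object_uri[len(source_path):]
    raise ValueError(f"unknown source type: {source_type!r}")


mapped = resolve_destination(
    "common_prefix",
    "s3://bucket/raw/2024/sub/file_b.csv",
    "s3://bucket/raw/2024/",
    "imported/2024/",
)
```

With the inputs from the example above, `mapped` is `"imported/2024/sub/file_b.csv"`. Note that the result depends on whether the source and destination prefixes both end with a slash, so the two should be written consistently.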
Multiple import locations can be batched into a single import operation, allowing complex data layouts to be constructed atomically within a single commit.
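A batched request might be assembled as follows. This sketch only builds the request body as a plain dictionary and sends nothing; the field names `paths`, `commit`, and `message` are assumptions modeled on the lakeFS import API and should be verified against the API reference for the installed version. The helper name `build_import_request` is hypothetical.

```python
import json

REQUIRED_FIELDS = {"type", "path", "destination"}


def build_import_request(locations: list[dict], commit_message: str) -> dict:
    """Combine several import locations into one request body."""
    if not locations:
        raise ValueError("at least one import location is required")
    for loc in locations:
        missing = REQUIRED_FIELDS - set(loc)
        if missing:
            raise ValueError(f"location is missing fields: {sorted(missing)}")
    # Assumed body shape: a list of sources plus the commit to create.
    return {"paths": list(locations), "commit": {"message": commit_message}}


request_body = build_import_request(
    [
        {
            "type": "common_prefix",
            "path": "s3://bucket/raw/2024/",
            "destination": "imported/2024/",
        },
        {
            "type": "object",
            "path": "s3://bucket/data/file.parquet",
            "destination": "imported/file.parquet",
        },
    ],
    commit_message="Import 2024 raw data and one reference file",
)
print(json.dumps(request_body, indent=2))
```

Here a directory tree and a single file are listed in one request, so both appear in the branch as part of the same commit.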