Principle:Treeverse LakeFS Import Initiation
| Knowledge Sources | |
|---|---|
| Domains | Data_Import, Data_Engineering |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
Import initiation is the act of triggering an asynchronous zero-copy data import from external object storage into a lakeFS branch, receiving a job identifier for subsequent progress tracking.
Description
In data engineering workflows, importing large volumes of data from external storage into a versioned data lake is rarely practical to do synchronously. A single import operation may need to catalog millions of objects, build internal metadata structures (metaranges), and create a commit -- a process that can take minutes or even hours for very large datasets.
The import initiation principle addresses this by adopting an asynchronous job submission pattern:
- The client sends a request describing what to import (source locations, destination paths, commit metadata)
- The server validates the request, enqueues the import job, and immediately returns a unique job identifier
- The client uses this identifier to poll for progress and eventual completion
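As a minimal sketch of what such a request might contain, the helper below assembles a JSON-serializable body from source locations, destination paths, and commit metadata. The `paths`, `path`, `destination`, `type`, and `force` names follow the terms used elsewhere on this page. The `commit`, `message`, and `metadata` field names, the function name `build_import_request`, and the example bucket are assumptions made for illustration, not a confirmed API schema.

```python
import json
from typing import Optional


def build_import_request(
    locations: list,
    commit_message: str,
    commit_metadata: Optional[dict] = None,
    force: bool = False,
) -> dict:
    """Assemble an import request body (field layout is an assumption)."""
    if not locations:
        raise ValueError("at least one source location is required")
    for loc in locations:
        # The page names two location types: common_prefix and object.
        if loc.get("type") not in ("common_prefix", "object"):
            raise ValueError(f"unknown location type: {loc.get('type')!r}")
    return {
        "paths": [
            {"path": loc["path"], "destination": loc["destination"], "type": loc["type"]}
            for loc in locations
        ],
        # Hypothetical commit block: message plus free-form key/value metadata.
        "commit": {"message": commit_message, "metadata": commit_metadata or {}},
        "force": force,
    }


body = build_import_request(
    [{"path": "s3://example-bucket/raw/", "destination": "raw/", "type": "common_prefix"}],
    commit_message="Import raw landing data",
    commit_metadata={"source": "landing-zone"},
)
print(json.dumps(body, indent=2))
```

Keeping request construction separate from the HTTP call makes the payload easy to validate and log before anything is submitted.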
This pattern provides several key benefits:
- Non-blocking operation -- The client is not held waiting on a potentially long HTTP connection. Network timeouts, load balancer limits, and client-side resource constraints are avoided.
- Scalability -- The server can queue and process import jobs at its own pace, throttling resource usage for concurrent imports.
- Atomicity -- The entire import is committed as a single atomic operation. Either all objects are successfully cataloged and a commit is created, or the import fails without partial state.
- Idempotency considerations -- The `force` flag allows overriding uncommitted changes on the target branch, giving the client control over conflict resolution.
The import creates a new commit on the target branch containing metadata entries for all imported objects. The actual data bytes are never copied; lakeFS records pointers to the original external storage locations.
Usage
Use import initiation when:
- Starting a data onboarding pipeline -- Triggering the initial import of external datasets into lakeFS as part of an automated ETL/ELT workflow
- Scheduled batch imports -- Running periodic (hourly, daily) imports of new data partitions from a landing zone
- Programmatic data ingestion -- Integrating lakeFS import into scripts, Airflow DAGs, or CI/CD pipelines that need to kick off imports and later check results
- Large-scale catalog operations -- Importing datasets with millions of objects where synchronous approaches would time out
Theoretical Basis
The import initiation pattern follows the well-established asynchronous job submission model used in distributed systems:
SYNCHRONOUS (blocking) pattern:
Client ---[request]--> Server ---[process...long wait...]--> Client
Problem: timeouts, resource holding, no progress visibility
ASYNCHRONOUS (non-blocking) pattern:
Client ---[submit job]--> Server ---[job_id]--> Client (immediate)
Client ---[poll status]--> Server ---[progress]--> Client (repeated)
Client ---[poll status]--> Server ---[result]--> Client (final)
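The polling half of the asynchronous pattern can be sketched as a small loop. This is an illustration under stated assumptions: `wait_for_job`, the injected `get_status` callable, and the status fields `completed`, `error`, and `ingested_objects` are hypothetical names chosen for the example. A real client would replace the fake status source with an HTTP call to the server.

```python
import time
from typing import Callable


def wait_for_job(
    get_status: Callable[[str], dict],
    job_id: str,
    interval_s: float = 1.0,
    timeout_s: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll get_status(job_id) until the job completes, fails, or times out."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = get_status(job_id)
        if status.get("error"):
            raise RuntimeError(f"import {job_id} failed: {status['error']}")
        if status.get("completed"):
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"import {job_id} did not finish within {timeout_s}s")
        sleep(interval_s)


# Fake status source standing in for the server: two progress reports, then done.
_responses = iter([
    {"completed": False, "ingested_objects": 100},
    {"completed": False, "ingested_objects": 250},
    {"completed": True, "ingested_objects": 300},
])

final = wait_for_job(lambda job_id: next(_responses), "job-123", sleep=lambda s: None)
print(final)
```

Injecting `get_status` and `sleep` keeps the loop testable without a network connection or real delays.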
The import operation follows a state machine:
SUBMITTED --> IN_PROGRESS --> COMPLETED
                  |
                  +--> FAILED
                  |
                  +--> CANCELED
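One way to make the state machine explicit in client or test code is a transition table. The sketch below assumes that FAILED and CANCELED are reached from IN_PROGRESS. That is one reading of the diagram, and a real server might also allow cancelling a job that has not started. The names `ImportState`, `TRANSITIONS`, `advance`, and `is_terminal` are illustrative.

```python
from enum import Enum


class ImportState(Enum):
    SUBMITTED = "submitted"
    IN_PROGRESS = "in_progress"
    COMPLETED = "completed"
    FAILED = "failed"
    CANCELED = "canceled"


# Allowed transitions; terminal states map to an empty set.
TRANSITIONS = {
    ImportState.SUBMITTED: {ImportState.IN_PROGRESS},
    ImportState.IN_PROGRESS: {
        ImportState.COMPLETED,
        ImportState.FAILED,
        ImportState.CANCELED,
    },
    ImportState.COMPLETED: set(),
    ImportState.FAILED: set(),
    ImportState.CANCELED: set(),
}


def advance(current: ImportState, target: ImportState) -> ImportState:
    """Return target if the transition is allowed, otherwise raise ValueError."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition: {current.name} -> {target.name}")
    return target


def is_terminal(state: ImportState) -> bool:
    """A state is terminal when no further transitions are allowed."""
    return not TRANSITIONS[state]


state = advance(ImportState.SUBMITTED, ImportState.IN_PROGRESS)
state = advance(state, ImportState.COMPLETED)
print(state.name, is_terminal(state))
```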
Upon initiation, the server:
- Validates the import request (checks that source paths are well-formed, branch exists, blockstore type matches)
- Checks for conflicts -- if the branch has uncommitted changes and `force` is not set, the operation is rejected
- Enqueues the job and assigns a unique identifier
- Returns the identifier immediately with HTTP status 202 (Accepted)
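The initiation steps above can be sketched with an in-memory stand-in for the server. Everything here is hypothetical scaffolding: `Branch`, `ImportServer`, and `initiate` are made-up names, the validation is reduced to a URI-scheme check, and the 400, 404, and 409 rejection codes are illustrative choices, since this page only states that a successful initiation returns 202.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class Branch:
    name: str
    has_uncommitted_changes: bool = False


@dataclass
class ImportServer:
    branches: dict
    blockstore_scheme: str = "s3"  # assumed blockstore type for the example
    queue: list = field(default_factory=list)

    def initiate(self, branch_name: str, request: dict) -> tuple:
        """Validate, check for conflicts, enqueue, and return (status, body)."""
        branch = self.branches.get(branch_name)
        if branch is None:
            return 404, {"message": "branch not found"}
        paths = request.get("paths") or []
        if not paths:
            return 400, {"message": "no source paths given"}
        for loc in paths:
            # Simplified well-formedness check: the scheme must match the blockstore.
            if not loc["path"].startswith(f"{self.blockstore_scheme}://"):
                return 400, {"message": f"unsupported source: {loc['path']}"}
        if branch.has_uncommitted_changes and not request.get("force", False):
            return 409, {"message": "branch has uncommitted changes"}
        job_id = uuid.uuid4().hex
        self.queue.append((job_id, branch_name, request))
        # Accepted for processing; the actual work happens later.
        return 202, {"id": job_id}


server = ImportServer(branches={
    "main": Branch("main"),
    "dev": Branch("dev", has_uncommitted_changes=True),
})
request = {"paths": [
    {"path": "s3://example-bucket/raw/", "destination": "raw/", "type": "common_prefix"},
]}

accepted = server.initiate("main", request)
rejected = server.initiate("dev", request)
forced = server.initiate("dev", {**request, "force": True})
print(accepted[0], rejected[0], forced[0])
```

Note that `initiate` returns as soon as the job is queued. None of the cataloging work runs inside the request, which is what keeps the response fast.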
The server then processes the job asynchronously:
for each ImportLocation in request.paths:
    if location.type == common_prefix:
        enumerate all objects under location.path
        for each object:
            record metadata entry at (destination + relative_suffix)
    else if location.type == object:
        record metadata entry at destination
build metarange from all metadata entries
create commit on target branch with metarange
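The path-mapping part of this pseudocode can be written as a short Python sketch. It covers only how destination paths are derived and how each entry points back to its source object. Building the metarange and creating the commit are omitted. The names `plan_entries`, `list_objects`, and `STORE`, along with the bucket contents, are hypothetical.

```python
from typing import Callable, Iterable


def plan_entries(locations: list, list_objects: Callable[[str], Iterable[str]]) -> dict:
    """Map each destination path to the external URI it points at (no bytes copied)."""
    entries = {}
    for loc in locations:
        if loc["type"] == "common_prefix":
            for source_uri in list_objects(loc["path"]):
                # The part of the URI after the prefix is kept under the destination.
                relative_suffix = source_uri[len(loc["path"]):]
                entries[loc["destination"] + relative_suffix] = source_uri
        elif loc["type"] == "object":
            entries[loc["destination"]] = loc["path"]
        else:
            raise ValueError(f"unknown location type: {loc['type']!r}")
    # Sorted by destination path for a deterministic result.
    return dict(sorted(entries.items()))


# Fake external store standing in for an object-storage listing call.
STORE = [
    "s3://example-bucket/raw/2024/a.parquet",
    "s3://example-bucket/raw/2024/b.parquet",
    "s3://example-bucket/other/c.csv",
]


def list_objects(prefix: str) -> list:
    return [uri for uri in STORE if uri.startswith(prefix)]


entries = plan_entries(
    [
        {"type": "common_prefix", "path": "s3://example-bucket/raw/", "destination": "datasets/raw/"},
        {"type": "object", "path": "s3://example-bucket/other/c.csv", "destination": "datasets/c.csv"},
    ],
    list_objects,
)
for destination, source in entries.items():
    print(destination, "->", source)
```

Each value in `entries` is a pointer to the original object rather than a copy of its data, which mirrors the zero-copy behavior described on this page.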
The 202 status code is semantically important: it indicates that the request has been accepted for processing but processing is not yet complete, distinguishing it from a 200 (success) or 201 (created) response.