
Principle:Treeverse LakeFS Import Initiation

From Leeroopedia


Knowledge Sources
Domains Data_Import, Data_Engineering
Last Updated 2026-02-08 00:00 GMT

Overview

Import initiation is the act of triggering an asynchronous, zero-copy import of data from external object storage into a lakeFS branch. The caller receives a job identifier that it uses to track the job's progress and eventual completion.

Description

In data engineering workflows, importing large volumes of data from external storage into a versioned data lake cannot be done synchronously. A single import operation may need to catalog millions of objects, build internal metadata structures (metaranges), and create a commit -- a process that can take minutes or even hours for very large datasets.

The import initiation principle addresses this by adopting an asynchronous job submission pattern:

  1. The client sends a request describing what to import (source locations, destination paths, commit metadata)
  2. The server validates the request, enqueues the import job, and immediately returns a unique job identifier
  3. The client uses this identifier to poll for progress and eventual completion
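
The submit-then-poll exchange above can be sketched with an in-memory stand-in for the server. The class and method names here are illustrative for this sketch only; they are not part of the lakeFS API.

```python
import itertools
import uuid

class ImportJobServer:
    """In-memory stand-in for a server that accepts asynchronous import jobs."""

    def __init__(self):
        self._jobs = {}  # job_id -> iterator of status snapshots

    def submit(self, paths, destination):
        """Validate, enqueue, and immediately return a unique job identifier."""
        job_id = uuid.uuid4().hex
        # Simulate progress: report IN_PROGRESS once, then COMPLETED forever.
        self._jobs[job_id] = itertools.chain(
            [{"state": "IN_PROGRESS", "ingested": 0}],
            itertools.repeat({"state": "COMPLETED", "ingested": len(paths)}),
        )
        return job_id

    def status(self, job_id):
        """Return the current status snapshot for a previously submitted job."""
        return next(self._jobs[job_id])

server = ImportJobServer()
job_id = server.submit(paths=["s3://bucket/raw/"], destination="datasets/raw/")

# The client polls with the identifier instead of holding a connection open.
while True:
    snapshot = server.status(job_id)
    if snapshot["state"] == "COMPLETED":
        break
```

A real client would sleep between polls; the loop terminates as soon as the job reports a terminal state.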

This pattern provides several key benefits:

  • Non-blocking operation -- The client is not held waiting on a potentially long HTTP connection. Network timeouts, load balancer limits, and client-side resource constraints are avoided.
  • Scalability -- The server can queue and process import jobs at its own pace, throttling resource usage for concurrent imports.
  • Atomicity -- The entire import is committed as a single atomic operation. Either all objects are successfully cataloged and a commit is created, or the import fails without partial state.
  • Conflict control -- The force flag allows overriding uncommitted changes on the target branch, giving the client explicit control over how conflicts are resolved.

The import creates a new commit on the target branch containing metadata entries for all imported objects. The actual data bytes are never copied; lakeFS records pointers to the original external storage locations.
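
The zero-copy property can be illustrated with a minimal metadata record: the import stores a small pointer per object, never the bytes. The field names below are assumptions of this sketch, not the actual lakeFS entry schema.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MetadataEntry:
    """Illustrative catalog entry: a pointer, not a copy of the data."""
    logical_path: str      # path of the object inside the lakeFS branch
    physical_address: str  # original location in external object storage
    size_bytes: int
    checksum: str

entry = MetadataEntry(
    logical_path="datasets/raw/part-0001.parquet",
    physical_address="s3://external-bucket/landing/part-0001.parquet",
    size_bytes=104_857_600,
    checksum="d41d8cd9",
)
# Cataloging a 100 MB object costs one small record, not 100 MB of copied I/O.
```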

Usage

Use import initiation when:

  • Starting a data onboarding pipeline -- Triggering the initial import of external datasets into lakeFS as part of an automated ETL/ELT workflow
  • Scheduled batch imports -- Running periodic (hourly, daily) imports of new data partitions from a landing zone
  • Programmatic data ingestion -- Integrating lakeFS import into scripts, Airflow DAGs, or CI/CD pipelines that need to kick off imports and later check results
  • Large-scale catalog operations -- Importing datasets with millions of objects where synchronous approaches would time out

Theoretical Basis

The import initiation pattern follows the well-established asynchronous job submission model used in distributed systems:

SYNCHRONOUS (blocking) pattern:
  Client ---[request]--> Server ---[process...long wait...]--> Client
  Problem: timeouts, resource holding, no progress visibility

ASYNCHRONOUS (non-blocking) pattern:
  Client ---[submit job]--> Server ---[job_id]--> Client    (immediate)
  Client ---[poll status]--> Server ---[progress]--> Client  (repeated)
  Client ---[poll status]--> Server ---[result]--> Client    (final)

The import operation follows a state machine:

  SUBMITTED --> IN_PROGRESS --> COMPLETED
                    |
                    +--> FAILED
                    |
                    +--> CANCELED
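
The state machine above can be encoded as an explicit transition table. The state names mirror the diagram; they are not taken from any lakeFS enum.

```python
# Which states each state may legally move to; empty set means terminal.
TRANSITIONS = {
    "SUBMITTED": {"IN_PROGRESS"},
    "IN_PROGRESS": {"COMPLETED", "FAILED", "CANCELED"},
    "COMPLETED": set(),
    "FAILED": set(),
    "CANCELED": set(),
}

def advance(state, next_state):
    """Move the job to next_state, rejecting illegal transitions."""
    if next_state not in TRANSITIONS[state]:
        raise ValueError(f"illegal transition {state} -> {next_state}")
    return next_state

def is_terminal(state):
    """A terminal state has no outgoing transitions; polling can stop here."""
    return not TRANSITIONS[state]
```

A client poll loop stops exactly when is_terminal returns True for the reported state.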

Upon initiation, the server:

  1. Validates the import request (checks that source paths are well-formed, branch exists, blockstore type matches)
  2. Checks for conflicts -- if the branch has uncommitted changes and force is not set, the operation is rejected
  3. Enqueues the job and assigns a unique identifier
  4. Returns the identifier immediately with HTTP status 202 (Accepted)
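
The four initiation steps can be sketched as a single function. The request shapes and the exact error status codes here are illustrative assumptions of this sketch, not a transcription of the lakeFS API.

```python
import uuid

def initiate_import(branch, paths, force=False):
    """Validate, check for conflicts, enqueue, and return immediately."""
    # 1. Validate the request: source paths must be well-formed URIs.
    for location in paths:
        if not location["path"].startswith(("s3://", "gs://", "azure://")):
            return 400, {"message": f"malformed source path: {location['path']}"}
    # 2. Reject if the branch is dirty and force is not set.
    if branch["has_uncommitted_changes"] and not force:
        return 409, {"message": "branch has uncommitted changes; pass force to override"}
    # 3./4. Enqueue the job and return its identifier with 202 (Accepted).
    job_id = uuid.uuid4().hex
    return 202, {"id": job_id}

status, body = initiate_import(
    branch={"has_uncommitted_changes": False},
    paths=[{"path": "s3://bucket/landing/"}],
)
```

Note that the happy path returns before any object has been cataloged; all real work happens after the 202 response.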

The server then processes the job asynchronously:

for each ImportLocation in request.paths:
    if location.type == common_prefix:
        enumerate all objects under location.path
        for each object:
            record metadata entry at (destination + relative_suffix)
    else if location.type == object:
        record metadata entry at destination

build metarange from all metadata entries
create commit on target branch with metarange
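
The server-side pseudocode above translates directly into runnable Python. The in-memory EXTERNAL dictionary stands in for an object-store list operation and is an assumption of this sketch; a real server would page through the storage API and then build a metarange and commit from the resulting entries.

```python
# Simulated external object store: key -> object size.
EXTERNAL = {
    "s3://bucket/landing/a.csv": 10,
    "s3://bucket/landing/sub/b.csv": 20,
    "s3://bucket/single/c.csv": 30,
}

def process_import(paths):
    """Walk each ImportLocation and record metadata entries, as in the pseudocode."""
    entries = {}  # destination path -> physical address
    for location in paths:
        if location["type"] == "common_prefix":
            # Enumerate all objects under the prefix, preserving the suffix.
            for key in EXTERNAL:
                if key.startswith(location["path"]):
                    suffix = key[len(location["path"]):]
                    entries[location["destination"] + suffix] = key
        else:  # location["type"] == "object"
            entries[location["destination"]] = location["path"]
    # A metarange would be built from `entries` and committed to the branch here.
    return entries

entries = process_import([
    {"type": "common_prefix", "path": "s3://bucket/landing/", "destination": "raw/"},
    {"type": "object", "path": "s3://bucket/single/c.csv", "destination": "raw/c.csv"},
])
```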

The 202 status code is semantically important: it indicates that the request has been accepted for processing but that processing is not yet complete, distinguishing it from 200 (OK) and 201 (Created).

Related Pages

Implemented By
