Principle:Treeverse LakeFS Import Progress Monitoring


Knowledge Sources
Domains Data_Import, Data_Engineering
Last Updated 2026-02-08 00:00 GMT

Overview

Import progress monitoring is a polling-based pattern for tracking the status, progress, and eventual outcome of long-running asynchronous import operations.

Description

When a data import is initiated in lakeFS, the operation runs asynchronously on the server. Since imports can process millions of objects and take minutes to hours to complete, clients need a reliable mechanism to:

  • Determine whether the import is still running or has finished
  • Observe progress in terms of the number of objects ingested so far
  • Retrieve the final result -- either the commit reference created by a successful import, or the error details from a failed one
  • Detect stalls by comparing update timestamps between successive polls

The monitoring pattern is based on client-side polling: the client repeatedly queries the server for the current status of a specific import job, identified by its unique job ID. This approach is preferred over server-push mechanisms (such as WebSockets or webhooks) because:

  • It is stateless on the server side -- no long-lived connections to manage
  • It works across all network topologies, including those with proxies, firewalls, or load balancers that may terminate long-lived connections
  • It is simple to implement in any programming language or automation tool
  • It provides natural backpressure -- the client controls the polling frequency, and therefore the load it places on the server

Usage

Use import progress monitoring when:

  • Waiting for import completion in an automated pipeline (e.g., an Airflow task that must block until data is available)
  • Displaying progress to users in a CLI tool or web interface showing the number of objects processed
  • Implementing timeouts -- if the import has not completed within an expected window, the client can cancel it
  • Logging and auditing -- recording the progress trajectory of import operations for operational dashboards

Theoretical Basis

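The polling-based monitoring pattern is a fixed-interval polling loop on the client side: the client sleeps for polling_interval between status queries rather than spinning, and exits at the first terminal status: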

POLL-BASED MONITORING LOOP:

  job_id = initiate_import(...)
  previous_update_time = null

  loop:
      wait(polling_interval)
      status = get_import_status(job_id)

      if status.error is not null:
          handle_failure(status.error)
          break

      if status.update_time == previous_update_time:
          log_warning("Import may be stalled")

      previous_update_time = status.update_time
      log_progress(status.ingested_objects)

      if status.completed:
          commit_ref = status.commit
          handle_success(commit_ref)
          break
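
The loop above can be sketched in Python as follows. This is a minimal illustration, not the lakeFS client API: get_status is a hypothetical caller-supplied function that returns the current status as a dictionary with the fields described in this section, and ImportFailed, ImportTimedOut, on_progress, and on_stall are names introduced only for this example. The optional timeout is an addition to the pseudocode, covering the timeout use case listed under Usage.

```python
import time
from typing import Any, Callable, Mapping, Optional


class ImportFailed(Exception):
    """Raised when the polled status reports an error."""


class ImportTimedOut(Exception):
    """Raised when the import is still running after the allowed window."""


def wait_for_import(
    get_status: Callable[[], Mapping[str, Any]],  # hypothetical status fetcher
    polling_interval: float = 2.0,
    timeout: Optional[float] = None,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
    on_progress: Callable[[int], None] = lambda count: None,
    on_stall: Callable[[], None] = lambda: None,
) -> Any:
    """Poll get_status() until the import finishes and return its commit."""
    deadline = None if timeout is None else clock() + timeout
    previous_update_time = None

    while True:
        sleep(polling_interval)
        status = get_status()

        # A reported error ends the loop, whatever the other fields say.
        error = status.get("error")
        if error is not None:
            raise ImportFailed(error)

        # An unchanged update_time between two polls hints at a stall.
        update_time = status.get("update_time")
        if previous_update_time is not None and update_time == previous_update_time:
            on_stall()
        previous_update_time = update_time
        on_progress(status.get("ingested_objects", 0))

        if status.get("completed"):
            return status.get("commit")

        if deadline is not None and clock() >= deadline:
            raise ImportTimedOut(f"import still running after {timeout} seconds")
```

Injecting sleep and clock keeps the loop testable without real delays; in a pipeline the defaults would normally be left as they are, and the caller decides whether to cancel the import when ImportTimedOut is raised.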

The status object returned by each poll contains the following key fields:

ImportStatus:
  completed        : boolean     -- true when the import has finished (success or failure)
  update_time      : timestamp   -- last time the status was updated; useful for detecting stalls
  ingested_objects : integer     -- count of objects processed so far (monotonically non-decreasing)
  metarange_id     : string      -- ID of the constructed metarange (set on completion)
  commit           : Commit      -- the resulting commit object (set on successful completion)
  error            : Error       -- error details (set on failure)
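
On the client side, the field list above could be mirrored by a small typed record. The sketch below assumes the status arrives as a JSON-like dictionary keyed by the same field names, that update_time is an ISO 8601 string, and that error, when present, is either a plain string or an object with a message key. These are assumptions made for illustration, and the from_dict helper is not part of any lakeFS SDK.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Any, Dict, Optional


@dataclass(frozen=True)
class ImportStatus:
    completed: bool
    update_time: datetime
    ingested_objects: int = 0
    metarange_id: Optional[str] = None
    commit: Optional[Dict[str, Any]] = None
    error: Optional[str] = None

    @classmethod
    def from_dict(cls, payload: Dict[str, Any]) -> "ImportStatus":
        """Build a status record from a decoded response body (assumed shape)."""
        # Normalize a trailing "Z" so older Python versions can parse the value.
        raw_time = payload["update_time"].replace("Z", "+00:00")

        raw_error = payload.get("error")
        if isinstance(raw_error, dict):
            error = raw_error.get("message")
        elif raw_error is not None:
            error = str(raw_error)
        else:
            error = None

        return cls(
            completed=bool(payload["completed"]),
            update_time=datetime.fromisoformat(raw_time),
            ingested_objects=int(payload.get("ingested_objects") or 0),
            metarange_id=payload.get("metarange_id"),
            commit=payload.get("commit"),
            error=error,
        )
```

Parsing update_time into a datetime lets successive polls be compared as timestamps rather than as raw strings.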

The monitoring loop observes the following state transitions:

  [IN_PROGRESS]  -- completed=false, ingested_objects increasing, update_time advancing
       |
       +---> [COMPLETED] -- completed=true, commit is set, error is null
       |
       +---> [FAILED]    -- completed=true, error is set, commit may be null
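
The three states in the diagram can be derived from a single status snapshot. The following sketch assumes the status is a dictionary with the fields listed earlier; ImportState and classify are illustrative names, not lakeFS identifiers. The error field is checked first, mirroring the order used in the monitoring loop.

```python
from enum import Enum
from typing import Any, Mapping


class ImportState(Enum):
    IN_PROGRESS = "in_progress"
    COMPLETED = "completed"
    FAILED = "failed"


def classify(status: Mapping[str, Any]) -> ImportState:
    """Map one polled status onto the states shown in the diagram."""
    # An error means failure, even if a commit or completed flag is also set.
    if status.get("error") is not None:
        return ImportState.FAILED
    if status.get("completed"):
        return ImportState.COMPLETED
    return ImportState.IN_PROGRESS
```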

Polling interval selection is an important design consideration. Polling too frequently wastes network bandwidth and server resources; polling too infrequently delays detection of completion. A reasonable default is 2-5 seconds; the lakeFS integration test suite (esti/import_test.go) polls every 2 seconds.

The update_time field provides a crucial liveness signal. If the update time does not change between successive polls, the import may be stalled due to a server-side issue, allowing the client to take corrective action (alert, cancel, retry).
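
One way to turn this liveness signal into a decision is to flag a stall only after update_time has stayed unchanged for some minimum duration, so that a single unchanged poll at a short interval does not trigger an alert. The sketch below is one possible approach under that assumption: StallDetector, its stall_after threshold, and the 60-second default are hypothetical choices for illustration, not values taken from lakeFS.

```python
import time
from typing import Any, Callable, Optional


class StallDetector:
    """Report a stall once update_time has not changed for stall_after seconds."""

    def __init__(
        self,
        stall_after: float = 60.0,  # hypothetical threshold, tune per deployment
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._stall_after = stall_after
        self._clock = clock
        self._last_value: Any = None
        self._last_change: Optional[float] = None

    def observe(self, update_time: Any) -> bool:
        """Record one polled update_time; return True if the import looks stalled."""
        now = self._clock()
        if self._last_change is None or update_time != self._last_value:
            # First observation, or the server reported fresh activity.
            self._last_value = update_time
            self._last_change = now
            return False
        return now - self._last_change >= self._stall_after
```

A caller would invoke observe(status.update_time) once per poll and, when it returns True, alert, cancel, or retry as appropriate.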
