Principle:Treeverse LakeFS Import Progress Monitoring
| Knowledge Sources | |
|---|---|
| Domains | Data_Import, Data_Engineering |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
Import progress monitoring is a polling-based pattern for tracking the status, progress, and eventual outcome of long-running asynchronous import operations.
Description
When a data import is initiated in lakeFS, the operation runs asynchronously on the server. Since imports can process millions of objects and take minutes to hours to complete, clients need a reliable mechanism to:
- Determine whether the import is still running or has finished
- Observe progress in terms of the number of objects ingested so far
- Retrieve the final result -- either the commit reference created by a successful import, or the error details from a failed one
- Detect stalls by comparing update timestamps between successive polls
The monitoring pattern is based on client-side polling: the client repeatedly queries the server for the current status of a specific import job, identified by its unique job ID. This approach is preferred over server-push mechanisms (such as WebSockets or webhooks) because:
- It is stateless on the server side -- no long-lived connections to manage
- It works across all network topologies, including those with proxies, firewalls, or load balancers that may terminate long-lived connections
- It is simple to implement in any programming language or automation tool
- It provides natural backpressure -- the client controls the polling frequency
Usage
Use import progress monitoring when:
- Waiting for import completion in an automated pipeline (e.g., an Airflow task that must block until data is available)
- Displaying progress to users in a CLI tool or web interface showing the number of objects processed
- Implementing timeouts -- if the import has not completed within an expected window, the client can cancel it
- Logging and auditing -- recording the progress trajectory of import operations for operational dashboards
Theoretical Basis
The polling-based monitoring pattern is a client-side sleep-and-poll loop with a fixed interval between requests:
    POLL-BASED MONITORING LOOP:

    job_id = initiate_import(...)
    previous_update_time = null

    loop:
        wait(polling_interval)
        status = get_import_status(job_id)

        if status.error is not null:
            handle_failure(status.error)
            break

        if status.update_time == previous_update_time:
            log_warning("Import may be stalled")
        previous_update_time = status.update_time

        log_progress(status.ingested_objects)

        if status.completed:
            commit_ref = status.commit
            handle_success(commit_ref)
            break
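The loop above can be sketched in Python as follows. This is a minimal illustration, not the lakeFS client implementation: `get_status` is a hypothetical callable that returns the status of one import job as a dictionary keyed by the field names used in this document, and the `timeout`, `sleep`, and `clock` parameters are additions made here so that the loop is bounded and testable.

```python
import logging
import time
from typing import Any, Callable, Mapping

logger = logging.getLogger(__name__)


class ImportFailed(Exception):
    """Raised when a polled status carries an error."""


def wait_for_import(
    get_status: Callable[[], Mapping[str, Any]],  # hypothetical status fetcher
    polling_interval: float = 2.0,
    timeout: float = 3600.0,                      # assumption: not in the pseudocode
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Any:
    """Poll until the import completes; return its commit or raise."""
    deadline = clock() + timeout
    previous_update_time = None
    while True:
        sleep(polling_interval)
        status = get_status()

        # A reported error ends the loop regardless of other fields.
        if status.get("error") is not None:
            raise ImportFailed(str(status["error"]))

        # An unchanged update_time hints that the server made no progress.
        update_time = status.get("update_time")
        if previous_update_time is not None and update_time == previous_update_time:
            logger.warning("Import may be stalled (update_time unchanged)")
        previous_update_time = update_time

        logger.info("Ingested objects: %s", status.get("ingested_objects", 0))

        if status.get("completed"):
            return status.get("commit")

        if clock() >= deadline:
            raise TimeoutError(f"import did not complete within {timeout} seconds")
```

In a pipeline, `get_status` would typically wrap whatever call retrieves the import status for a given job ID; a caller that hits `TimeoutError` can then decide whether to cancel the import.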
The status object returned by each poll contains the following key fields:
    ImportStatus:
        completed        : boolean   -- true when the import has finished (success or failure)
        update_time      : timestamp -- last time the status was updated; useful for detecting stalls
        ingested_objects : integer   -- count of objects processed so far (monotonically increasing)
        metarange_id     : string    -- ID of the constructed metarange (set on completion)
        commit           : Commit    -- the resulting commit object (set on successful completion)
        error            : Error     -- error details (set on failure)
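For clients that prefer typed access over raw dictionaries, the fields above could be modeled roughly as shown below. This is a sketch under stated assumptions: the class and its `from_dict` helper are hypothetical, `update_time` is assumed to arrive as an ISO 8601 string, and `commit` and `error` are kept as plain mappings because their inner structure is not described here.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Any, Mapping, Optional


@dataclass(frozen=True)
class ImportStatus:
    """Typed view of one polled import status (illustrative only)."""

    completed: bool
    update_time: datetime
    ingested_objects: int = 0
    metarange_id: Optional[str] = None
    commit: Optional[Mapping[str, Any]] = None
    error: Optional[Mapping[str, Any]] = None

    @classmethod
    def from_dict(cls, payload: Mapping[str, Any]) -> "ImportStatus":
        # Assumption: update_time is ISO 8601; a trailing "Z" is rewritten
        # because older Python versions do not parse it in fromisoformat().
        raw_time = str(payload["update_time"]).replace("Z", "+00:00")
        return cls(
            completed=bool(payload["completed"]),
            update_time=datetime.fromisoformat(raw_time),
            ingested_objects=int(payload.get("ingested_objects") or 0),
            metarange_id=payload.get("metarange_id"),
            commit=payload.get("commit"),
            error=payload.get("error"),
        )
```

Parsing `update_time` into a `datetime` makes comparisons between successive polls independent of string formatting.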
The monitoring loop observes the following state transitions:
    [IN_PROGRESS]  -- completed=false, ingested_objects increasing, update_time advancing
         |
         +---> [COMPLETED]  -- completed=true, commit is set, error is null
         |
         +---> [FAILED]     -- completed=true, error is set, commit may be null
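The three states can be derived from a polled status with a small classifier. The sketch below uses a hypothetical `ImportState` enum and `classify` function, and it assumes the status is a dictionary keyed by the field names listed earlier. The error is checked before the completion flag, mirroring the order used in the monitoring loop.

```python
from enum import Enum
from typing import Any, Mapping


class ImportState(Enum):
    IN_PROGRESS = "in_progress"
    COMPLETED = "completed"
    FAILED = "failed"


def classify(status: Mapping[str, Any]) -> ImportState:
    """Map a polled status onto one of the three observed states."""
    # Any reported error means failure, whatever the completed flag says.
    if status.get("error") is not None:
        return ImportState.FAILED
    if status.get("completed"):
        return ImportState.COMPLETED
    return ImportState.IN_PROGRESS
```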
Polling interval selection is an important design consideration. Polling too frequently wastes network bandwidth and server resources; polling too infrequently delays detection of completion. A reasonable default is 2-5 seconds; the lakeFS integration test suite (esti/import_test.go) polls every 2 seconds.
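The trade-off can be made concrete with a little arithmetic. The hypothetical helper below assumes a fixed interval and a client that waits one interval before its first request; it returns how many status requests a single import would generate and the upper bound on how long completion can go unnoticed.

```python
import math
from typing import Tuple


def polling_cost(import_duration_s: float, polling_interval_s: float) -> Tuple[int, float]:
    """Return (status requests issued, upper bound on detection delay in seconds)."""
    if polling_interval_s <= 0:
        raise ValueError("polling_interval_s must be positive")
    # Polls happen at interval, 2*interval, ...; the first one at or after
    # the moment the import finishes is the one that observes completion.
    requests = max(1, math.ceil(import_duration_s / polling_interval_s))
    return requests, polling_interval_s


for interval in (2, 5, 60):
    count, delay = polling_cost(30 * 60, interval)
    print(f"interval={interval}s -> {count} requests, up to {delay}s detection delay")
```

For a 30-minute import, a 2-second interval issues 900 requests, a 5-second interval issues 360, and a 60-second interval issues 30 at the cost of noticing completion up to a minute late.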
The update_time field provides a liveness signal. If it does not change between successive polls, the import may be stalled due to a server-side issue, and the client can take corrective action (alert, cancel, retry).
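One way to turn this signal into a decision is to flag a stall only after update_time has stayed unchanged for some grace period, rather than on the first repeated value. The sketch below is an illustration of that idea; the `StallDetector` class and its `stall_after_s` threshold are hypothetical and not part of lakeFS.

```python
import time
from typing import Any, Callable, Optional


class StallDetector:
    """Report a stall once update_time has not changed for stall_after_s seconds."""

    def __init__(self, stall_after_s: float, clock: Callable[[], float] = time.monotonic):
        self._stall_after_s = stall_after_s
        self._clock = clock
        self._last_update_time: Any = None
        self._last_change_at: Optional[float] = None

    def observe(self, update_time: Any) -> bool:
        """Record one polled update_time; return True if the import looks stalled."""
        now = self._clock()
        # First observation, or the server reported fresh progress: reset the timer.
        if self._last_change_at is None or update_time != self._last_update_time:
            self._last_update_time = update_time
            self._last_change_at = now
            return False
        return now - self._last_change_at >= self._stall_after_s
```

Calling `observe(status.update_time)` on every poll lets the client choose between alerting, cancelling, or retrying once it returns `True`. A suitable threshold depends on how often the server refreshes the status, which is not specified here.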