Heuristic:Mage ai Mage ai Sorted Data Bookmark Strategy

From Leeroopedia




Knowledge Sources
Domains Optimization, Data_Integration
Last Updated 2026-02-09 07:00 GMT

Overview

Emit state after every record when data is sorted on the bookmark column; when it is unsorted, defer state to the end of the batch, trading recovery granularity for lower I/O overhead.

Description

The `Source` base class assumes by default that data is sorted ascending on the bookmark column (`is_sorted=True`). When sorted, it emits a STATE message after every record, providing maximum recovery granularity. When `is_sorted=False`, it tracks the maximum bookmark value across all records and emits a single STATE at the end of each batch, which reduces I/O overhead but increases the amount of data that must be replayed after a failure.
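As a rough illustration of the two paths described above, the following sketch records which bookmark values would be written as STATE for one batch. The function name `plan_state_emissions` and the sample rows are hypothetical and not part of Mage; the real `Source` class writes Singer STATE messages rather than returning a list.

```python
from typing import Any, Dict, Iterable, List


def plan_state_emissions(
    rows: Iterable[Dict[str, Any]],
    bookmark_col: str,
    is_sorted: bool = True,
) -> List[Any]:
    """Return the bookmark values that would be written as STATE for one batch."""
    emitted: List[Any] = []
    max_bookmark = None

    for row in rows:
        value = row.get(bookmark_col)
        if is_sorted:
            # Sorted: every record's bookmark is treated as a safe resume point.
            emitted.append(value)
        elif value is not None and (max_bookmark is None or value > max_bookmark):
            # Unsorted: only track the running maximum for now.
            max_bookmark = value

    if not is_sorted and max_bookmark is not None:
        # Unsorted: a single STATE at the end of the batch.
        emitted.append(max_bookmark)

    return emitted


# Hypothetical batch, in arrival order.
batch = [
    {"id": 1, "updated_at": 5},
    {"id": 2, "updated_at": 10},
    {"id": 3, "updated_at": 3},
]
sorted_batch = sorted(batch, key=lambda r: r["updated_at"])

per_record = plan_state_emissions(sorted_batch, "updated_at", is_sorted=True)
per_batch = plan_state_emissions(batch, "updated_at", is_sorted=False)
```

Here `per_record` holds one value per row, while `per_batch` holds only the batch maximum.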

Usage

Apply this heuristic when:

  • Building a new source connector: Decide whether your data source returns records sorted by the bookmark column.
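  • Debugging duplicate data after failure recovery: Check whether `is_sorted` matches the actual ordering of the source; an incorrect setting produces incorrect bookmark values.
  • Optimizing high-volume extraction: `is_sorted=False` emits one STATE per batch instead of one per record, which lowers I/O overhead. It is also the required setting whenever the source cannot guarantee sort order, because it avoids emitting incorrect intermediate bookmarks.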

The Insight (Rule of Thumb)

  • Action: Set `is_sorted=True` (default) when the data source guarantees ascending order on the bookmark (replication key) column. Set `is_sorted=False` when order is not guaranteed.
  • Value: `is_sorted=True` is the default in `Source.__init__()`.
  • Trade-off (sorted): One STATE per record = maximum recovery safety + higher I/O overhead.
  • Trade-off (unsorted): One STATE per batch = lower I/O overhead + larger replay window on failure.
  • Risk: If `is_sorted=True` but data is actually unsorted, each intermediate STATE carries the last record's bookmark rather than a safe resume point, so recovery can skip records (data gaps) or re-read them (duplicates).
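When it is unclear whether a source honors the ordering guarantee, one option is to check a sample of bookmark values before choosing the flag. The helper below is a hypothetical sketch, not part of the Mage API, and a sorted sample does not prove that every page of results is sorted.

```python
from typing import Any, Sequence


def is_non_decreasing(values: Sequence[Any]) -> bool:
    """Return True if each value is >= the one before it."""
    return all(prev <= curr for prev, curr in zip(values, values[1:]))


# Hypothetical sample of bookmark values pulled from one page of results.
sample = ["2024-01-01", "2024-01-03", "2024-01-02"]

# Fall back to the per-batch strategy when the sample is out of order.
is_sorted = is_non_decreasing(sample)
```

The resulting `is_sorted` value could then be passed to the connector's constructor, assuming the connector forwards it to `Source.__init__()`.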

Reasoning

Incremental replication relies on bookmark values to resume from where the last sync left off. If data arrives in ascending order (e.g., `ORDER BY updated_at ASC`), each record's bookmark is guaranteed to be greater than or equal to the previous one, so writing state after each record is safe.

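If data is unsorted (e.g., API endpoints returning records in arbitrary order), the bookmark at row N may not be the maximum seen so far, and it says nothing about the rows still to come. Emitting it as state is unsafe in both directions: if the saved value is lower than bookmarks already emitted, the next sync re-reads those records (duplicates); if the sync fails right after saving a high value, records with lower bookmark values that had not yet been emitted are skipped (gaps).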
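The failure modes can be made concrete with a small simulation. This is a sketch under stated assumptions: the function `resume_outcome` is hypothetical, bookmark values are distinct integers, and the next sync is assumed to filter with a strict `bookmark > saved` comparison, which may differ from a given connector.

```python
from typing import List, Set, Tuple


def resume_outcome(bookmarks: List[int], crash_after: int) -> Tuple[Set[int], Set[int]]:
    """Simulate per-record STATE, a crash, and a resume.

    `bookmarks` is the arrival order of one batch; `crash_after` (>= 1) is the
    number of records delivered before the crash.
    Returns (missed, duplicated) bookmark values.
    """
    delivered = bookmarks[:crash_after]
    saved = delivered[-1]  # per-record state: the last delivered bookmark
    resumed = [b for b in bookmarks if b > saved]  # assumed resume filter
    missed = set(bookmarks) - set(delivered) - set(resumed)
    duplicated = set(delivered) & set(resumed)
    return missed, duplicated


unsorted_batch = [5, 10, 3, 8]

# Crash right after the record with bookmark 10: lower values are never read.
gap_case = resume_outcome(unsorted_batch, crash_after=2)

# Crash right after the record with bookmark 3: higher values are read again.
dup_case = resume_outcome(unsorted_batch, crash_after=3)

# The same data in ascending order resumes cleanly.
clean_case = resume_outcome(sorted(unsorted_batch), crash_after=2)
```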

The TODO comment in the source code (`TODO (tommy dang): indicate whether data is sorted ascending on bookmark value`) suggests that the sort-order contract is a known area requiring further documentation.

Code Evidence

Default sorted flag from `sources/base.py:44,91-92`:

is_sorted: bool = True,
# TODO (tommy dang): indicate whether data is sorted ascending on bookmark value
self.is_sorted = is_sorted

Sorted path (emit state per record) from `sources/base.py:493-504`:

if self.is_sorted:
    state = {}

    for _, col in enumerate(bookmark_properties):
        singer.write_bookmark(
            state,
            tap_stream_id,
            col,
            row.get(col),
        )

    write_state(state)

Unsorted path (collect max, emit at end) from `sources/base.py:505-510`:

else:
    # If data unsorted, save max value until end of writes
    max_bookmark = max(
        max_bookmark,
        [row.get(col) for col in bookmark_properties],
    )
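The excerpt above stops before the end-of-batch write. A minimal sketch of how the collected maximum might be turned into a single state dictionary follows; the function `final_state_for_unsorted_batch` is hypothetical, the nested `bookmarks` layout mirrors what `singer.write_bookmark` produces, and bookmark values are assumed to be non-null and mutually comparable. Note that `max()` on two lists compares them element by element from the left, so with several bookmark columns the result is the row-wise maximum, not an independent maximum per column.

```python
from typing import Any, Dict, Iterable, List, Optional


def final_state_for_unsorted_batch(
    rows: Iterable[Dict[str, Any]],
    tap_stream_id: str,
    bookmark_properties: List[str],
) -> Dict[str, Any]:
    """Build one state dict from the largest bookmark list seen in the batch."""
    max_bookmark: Optional[List[Any]] = None

    for row in rows:
        candidate = [row.get(col) for col in bookmark_properties]
        # List comparison is lexicographic: the first column dominates.
        max_bookmark = candidate if max_bookmark is None else max(max_bookmark, candidate)

    if max_bookmark is None:
        # Empty batch: nothing to record.
        return {}

    bookmarks = dict(zip(bookmark_properties, max_bookmark))
    return {"bookmarks": {tap_stream_id: bookmarks}}


# Hypothetical rows with two bookmark columns.
rows = [{"a": 1, "b": 9}, {"a": 2, "b": 0}]
state = final_state_for_unsorted_batch(rows, "users", ["a", "b"])
```

In this example the saved value for `b` is 0 rather than 9, because the row with the larger `a` wins the comparison.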
