Heuristic:Mage ai Mage ai Sorted Data Bookmark Strategy
| Knowledge Sources | |
|---|---|
| Domains | Optimization, Data_Integration |
| Last Updated | 2026-02-09 07:00 GMT |
Overview
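Emit a STATE message after every record when the data is sorted on the bookmark column; when it is unsorted, defer to a single STATE at the end of each batch, accepting a larger replay window in exchange for correct bookmarks and fewer writes.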
Description
The Source base class assumes data is sorted ascending on the bookmark column by default (`is_sorted=True`). When sorted, it emits a STATE message after every record, providing maximum recovery granularity. When `is_sorted=False`, it tracks the maximum bookmark value across all records and emits a single STATE at the end of each batch, reducing I/O overhead but increasing the amount of data that must be re-extracted after a failure.
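The two behaviours can be illustrated with a minimal, standalone sketch. It is not the Mage implementation: `emit_states` and its parameters are hypothetical names, and the function returns the bookmark values that would be written as STATE rather than writing anything.

```python
from typing import Any, Dict, Iterable, List


def emit_states(
    rows: Iterable[Dict[str, Any]],
    bookmark_col: str,
    is_sorted: bool = True,
) -> List[Any]:
    """Return the bookmark values that would be written as STATE messages."""
    emitted: List[Any] = []
    max_bookmark = None

    for row in rows:
        value = row[bookmark_col]
        if is_sorted:
            # Sorted input: every row's bookmark is a safe resume point.
            emitted.append(value)
        elif max_bookmark is None or value > max_bookmark:
            # Unsorted input: only remember the largest value seen so far.
            max_bookmark = value

    if not is_sorted and max_bookmark is not None:
        # Unsorted input: one STATE for the whole batch.
        emitted.append(max_bookmark)

    return emitted


batch = [
    {"id": 1, "updated_at": 10},
    {"id": 2, "updated_at": 20},
    {"id": 3, "updated_at": 30},
]
sorted_states = emit_states(batch, "updated_at", is_sorted=True)
unsorted_states = emit_states(batch, "updated_at", is_sorted=False)
```

For this three-row batch the sorted path yields three state values and the unsorted path yields one, which is the I/O difference described above.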
Usage
Apply this heuristic when:
- Building a new source connector: Decide whether your data source returns records sorted by the bookmark column.
- Debugging duplicate or missing data after failure recovery: Check that `is_sorted` matches the source's actual ordering; a wrong setting produces incorrect bookmark values.
- Optimizing high-volume extraction: Set `is_sorted=False` if the source cannot guarantee sort order, to avoid emitting incorrect intermediate bookmarks.
The Insight (Rule of Thumb)
- Action: Set `is_sorted=True` (default) when the data source guarantees ascending order on the bookmark (replication key) column. Set `is_sorted=False` when order is not guaranteed.
- Value: `is_sorted=True` is the default in `Source.__init__()`.
- Trade-off (sorted): One STATE per record = maximum recovery safety + higher I/O overhead.
- Trade-off (unsorted): One STATE per batch = lower I/O overhead + larger replay window on failure.
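- Risk: If `is_sorted=True` but the data is actually unsorted, an intermediate STATE message may carry a bookmark that is higher than records not yet processed (causing data gaps if the sync fails and resumes from it) or lower than the maximum already seen (causing duplicates on the next sync).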
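When it is unclear whether a source returns rows in bookmark order, a quick check on a sample of rows can inform the choice. The helper below is a hypothetical sketch, not part of Mage; a sample that passes does not prove the source guarantees ordering, so a documented `ORDER BY` or API contract is still the stronger evidence.

```python
from typing import Any, Dict, Iterable


def is_ascending(rows: Iterable[Dict[str, Any]], bookmark_col: str) -> bool:
    """Return True if bookmark values never decrease across the rows."""
    previous = None
    for row in rows:
        value = row[bookmark_col]
        if previous is not None and value < previous:
            return False
        previous = value
    return True


sample_ok = [{"updated_at": 1}, {"updated_at": 1}, {"updated_at": 2}]
sample_bad = [{"updated_at": 2}, {"updated_at": 1}]

ordered = is_ascending(sample_ok, "updated_at")
unordered = is_ascending(sample_bad, "updated_at")
```

Equal consecutive values are accepted because the sorted path only requires each bookmark to be greater than or equal to the previous one.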
Reasoning
Incremental replication relies on bookmark values to resume from where the last sync left off. If data arrives in ascending order (e.g., `ORDER BY updated_at ASC`), each record's bookmark is guaranteed to be greater than or equal to the previous one, so writing state after each record is safe.
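If data is unsorted (e.g., API endpoints returning records in arbitrary order), the bookmark at row N may not be the maximum seen so far, and rows after N may carry lower bookmark values. Emitting row N's bookmark as state is therefore unsafe. If the sync fails right after a high value is written, the next sync resumes past records with lower bookmarks that had not yet been processed. If a low value is written last, the next sync re-reads records that were already extracted.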
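The effect of writing per-record state over unsorted data can be simulated in a few lines. This is a toy model under stated assumptions: `run_sync` is a hypothetical function, bookmarks are plain integers, and a resumed sync is assumed to fetch only rows whose bookmark is strictly greater than the saved state.

```python
from typing import List, Optional, Tuple


def run_sync(
    bookmarks: List[int],
    resume_from: Optional[int] = None,
    fail_after: Optional[int] = None,
) -> Tuple[List[int], Optional[int]]:
    """Process rows in source order, saving state after every row."""
    processed: List[int] = []
    state = resume_from
    for value in bookmarks:
        if resume_from is not None and value <= resume_from:
            continue  # filtered out by the incremental query
        if fail_after is not None and len(processed) == fail_after:
            break  # simulated crash
        processed.append(value)
        state = value  # per-record STATE, as in the sorted path
    return processed, state


# Gap: crash after the high bookmark 30 was saved, before 20 was read.
source = [10, 30, 20, 40]
first, saved = run_sync(source, fail_after=2)
second, _ = run_sync(source, resume_from=saved)
missing = sorted(set(source) - set(first) - set(second))

# Duplicate: a complete run ends on a bookmark below the maximum.
source2 = [10, 40, 20, 30]
run1, saved2 = run_sync(source2)
run2, _ = run_sync(source2, resume_from=saved2)
duplicates = sorted(set(run1) & set(run2))
```

In the first scenario the row with bookmark 20 is never extracted; in the second, the row with bookmark 40 is extracted twice.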
The TODO comment in the source code (`TODO (tommy dang): indicate whether data is sorted ascending on bookmark value`) suggests that the sort-order assumption is a known area needing further documentation.
Code Evidence
Default sorted flag from `sources/base.py:44,91-92`:
```python
        is_sorted: bool = True,
        # ...
        # TODO (tommy dang): indicate whether data is sorted ascending on bookmark value
        self.is_sorted = is_sorted
```
Sorted path (emit state per record) from `sources/base.py:493-504`:
```python
if self.is_sorted:
    state = {}

    for _, col in enumerate(bookmark_properties):
        singer.write_bookmark(
            state,
            tap_stream_id,
            col,
            row.get(col),
        )

    write_state(state)
```
Unsorted path (collect max, emit at end) from `sources/base.py:505-510`:
```python
else:
    # If data unsorted, save max value until end of writes
    max_bookmark = max(
        max_bookmark,
        [row.get(col) for col in bookmark_properties],
    )
```
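The excerpt compares lists of bookmark values, which Python orders element by element, so with several bookmark columns the first column dominates and later columns break ties. The sketch below illustrates that comparison and one way the collected maximum could be turned into a state dictionary at the end of a batch. The function names and the `None` starting value are assumptions for illustration; the excerpt does not show how `max_bookmark` is initialised or emitted.

```python
from typing import Any, Dict, List, Optional


def max_bookmark_for_batch(
    rows: List[Dict[str, Any]],
    bookmark_properties: List[str],
) -> Optional[List[Any]]:
    """Return the largest list of bookmark values across the batch."""
    max_bookmark: Optional[List[Any]] = None
    for row in rows:
        current = [row.get(col) for col in bookmark_properties]
        # Lists compare lexicographically: first column, then the next on ties.
        max_bookmark = current if max_bookmark is None else max(max_bookmark, current)
    return max_bookmark


def build_state(
    tap_stream_id: str,
    bookmark_properties: List[str],
    max_bookmark: List[Any],
) -> Dict[str, Any]:
    """Build a Singer-style state: {'bookmarks': {stream: {column: value}}}."""
    return {"bookmarks": {tap_stream_id: dict(zip(bookmark_properties, max_bookmark))}}


rows = [
    {"updated_at": "2024-01-02", "id": 5},
    {"updated_at": "2024-01-03", "id": 1},
    {"updated_at": "2024-01-03", "id": 4},
    {"updated_at": "2024-01-01", "id": 9},
]
columns = ["updated_at", "id"]
batch_max = max_bookmark_for_batch(rows, columns)
state = build_state("users", columns, batch_max)
```

Note that a row with a missing bookmark value would put `None` into the list, and comparing `None` with a string or number raises `TypeError` in Python 3, so real connectors need to handle that case.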