Heuristic:ChenghaoMou Text dedup Suffix Array Merge Strategy
| Knowledge Sources | |
|---|---|
| Domains | Optimization, Text_Deduplication, Suffix_Arrays |
| Last Updated | 2026-02-14 21:00 GMT |
Overview
Suffix array deduplication offers two merge strategies: `"longest"` (default, conservative) and `"overlapping"` (aggressive), with important correctness implications for overlapping duplicate spans.
Description
When the suffix array algorithm finds duplicate substrings, those substrings often overlap. For example, if `[2,4]` and `[3,5]` are both identified as duplicates, the question is whether the combined range `[2,5]` should also be treated as a duplicate. The `"overlapping"` strategy merges all overlapping intervals into one, while the `"longest"` strategy keeps individual duplicate spans without merging them (only removing fully contained substrings).
The default `"longest"` strategy is more conservative and correct because overlapping substrings being individually duplicated does not imply their union is also duplicated elsewhere.
Usage
Use `merge_strategy="longest"` (default) for most deduplication tasks. Switch to `merge_strategy="overlapping"` only when you want to aggressively remove all text within overlapping duplicate regions and accept potential over-deletion. The "overlapping" strategy is more aggressive and removes more text.
The Insight (Rule of Thumb)
- Action: Use the default `merge_strategy="longest"` for safe deduplication.
- Value: `"longest"` removes only individually confirmed duplicate spans; `"overlapping"` merges them.
- Trade-off: `"longest"` may leave some redundant text but avoids removing non-duplicate content. `"overlapping"` removes more text but may over-delete.
- Example: Given duplicates `[2,4]` and `[3,5]`:
- `"longest"` keeps both as separate spans
- `"overlapping"` merges to `[2,5]`
Reasoning
The key insight is documented in the source: "when [2, 4] and [3, 5] are duplicates, [2, 5] might be not." Two overlapping duplicate spans may each independently occur elsewhere in the corpus, but the combined span may be unique to this document. Merging them into `[2, 5]` and deleting the entire range could remove text that should have been preserved.
The `"longest"` strategy works by sorting intervals and then only skipping substrings that are fully contained within a previously seen interval. The `"overlapping"` strategy merges any intervals that touch or overlap.
Code Evidence
Strategy documentation from `src/text_dedup/config/algorithms/suffix_array.py:21-36`:
merge_strategy: Literal["longest", "overlapping"] = "longest"
@staticmethod
def merge_intervals(
intervals: list[slice],
merge_strategy: Literal["longest", "overlapping"] = "longest",
) -> list[slice]:
"""
Merge overlapping intervals.
Parameters
----------
merge_strategy : Literal["longest", "overlapping"]
Strategy to merge intervals, by default "longest"
"overlapping": merge overlapping intervals
"longest": only ignore duplicate substrings, this is useful because
when [2, 4] and [3, 5] are duplicates, [2, 5] might be not
"""
The "longest" strategy logic from `src/text_dedup/config/algorithms/suffix_array.py:107-111`:
elif merge_strategy == "longest":
if current.stop <= prev.stop: # ignore substrings
continue
else:
merged.append(current)
The "overlapping" strategy logic from `src/text_dedup/config/algorithms/suffix_array.py:102-106`:
if merge_strategy == "overlapping":
if prev.stop >= current.start:
merged[-1] = slice(prev.start, max(prev.stop, current.stop))
else:
merged.append(current)