Heuristic:ChenghaoMou Text dedup Suffix Array Merge Strategy

Knowledge Sources	text-dedup Internal
Domains	Optimization, Text_Deduplication, Suffix_Arrays
Last Updated	2026-02-14 21:00 GMT

Overview

Suffix array deduplication offers two merge strategies: `"longest"` (default, conservative) and `"overlapping"` (aggressive), with important correctness implications for overlapping duplicate spans.

Description

When the suffix array algorithm finds duplicate substrings, those substrings often overlap. For example, if `[2,4]` and `[3,5]` are both identified as duplicates, the question is whether the combined range `[2,5]` should also be treated as a duplicate. The `"overlapping"` strategy merges all overlapping intervals into one, while the `"longest"` strategy keeps individual duplicate spans without merging them (only removing fully contained substrings).

The default `"longest"` strategy is more conservative and correct because overlapping substrings being individually duplicated does not imply their union is also duplicated elsewhere.

Usage

Use `merge_strategy="longest"` (default) for most deduplication tasks. Switch to `merge_strategy="overlapping"` only when you want to aggressively remove all text within overlapping duplicate regions and accept potential over-deletion. The "overlapping" strategy is more aggressive and removes more text.

The Insight (Rule of Thumb)

Action: Use the default `merge_strategy="longest"` for safe deduplication.
Value: `"longest"` removes only individually confirmed duplicate spans; `"overlapping"` merges them.
Trade-off: `"longest"` may leave some redundant text but avoids removing non-duplicate content. `"overlapping"` removes more text but may over-delete.
Example: Given duplicates `[2,4]` and `[3,5]`:
- `"longest"` keeps both as separate spans
- `"overlapping"` merges to `[2,5]`

Reasoning

The key insight is documented in the source: "when [2, 4] and [3, 5] are duplicates, [2, 5] might be not." Two overlapping duplicate spans may each independently occur elsewhere in the corpus, but the combined span may be unique to this document. Merging them into `[2, 5]` and deleting the entire range could remove text that should have been preserved.

The `"longest"` strategy works by sorting intervals and then only skipping substrings that are fully contained within a previously seen interval. The `"overlapping"` strategy merges any intervals that touch or overlap.

Code Evidence

Strategy documentation from `src/text_dedup/config/algorithms/suffix_array.py:21-36`:

merge_strategy: Literal["longest", "overlapping"] = "longest"

@staticmethod
def merge_intervals(
    intervals: list[slice],
    merge_strategy: Literal["longest", "overlapping"] = "longest",
) -> list[slice]:
    """
    Merge overlapping intervals.

    Parameters
    ----------
    merge_strategy : Literal["longest", "overlapping"]
        Strategy to merge intervals, by default "longest"
        "overlapping": merge overlapping intervals
        "longest": only ignore duplicate substrings, this is useful because
        when [2, 4] and [3, 5] are duplicates, [2, 5] might be not
    """

The "longest" strategy logic from `src/text_dedup/config/algorithms/suffix_array.py:107-111`:

elif merge_strategy == "longest":
    if current.stop <= prev.stop:  # ignore substrings
        continue
    else:
        merged.append(current)

The "overlapping" strategy logic from `src/text_dedup/config/algorithms/suffix_array.py:102-106`:

if merge_strategy == "overlapping":
    if prev.stop >= current.start:
        merged[-1] = slice(prev.start, max(prev.stop, current.stop))
    else:
        merged.append(current)

Related Pages

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment