Implementation:ChenghaoMou Text dedup SA Restore And Merge
| Knowledge Sources | |
|---|---|
| Domains | Deduplication, Text_Processing |
| Last Updated | 2026-02-14 21:00 GMT |
Overview
Concrete tool provided by text-dedup for restoring global byte ranges to per-document coordinates, merging overlapping intervals, and removing duplicate substrings.
Description
The restore_and_merge static method on SuffixArrayAlgorithmConfig orchestrates three operations: (1) restore() maps global byte ranges to per-document (index, local_slice) pairs using the document boundary offsets, (2) merge_intervals() combines overlapping or contained intervals within each document according to the merge strategy, and (3) the total number of duplicate bytes is summed. The clean_up static method then removes the identified byte ranges from each document's text using bytearray slicing.
All of these are static methods on SuffixArrayAlgorithmConfig, implemented in pure Python and using collections.deque for efficient slice processing.
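As an illustration of these steps, the sketch below is a minimal, hypothetical re-implementation in pure Python. It is not the text-dedup source: restore_sketch, merge_overlapping_sketch, and clean_up_sketch are made-up names, only plain "overlapping"-style merging is shown (the "longest" strategy and the minimum-length filter k are not reproduced), and the UTF-8 and error-handling details in clean_up_sketch are assumptions.
from collections import deque

# Illustrative re-implementation of the logic described above; not the
# text-dedup source. Names and error handling are assumptions.

def restore_sketch(boundaries: list[slice], segments: list[slice]) -> list[list[slice]]:
    # Map global byte ranges in the concatenated file back to local,
    # per-document coordinates, splitting ranges that cross a boundary.
    # Assumes boundaries are sorted and non-overlapping.
    per_doc: list[list[slice]] = [[] for _ in boundaries]
    queue = deque(sorted(segments, key=lambda s: s.start))
    for idx, doc in enumerate(boundaries):
        while queue and queue[0].start < doc.stop:
            seg = queue.popleft()
            start = max(seg.start, doc.start) - doc.start
            stop = min(seg.stop, doc.stop) - doc.start
            if stop > start:
                per_doc[idx].append(slice(start, stop))
            if seg.stop > doc.stop:
                # The remainder belongs to the next document(s).
                queue.appendleft(slice(doc.stop, seg.stop))
    return per_doc

def merge_overlapping_sketch(intervals: list[slice]) -> list[slice]:
    # Merge overlapping or contained intervals within one document
    # (roughly what an "overlapping" merge strategy would do).
    merged: list[slice] = []
    for iv in sorted(intervals, key=lambda s: (s.start, s.stop)):
        if merged and iv.start <= merged[-1].stop:
            merged[-1] = slice(merged[-1].start, max(merged[-1].stop, iv.stop))
        else:
            merged.append(iv)
    return merged

def clean_up_sketch(text: str, slices: list[slice]) -> str:
    # Remove byte ranges via bytearray slicing, deleting from the end so
    # earlier offsets stay valid. Assumes UTF-8 byte offsets; the lenient
    # decoding is an assumption, not necessarily what the library does.
    data = bytearray(text.encode("utf-8"))
    for s in sorted(slices, key=lambda x: x.start, reverse=True):
        del data[s.start : s.stop]
    return data.decode("utf-8", errors="ignore")
For example, restore_sketch([slice(0, 100), slice(100, 250)], [slice(40, 160)]) returns [[slice(40, 100)], [slice(0, 60)]], splitting the global range at the document boundary.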
Usage
Use this after suffix array self-similarity detection to finalize deduplication by restoring and removing duplicate substrings.
Code Reference
Source Location
- Repository: text-dedup
- File: src/text_dedup/config/algorithms/suffix_array.py
- Lines: L21-113 (merge_intervals), L115-189 (restore), L190-241 (restore_and_merge), L260-290 (clean_up)
Signature
class SuffixArrayAlgorithmConfig(AlgorithmConfig):
    @staticmethod
    def restore_and_merge(
        boundaries: list[slice],
        segments: str | Path | list[slice],
        k: int,
        merge_strategy: Literal["longest", "overlapping"] = "longest",
    ) -> tuple[list[list[slice]], int]:
        """Restore duplicate byte ranges to per-document coordinates and merge.

        Parameters
        ----------
        boundaries : list[slice]
            Per-document byte boundary offsets in the concatenated file.
        segments : str | Path | list[slice]
            Path to file with duplicate byte ranges, or list of slices.
        k : int
            Minimum duplicate substring byte length.
        merge_strategy : Literal["longest", "overlapping"]
            How to merge overlapping ranges.

        Returns
        -------
        tuple[list[list[slice]], int]
            Per-document list of byte ranges to remove, and total duplicate bytes.
        """
        ...

    @staticmethod
    def clean_up(text: str, slices: list[slice]) -> str:
        """Remove duplicate substrings from text.

        Parameters
        ----------
        text : str
            Document text.
        slices : list[slice]
            Byte ranges to remove.

        Returns
        -------
        str
            Text with duplicate substrings removed.
        """
        ...
Import
from text_dedup.config import SuffixArrayAlgorithmConfig
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| boundaries | list[slice] | Yes | Per-document byte offsets in concatenated file |
| segments | str or Path or list[slice] | Yes | Duplicate byte ranges file or list |
| k | int | Yes | Minimum duplicate substring byte length |
| merge_strategy | str | No | "longest" (default) or "overlapping" |
Outputs
| Name | Type | Description |
|---|---|---|
| duplicate_slices | list[list[slice]] | Per-document list of byte ranges to remove |
| duplicate_size | int | Total bytes marked as duplicate |
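Callers must construct the boundaries list themselves. One plausible way to derive it is sketched below, assuming the documents were concatenated contiguously with no separators and that offsets are counted in UTF-8 bytes; build_boundaries is a hypothetical helper, not part of text-dedup.
# Hypothetical helper (not part of text-dedup): derive boundary slices from
# document texts, assuming contiguous concatenation and UTF-8 byte offsets.
def build_boundaries(documents: list[str]) -> list[slice]:
    boundaries: list[slice] = []
    start = 0
    for text in documents:
        stop = start + len(text.encode("utf-8"))
        boundaries.append(slice(start, stop))
        start = stop
    return boundaries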
Usage Examples
Restoring and Merging Duplicate Ranges
from text_dedup.config import SuffixArrayAlgorithmConfig
# Document boundaries in concatenated file
offsets = [slice(0, 100), slice(100, 250), slice(250, 400)]
# Duplicate byte ranges from suffix array detection
duplicate_slices, total_bytes = SuffixArrayAlgorithmConfig.restore_and_merge(
boundaries=offsets,
segments="output/temp_output.txt",
k=100, # minimum 100 byte duplicates
merge_strategy="longest",
)
# Clean each document (documents: list of original texts, in the same order as offsets)
cleaned_documents = []
for idx, doc_text in enumerate(documents):
    cleaned_documents.append(SuffixArrayAlgorithmConfig.clean_up(doc_text, duplicate_slices[idx]))
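Passing Duplicate Ranges as Slices
Per the signature above, segments may also be supplied as an in-memory list of global byte ranges rather than a file path. The sketch below illustrates this form; the offsets and ranges are made-up values for illustration only.
from text_dedup.config import SuffixArrayAlgorithmConfig

# Made-up document boundaries and global duplicate byte ranges
offsets = [slice(0, 100), slice(100, 250)]
global_duplicates = [slice(10, 60), slice(120, 180)]

duplicate_slices, duplicate_size = SuffixArrayAlgorithmConfig.restore_and_merge(
    boundaries=offsets,
    segments=global_duplicates,
    k=40,  # keep only duplicates of at least 40 bytes
    merge_strategy="overlapping",
)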
Related Pages
Implements Principle
Requires Environment
- Environment:ChenghaoMou_Text_dedup_Python_3_12_Environment
- Environment:ChenghaoMou_Text_dedup_Suffix_Array_External_Tools