
Implementation:ChenghaoMou Text dedup SA Restore And Merge

From Leeroopedia
Knowledge Sources
Domains Deduplication, Text_Processing
Last Updated 2026-02-14 21:00 GMT

Overview

Concrete tool, provided by text-dedup, for restoring global byte ranges to per-document coordinates, merging overlapping intervals, and removing duplicate substrings.

Description

The restore_and_merge static method on SuffixArrayAlgorithmConfig orchestrates three operations: (1) restore() maps global byte ranges to per-document (index, local_slice) pairs using document boundary offsets, (2) merge_intervals() combines overlapping or contained intervals per document according to the merge strategy, and (3) the total duplicate byte count is computed. The clean_up static method then removes the identified byte ranges from each document's text using bytearray slicing.
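The restore step can be pictured as a binary search over document start offsets. The sketch below is illustrative, not the library's code: it maps each global byte range back to a (document index, local slice) pair and, for simplicity, clips a range at the boundary of the document it starts in (the real restore() also has to handle ranges that span documents).

```python
import bisect

def restore(boundaries: list[slice], global_ranges: list[slice]) -> list[tuple[int, slice]]:
    # Illustrative sketch: map each global byte range in the
    # concatenated file back to (doc_index, local_slice) using
    # the per-document boundary offsets.
    starts = [b.start for b in boundaries]
    out: list[tuple[int, slice]] = []
    for r in global_ranges:
        # find the document whose start offset precedes this range
        idx = bisect.bisect_right(starts, r.start) - 1
        doc = boundaries[idx]
        # clip to the document it starts in (simplification)
        local = slice(r.start - doc.start, min(r.stop, doc.stop) - doc.start)
        out.append((idx, local))
    return out

boundaries = [slice(0, 100), slice(100, 250)]
print(restore(boundaries, [slice(90, 120)]))
# global range 90..120 starts in doc 0 and is clipped at its boundary:
# [(0, slice(90, 100))]
```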

These are all static methods on SuffixArrayAlgorithmConfig and use pure Python with collections.deque for efficient slice processing.
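A minimal deque-based sketch of the "longest"-style merge, for intuition only: overlapping or contained byte ranges are collapsed into their combined extent. The real merge_intervals also supports the "overlapping" strategy and is not reproduced here.

```python
from collections import deque

def merge_overlapping(slices: list[slice]) -> list[slice]:
    # Illustrative sketch, not the library's implementation:
    # merge sorted byte ranges that overlap or are contained,
    # keeping the combined (longest) extent per cluster.
    queue = deque(sorted(slices, key=lambda s: (s.start, s.stop)))
    merged: list[slice] = []
    while queue:
        current = queue.popleft()
        if merged and current.start <= merged[-1].stop:
            # overlaps (or touches) the previous range: extend it
            last = merged.pop()
            merged.append(slice(last.start, max(last.stop, current.stop)))
        else:
            merged.append(current)
    return merged

print(merge_overlapping([slice(0, 10), slice(5, 20), slice(30, 40)]))
# [slice(0, 20), slice(30, 40)]
```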

Usage

Use this after suffix array self-similarity detection to finalize deduplication by restoring and removing duplicate substrings.

Code Reference

Source Location

  • Repository: text-dedup
  • File: src/text_dedup/config/algorithms/suffix_array.py
  • Lines: L21-113 (merge_intervals), L115-189 (restore), L190-241 (restore_and_merge), L260-290 (clean_up)

Signature

class SuffixArrayAlgorithmConfig(AlgorithmConfig):
    @staticmethod
    def restore_and_merge(
        boundaries: list[slice],
        segments: str | Path | list[slice],
        k: int,
        merge_strategy: Literal["longest", "overlapping"] = "longest",
    ) -> tuple[list[list[slice]], int]:
        """Restore duplicate byte ranges to per-document coordinates and merge.

        Parameters
        ----------
        boundaries : list[slice]
            Per-document byte boundary offsets in the concatenated file.
        segments : str | Path | list[slice]
            Path to file with duplicate byte ranges, or list of slices.
        k : int
            Minimum duplicate substring byte length.
        merge_strategy : Literal["longest", "overlapping"]
            How to merge overlapping ranges.

        Returns
        -------
        tuple[list[list[slice]], int]
            Per-document list of byte ranges to remove, and total duplicate bytes.
        """

    @staticmethod
    def clean_up(text: str, slices: list[slice]) -> str:
        """Remove duplicate substrings from text.

        Parameters
        ----------
        text : str
            Document text.
        slices : list[slice]
            Byte ranges to remove.

        Returns
        -------
        str
            Text with duplicate substrings removed.
        """

Import

from text_dedup.config import SuffixArrayAlgorithmConfig

I/O Contract

Inputs

  • boundaries (list[slice], required): Per-document byte offsets in the concatenated file
  • segments (str | Path | list[slice], required): Path to file with duplicate byte ranges, or list of slices
  • k (int, required): Minimum duplicate substring byte length
  • merge_strategy (str, optional): "longest" (default) or "overlapping"

Outputs

  • duplicate_slices (list[list[slice]]): Per-document list of byte ranges to remove
  • duplicate_size (int): Total bytes marked as duplicate
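The two outputs are related: assuming duplicate_size is the sum of the lengths of all merged per-document byte ranges (a plausible reading of the contract above, not confirmed from source), it can be recomputed from duplicate_slices directly.

```python
# Hypothetical consistency check, assuming duplicate_size equals
# the summed length of every merged per-document byte range.
duplicate_slices = [[slice(0, 5)], [], [slice(10, 30), slice(40, 45)]]
duplicate_size = sum(s.stop - s.start for doc in duplicate_slices for s in doc)
print(duplicate_size)  # 5 + 20 + 5 = 30
```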

Usage Examples

Restoring and Merging Duplicate Ranges

from text_dedup.config import SuffixArrayAlgorithmConfig

# Document boundaries in concatenated file
offsets = [slice(0, 100), slice(100, 250), slice(250, 400)]

# Duplicate byte ranges from suffix array detection
duplicate_slices, total_bytes = SuffixArrayAlgorithmConfig.restore_and_merge(
    boundaries=offsets,
    segments="output/temp_output.txt",
    k=100,  # minimum 100 byte duplicates
    merge_strategy="longest",
)

# Clean each document
for idx, doc_text in enumerate(documents):
    cleaned = SuffixArrayAlgorithmConfig.clean_up(doc_text, duplicate_slices[idx])
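For intuition, the clean-up step can be sketched as bytearray deletion, which the Description says the real clean_up uses. This is a simplified stand-in, assuming the slices are non-overlapping byte ranges (as a merge step would produce); deleting from the end keeps earlier offsets valid.

```python
def remove_ranges(text: str, slices: list[slice]) -> str:
    # Hypothetical sketch of byte-range removal via bytearray
    # slicing; not the library's clean_up. Assumes slices are
    # non-overlapping and expressed in byte coordinates.
    data = bytearray(text.encode("utf-8"))
    # delete right-to-left so earlier byte offsets stay valid
    for s in sorted(slices, key=lambda x: x.start, reverse=True):
        del data[s]
    # a deleted range may split a multi-byte character; ignore strays
    return data.decode("utf-8", errors="ignore")

print(remove_ranges("hello duplicated duplicated world", [slice(6, 17)]))
# -> "hello duplicated world"
```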
