Implementation:ChenghaoMou Text dedup SA Restore And Merge
| Knowledge Sources | |
|---|---|
| Domains | Deduplication, Text_Processing |
| Last Updated | 2026-02-14 21:00 GMT |
Overview
Concrete tool provided by text-dedup for restoring global byte ranges to per-document coordinates, merging overlapping intervals, and removing duplicate substrings.
Description
The restore_and_merge static method on SuffixArrayAlgorithmConfig orchestrates three operations: (1) restore() maps global byte ranges to per-document (index, local_slice) pairs using the document boundary offsets, (2) merge_intervals() combines overlapping or contained intervals within each document according to the merge strategy, and (3) the total number of duplicate bytes is summed. The clean_up static method then removes the identified byte ranges from each document's text using bytearray slicing.
All of these are static methods on SuffixArrayAlgorithmConfig, implemented in pure Python and using collections.deque for efficient slice processing.
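As an illustration of these steps, the sketch below is a minimal, hypothetical re-implementation in pure Python. It is not the text-dedup source: restore_sketch, merge_overlapping_sketch, and clean_up_sketch are made-up names, only plain "overlapping"-style merging is shown (the "longest" strategy and the minimum-length filter k are not reproduced), and the UTF-8 and error-handling details in clean_up_sketch are assumptions.
from collections import deque

# Illustrative re-implementation of the logic described above; not the
# text-dedup source. Names and error handling are assumptions.

def restore_sketch(boundaries: list[slice], segments: list[slice]) -> list[list[slice]]:
    # Map global byte ranges in the concatenated file back to local,
    # per-document coordinates, splitting ranges that cross a boundary.
    # Assumes boundaries are sorted and non-overlapping.
    per_doc: list[list[slice]] = [[] for _ in boundaries]
    queue = deque(sorted(segments, key=lambda s: s.start))
    for idx, doc in enumerate(boundaries):
        while queue and queue[0].start < doc.stop:
            seg = queue.popleft()
            start = max(seg.start, doc.start) - doc.start
            stop = min(seg.stop, doc.stop) - doc.start
            if stop > start:
                per_doc[idx].append(slice(start, stop))
            if seg.stop > doc.stop:
                # The remainder belongs to the next document(s).
                queue.appendleft(slice(doc.stop, seg.stop))
    return per_doc

def merge_overlapping_sketch(intervals: list[slice]) -> list[slice]:
    # Merge overlapping or contained intervals within one document
    # (roughly what an "overlapping" merge strategy would do).
    merged: list[slice] = []
    for iv in sorted(intervals, key=lambda s: (s.start, s.stop)):
        if merged and iv.start <= merged[-1].stop:
            merged[-1] = slice(merged[-1].start, max(merged[-1].stop, iv.stop))
        else:
            merged.append(iv)
    return merged

def clean_up_sketch(text: str, slices: list[slice]) -> str:
    # Remove byte ranges via bytearray slicing, deleting from the end so
    # earlier offsets stay valid. Assumes UTF-8 byte offsets; the lenient
    # decoding is an assumption, not necessarily what the library does.
    data = bytearray(text.encode("utf-8"))
    for s in sorted(slices, key=lambda x: x.start, reverse=True):
        del data[s.start : s.stop]
    return data.decode("utf-8", errors="ignore")
For example, restore_sketch([slice(0, 100), slice(100, 250)], [slice(40, 160)]) returns [[slice(40, 100)], [slice(0, 60)]], splitting the global range at the document boundary.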
Usage
Use this after suffix array self-similarity detection to finalize deduplication by restoring and removing duplicate substrings.
Code Reference
Source Location
- Repository: text-dedup
- File: src/text_dedup/config/algorithms/suffix_array.py
- Lines: L21-113 (merge_intervals), L115-189 (restore), L190-241 (restore_and_merge), L260-290 (clean_up)
Signature
class SuffixArrayAlgorithmConfig(AlgorithmConfig):
    @staticmethod
    def restore_and_merge(
        boundaries: list[slice],
        segments: str | Path | list[slice],
        k: int,
        merge_strategy: Literal["longest", "overlapping"] = "longest",
    ) -> tuple[list[list[slice]], int]:
        """Restore duplicate byte ranges to per-document coordinates and merge.

        Parameters
        ----------
        boundaries : list[slice]
            Per-document byte boundary offsets in the concatenated file.
        segments : str | Path | list[slice]
            Path to file with duplicate byte ranges, or list of slices.
        k : int
            Minimum duplicate substring byte length.
        merge_strategy : Literal["longest", "overlapping"]
            How to merge overlapping ranges.

        Returns
        -------
        tuple[list[list[slice]], int]
            Per-document list of byte ranges to remove, and total duplicate bytes.
        """
        ...

    @staticmethod
    def clean_up(text: str, slices: list[slice]) -> str:
        """Remove duplicate substrings from text.

        Parameters
        ----------
        text : str
            Document text.
        slices : list[slice]
            Byte ranges to remove.

        Returns
        -------
        str
            Text with duplicate substrings removed.
        """
        ...
Import
from text_dedup.config import SuffixArrayAlgorithmConfig
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| boundaries | list[slice] | Yes | Per-document byte offsets in concatenated file |
| segments | str or Path or list[slice] | Yes | Duplicate byte ranges file or list |
| k | int | Yes | Minimum duplicate substring byte length |
| merge_strategy | str | No | "longest" (default) or "overlapping" |
Outputs
| Name | Type | Description |
|---|---|---|
| duplicate_slices | list[list[slice]] | Per-document list of byte ranges to remove |
| duplicate_size | int | Total bytes marked as duplicate |
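Callers must construct the boundaries list themselves. One plausible way to derive it is sketched below, assuming the documents were concatenated contiguously with no separators and that offsets are counted in UTF-8 bytes; build_boundaries is a hypothetical helper, not part of text-dedup.
# Hypothetical helper (not part of text-dedup): derive boundary slices from
# document texts, assuming contiguous concatenation and UTF-8 byte offsets.
def build_boundaries(documents: list[str]) -> list[slice]:
    boundaries: list[slice] = []
    start = 0
    for text in documents:
        stop = start + len(text.encode("utf-8"))
        boundaries.append(slice(start, stop))
        start = stop
    return boundaries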
Usage Examples
Restoring and Merging Duplicate Ranges
from text_dedup.config import SuffixArrayAlgorithmConfig
# Document boundaries in concatenated file
offsets = [slice(0, 100), slice(100, 250), slice(250, 400)]
# Duplicate byte ranges from suffix array detection
duplicate_slices, total_bytes = SuffixArrayAlgorithmConfig.restore_and_merge(
boundaries=offsets,
segments="output/temp_output.txt",
k=100, # minimum 100 byte duplicates
merge_strategy="longest",
)
# Clean each document (documents: list of original texts, in the same order as offsets)
cleaned_documents = []
for idx, doc_text in enumerate(documents):
    cleaned_documents.append(SuffixArrayAlgorithmConfig.clean_up(doc_text, duplicate_slices[idx]))
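Passing Duplicate Ranges as Slices
Per the signature above, segments may also be supplied as an in-memory list of global byte ranges rather than a file path. The sketch below illustrates this form; the offsets and ranges are made-up values for illustration only.
from text_dedup.config import SuffixArrayAlgorithmConfig

# Made-up document boundaries and global duplicate byte ranges
offsets = [slice(0, 100), slice(100, 250)]
global_duplicates = [slice(10, 60), slice(120, 180)]

duplicate_slices, duplicate_size = SuffixArrayAlgorithmConfig.restore_and_merge(
    boundaries=offsets,
    segments=global_duplicates,
    k=40,  # keep only duplicates of at least 40 bytes
    merge_strategy="overlapping",
)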
Related Pages
Implements Principle
Requires Environment
- Environment:ChenghaoMou_Text_dedup_Python_3_12_Environment
- Environment:ChenghaoMou_Text_dedup_Suffix_Array_External_Tools