Principle: ChenghaoMou Text dedup Substring Deduplication
| Knowledge Sources | Details |
|---|---|
| Domains | Deduplication, NLP, Text_Processing |
| Last Updated | 2026-02-14 21:00 GMT |
Overview
A technique that maps global byte-level duplicate ranges back to per-document offsets, merges overlapping intervals, and removes duplicate substrings from individual documents.
Description
After suffix array construction identifies duplicate byte ranges in the concatenated corpus, those ranges must be:
1. restored to per-document coordinates, by mapping global byte offsets back to document boundaries;
2. merged, to resolve overlapping duplicate ranges within the same document;
3. removed, by excising the duplicate byte spans from each document's text.
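Step (1) is essentially an offset translation. A minimal sketch, assuming documents were concatenated directly (no separator bytes) and their byte lengths are known; the names here (to_document_offsets, doc_lengths) are hypothetical, not the text-dedup API:

    # Hedged sketch of step (1); hypothetical names, NOT the real implementation.
    import bisect

    def to_document_offsets(global_ranges, doc_lengths):
        # starts[i] is the global byte offset where document i begins.
        starts = [0]
        for length in doc_lengths:
            starts.append(starts[-1] + length)
        per_doc = {}
        for g_start, g_stop in global_ranges:
            # Locate the document containing the range's first byte.
            doc_id = bisect.bisect_right(starts, g_start) - 1
            # A range spanning several documents would need splitting;
            # for brevity this sketch clamps it to the first document.
            local_start = g_start - starts[doc_id]
            local_stop = min(g_stop, starts[doc_id + 1]) - starts[doc_id]
            per_doc.setdefault(doc_id, []).append((local_start, local_stop))
        return per_doc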
Two merge strategies are supported:
- overlapping: merge any overlapping or adjacent intervals into one (aggressive — removes more text)
- longest: keep all non-contained intervals, only removing substrings that are fully contained within a longer duplicate (conservative — preserves more unique text)
Usage
Use this principle after suffix array self-similarity detection to finalize substring-level deduplication. This is the post-processing stage that converts raw byte ranges into cleaned documents.
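To make the data flow concrete, here is a sketch that composes the three steps using the illustrative helpers on this page (to_document_offsets above, merge_overlapping and remove_intervals under Theoretical Basis below). It is an assumption-laden outline, not the actual text-dedup pipeline:

    # Illustrative composition of steps (1)-(3); NOT the real pipeline.
    def dedup_substrings(docs, global_ranges):
        # docs: list of bytes; global_ranges: (start, stop) pairs in corpus coordinates.
        per_doc = to_document_offsets(global_ranges, [len(d) for d in docs])
        cleaned = []
        for i, doc in enumerate(docs):
            intervals = per_doc.get(i)
            if intervals:
                doc = remove_intervals(doc, merge_overlapping(intervals))
            cleaned.append(doc)
        return cleaned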
Theoretical Basis
Interval merging for the "overlapping" strategy:
    # Abstract interval merging (NOT real implementation)
    def merge_overlapping(intervals):
        # Merge overlapping or adjacent (start, stop) pairs into single intervals.
        if not intervals:
            return []
        sorted_intervals = sorted(intervals, key=lambda iv: iv[0])
        merged = [sorted_intervals[0]]
        for start, stop in sorted_intervals[1:]:
            if start <= merged[-1][1]:
                # Overlaps or touches the previous interval: extend it.
                merged[-1] = (merged[-1][0], max(merged[-1][1], stop))
            else:
                merged.append((start, stop))
        return merged
The "longest" strategy only removes intervals that are fully contained within another.
Text cleaning excises duplicate byte ranges:
    # Abstract text cleaning (NOT real implementation)
    def remove_intervals(text_bytes, intervals):
        # intervals must be sorted, non-overlapping (start, stop) pairs.
        result = bytearray()
        pos = 0
        for start, stop in intervals:
            result.extend(text_bytes[pos:start])  # keep bytes before the duplicate
            pos = stop                            # skip the duplicate span
        result.extend(text_bytes[pos:])  # keep the tail after the last duplicate
        return bytes(result)
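For example, merging two overlapping ranges and excising the result from a short byte string (contrived inputs, not actual suffix-array output):

    text = b"the cat sat on the mat, the cat sat again"
    dups = merge_overlapping([(0, 7), (4, 11)])  # -> [(0, 11)]
    print(remove_intervals(text, dups))          # b" on the mat, the cat sat again"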