Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Principle:ChenghaoMou Text dedup Substring Deduplication

From Leeroopedia
Knowledge Sources
Domains Deduplication, NLP, Text_Processing
Last Updated 2026-02-14 21:00 GMT

Overview

A technique that maps global byte-level duplicate ranges back to per-document offsets, merges overlapping intervals, and removes duplicate substrings from individual documents.

Description

After suffix array construction identifies duplicate byte ranges in the concatenated corpus, those ranges must be (1) restored to per-document coordinates by mapping global byte offsets to document boundaries, (2) merged to handle overlapping duplicate ranges within the same document, and (3) removed by excising the duplicate byte spans from each document's text.

Two merge strategies are supported:

  • overlapping: merge any overlapping or adjacent intervals into one (aggressive — removes more text)
  • longest: keep all non-contained intervals, only removing substrings that are fully contained within a longer duplicate (conservative — preserves more unique text)

Usage

Use this principle after suffix array self-similarity detection to finalize substring-level deduplication. This is the post-processing stage that converts raw byte ranges into cleaned documents.

Theoretical Basis

Interval merging for the "overlapping" strategy:

# Abstract interval merging (NOT real implementation)
sorted_intervals = sort_by_start(intervals)
merged = [sorted_intervals[0]]
for interval in sorted_intervals[1:]:
    if interval.start <= merged[-1].stop:
        merged[-1] = (merged[-1].start, max(merged[-1].stop, interval.stop))
    else:
        merged.append(interval)

The "longest" strategy only removes intervals that are fully contained within another.

Text cleaning excises duplicate byte ranges:

# Abstract text cleaning (NOT real implementation)
result = bytearray()
pos = 0
for slice in duplicate_slices:
    result.extend(text_bytes[pos:slice.start])
    pos = slice.stop
result.extend(text_bytes[pos:])

Related Pages

Implemented By

Uses Heuristic

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment