
Principle:ChenghaoMou Text dedup Substring Deduplication

From Leeroopedia
Knowledge Sources
Domains Deduplication, NLP, Text_Processing
Last Updated 2026-02-14 21:00 GMT

Overview

A technique that maps global byte-level duplicate ranges back to per-document offsets, merges overlapping intervals, and removes duplicate substrings from individual documents.

Description

After suffix array construction identifies duplicate byte ranges in the concatenated corpus, those ranges must be (1) restored to per-document coordinates by mapping global byte offsets to document boundaries, (2) merged to handle overlapping duplicate ranges within the same document, and (3) removed by excising the duplicate byte spans from each document's text.
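Step (1) can be sketched as a binary search over the global byte offsets at which each document starts. The function name, boundary layout, and return shape below are illustrative assumptions for this sketch, not the project's actual API:

```python
import bisect

def restore_offsets(global_ranges, doc_boundaries):
    """Map global byte ranges to per-document (local_start, local_stop) pairs.

    doc_boundaries[i] is the global byte offset where document i begins in the
    concatenated corpus; a final sentinel entry marks the corpus end.
    (Hypothetical names and layout, for illustration only.)
    """
    per_doc = {}
    for start, stop in global_ranges:
        # Locate the document whose span contains the range's start offset.
        doc = bisect.bisect_right(doc_boundaries, start) - 1
        base = doc_boundaries[doc]
        # Clip the range to the document's end so offsets stay in-bounds.
        doc_end = doc_boundaries[doc + 1]
        per_doc.setdefault(doc, []).append((start - base, min(stop, doc_end) - base))
    return per_doc
```

With boundaries `[0, 10, 25, 40]`, a global range `(12, 30)` lands in document 1 and is clipped to that document's end, yielding local offsets `(2, 15)`.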

Two merge strategies are supported:

  • overlapping: merge any overlapping or adjacent intervals into one (aggressive — removes more text)
  • longest: keep all non-contained intervals, only removing substrings that are fully contained within a longer duplicate (conservative — preserves more unique text)

Usage

Use this principle after suffix array self-similarity detection to finalize substring-level deduplication. This is the post-processing stage that converts raw byte ranges into cleaned documents.

Theoretical Basis

Interval merging for the "overlapping" strategy:

# Abstract interval merging (NOT real implementation)
sorted_intervals = sorted(intervals, key=lambda s: s.start)
merged = [sorted_intervals[0]]
for interval in sorted_intervals[1:]:
    if interval.start <= merged[-1].stop:
        # Extend the previous interval as a slice, not a tuple, so that
        # .start/.stop access stays consistent on later iterations.
        merged[-1] = slice(merged[-1].start, max(merged[-1].stop, interval.stop))
    else:
        merged.append(interval)

The "longest" strategy only removes intervals that are fully contained within another.
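The "longest" strategy can be sketched as a single pass over intervals sorted by start ascending and stop descending, so that a containing interval always precedes the intervals it contains. This is a minimal sketch of the idea, not the project's implementation:

```python
def longest(intervals):
    """Keep only intervals that are not fully contained in another interval."""
    # Sort by (start asc, stop desc): any container comes before its contents,
    # so kept intervals end up with strictly increasing stops.
    intervals = sorted(intervals, key=lambda s: (s.start, -s.stop))
    kept = []
    for iv in intervals:
        # iv.start >= kept[-1].start by sort order, so iv is contained in
        # the last kept interval exactly when its stop does not extend past it.
        if kept and iv.stop <= kept[-1].stop:
            continue
        kept.append(iv)
    return kept
```

For example, `[slice(0, 10), slice(2, 5), slice(8, 15), slice(8, 12)]` keeps only `slice(0, 10)` and `slice(8, 15)`: the two shorter intervals are fully contained and are dropped.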

Text cleaning excises duplicate byte ranges:

# Abstract text cleaning (NOT real implementation)
result = bytearray()
pos = 0
for dup in duplicate_slices:          # avoid shadowing the built-in `slice`
    result.extend(text_bytes[pos:dup.start])
    pos = dup.stop
result.extend(text_bytes[pos:])
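Wrapped into a function, the cleaning sketch above looks like the following. The slices must already be sorted and non-overlapping, i.e. produced by one of the merge strategies; the function name is an assumption for illustration:

```python
def remove_slices(text_bytes: bytes, duplicate_slices) -> bytes:
    """Excise duplicate byte spans; slices must be sorted and non-overlapping."""
    result = bytearray()
    pos = 0
    for dup in duplicate_slices:
        result.extend(text_bytes[pos:dup.start])  # keep bytes before the duplicate
        pos = dup.stop                            # skip over the duplicate span
    result.extend(text_bytes[pos:])               # keep the remaining tail
    return bytes(result)

remove_slices(b"hello hello world", [slice(5, 11)])  # → b"hello world"
```

Note that operating on bytes rather than str matters here: the suffix array reports byte offsets, and slicing a str at those positions could split a multi-byte UTF-8 character.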

Related Pages

Implemented By

Uses Heuristic
