Principle: ChenghaoMou Text dedup Substring Deduplication
| Knowledge Sources | Details |
|---|---|
| Domains | Deduplication, NLP, Text_Processing |
| Last Updated | 2026-02-14 21:00 GMT |
Overview
A technique that maps global byte-level duplicate ranges back to per-document offsets, merges overlapping intervals, and removes duplicate substrings from individual documents.
Description
After suffix array construction identifies duplicate byte ranges in the concatenated corpus, those ranges must be:
1. restored to per-document coordinates, by mapping global byte offsets back to document boundaries;
2. merged, to resolve overlapping duplicate ranges within the same document;
3. removed, by excising the duplicate byte spans from each document's text.
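Step (1) is essentially an offset translation. A minimal sketch, assuming documents were concatenated directly (no separator bytes) and their byte lengths are known; the names here (to_document_offsets, doc_lengths) are hypothetical, not the text-dedup API:

    # Hedged sketch of step (1); hypothetical names, NOT the real implementation.
    import bisect

    def to_document_offsets(global_ranges, doc_lengths):
        # starts[i] is the global byte offset where document i begins.
        starts = [0]
        for length in doc_lengths:
            starts.append(starts[-1] + length)
        per_doc = {}
        for g_start, g_stop in global_ranges:
            # Locate the document containing the range's first byte.
            doc_id = bisect.bisect_right(starts, g_start) - 1
            # A range spanning several documents would need splitting;
            # for brevity this sketch clamps it to the first document.
            local_start = g_start - starts[doc_id]
            local_stop = min(g_stop, starts[doc_id + 1]) - starts[doc_id]
            per_doc.setdefault(doc_id, []).append((local_start, local_stop))
        return per_doc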
Two merge strategies are supported:
- overlapping: merge any overlapping or adjacent intervals into one (aggressive — removes more text)
- longest: keep all non-contained intervals, only removing substrings that are fully contained within a longer duplicate (conservative — preserves more unique text)
Usage
Use this principle after suffix array self-similarity detection to finalize substring-level deduplication. This is the post-processing stage that converts raw byte ranges into cleaned documents.
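To make the data flow concrete, here is a sketch that composes the three steps using the illustrative helpers on this page (to_document_offsets above, merge_overlapping and remove_intervals under Theoretical Basis below). It is an assumption-laden outline, not the actual text-dedup pipeline:

    # Illustrative composition of steps (1)-(3); NOT the real pipeline.
    def dedup_substrings(docs, global_ranges):
        # docs: list of bytes; global_ranges: (start, stop) pairs in corpus coordinates.
        per_doc = to_document_offsets(global_ranges, [len(d) for d in docs])
        cleaned = []
        for i, doc in enumerate(docs):
            intervals = per_doc.get(i)
            if intervals:
                doc = remove_intervals(doc, merge_overlapping(intervals))
            cleaned.append(doc)
        return cleaned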
Theoretical Basis
Interval merging for the "overlapping" strategy:
    # Abstract interval merging (NOT real implementation)
    def merge_overlapping(intervals):
        # Merge overlapping or adjacent (start, stop) pairs into single intervals.
        if not intervals:
            return []
        sorted_intervals = sorted(intervals, key=lambda iv: iv[0])
        merged = [sorted_intervals[0]]
        for start, stop in sorted_intervals[1:]:
            if start <= merged[-1][1]:
                # Overlaps or touches the previous interval: extend it.
                merged[-1] = (merged[-1][0], max(merged[-1][1], stop))
            else:
                merged.append((start, stop))
        return merged
The "longest" strategy only removes intervals that are fully contained within another.
Text cleaning excises duplicate byte ranges:
    # Abstract text cleaning (NOT real implementation)
    def remove_intervals(text_bytes, intervals):
        # intervals must be sorted, non-overlapping (start, stop) pairs.
        result = bytearray()
        pos = 0
        for start, stop in intervals:
            result.extend(text_bytes[pos:start])  # keep bytes before the duplicate
            pos = stop                            # skip the duplicate span
        result.extend(text_bytes[pos:])  # keep the tail after the last duplicate
        return bytes(result)
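For example, merging two overlapping ranges and excising the result from a short byte string (contrived inputs, not actual suffix-array output):

    text = b"the cat sat on the mat, the cat sat again"
    dups = merge_overlapping([(0, 7), (4, 11)])  # -> [(0, 11)]
    print(remove_intervals(text, dups))          # b" on the mat, the cat sat again"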