Principle:Google research Deduplicate text datasets Duplicate Range Collection
| Knowledge Sources | |
|---|---|
| Domains | Text_Deduplication, Data_Processing |
| Last Updated | 2026-02-14 21:00 GMT |
Overview
An algorithm that merges scattered duplicate position/length pairs into contiguous, non-overlapping byte ranges suitable for removal from the original dataset.
Description
Duplicate range collection is the bridge between raw duplicate detection output and actionable data cleaning. The duplicate detection step (self-similar or across-similar) produces pairs of (position, length) for every detected duplicate across multiple thread partitions. These raw results may overlap, be out of order, and are split across multiple files.
The collection step consolidates all duplicate markers into a single ordered list of non-overlapping byte ranges. It uses a bit-level tracking structure (via the bitvec crate) to mark every byte that participates in a duplicate, then scans this bitvector to extract contiguous ranges. This ensures that overlapping duplicates are merged correctly and no byte is counted twice.
The output is a simple text format listing start/end byte pairs, which can be consumed by the downstream removal step (either finish_single_file.py for flat files or finish_dedup_wiki40b.py for TFDS datasets).
Usage
Use this technique after running either self-similar or across-similar duplicate detection. It is a required intermediate step that transforms raw duplicate pointers into a clean removal specification. The output is deterministic and produces the minimal set of byte ranges that cover all detected duplicates.
Theoretical Basis
The algorithm operates in two phases:
# Abstract algorithm (NOT real implementation)
def collect_ranges(data_file, cache_dir, length_threshold):
# Phase 1: Build a bitvector marking every duplicate byte
bitvec = [False] * len(data_file)
for dup_file, size_file in read_cache_files(cache_dir):
for position, length in zip(dup_file, size_file):
if length >= length_threshold:
for byte_idx in range(position, position + length):
bitvec[byte_idx] = True
# Phase 2: Extract contiguous ranges from bitvector
ranges = []
in_range = False
for i, is_dup in enumerate(bitvec):
if is_dup and not in_range:
start = i
in_range = True
elif not is_dup and in_range:
ranges.append((start, i))
in_range = False
return ranges # List of (start_byte, end_byte) pairs
Key properties:
- Idempotent merging: Overlapping duplicates from different threads are automatically merged via the bitvector.
- Length re-filtering: The length threshold is re-applied during collection, allowing different thresholds to be used without re-running detection.
- Output format: A text stream with an
outheader line followed bystart endpairs, one per line, in ascending order.