Principle:Google research Deduplicate text datasets Duplicate Range Collection

Knowledge Sources	Deduplicating Training Data Makes Language Models Better deduplicate-text-datasets
Domains	Text_Deduplication, Data_Processing
Last Updated	2026-02-14 21:00 GMT

Overview

An algorithm that merges scattered duplicate position/length pairs into contiguous, non-overlapping byte ranges suitable for removal from the original dataset.

Description

Duplicate range collection is the bridge between raw duplicate detection output and actionable data cleaning. The duplicate detection step (self-similar or across-similar) produces pairs of (position, length) for every detected duplicate across multiple thread partitions. These raw results may overlap, be out of order, and are split across multiple files.

The collection step consolidates all duplicate markers into a single ordered list of non-overlapping byte ranges. It uses a bit-level tracking structure (via the bitvec crate) to mark every byte that participates in a duplicate, then scans this bitvector to extract contiguous ranges. This ensures that overlapping duplicates are merged correctly and no byte is counted twice.

The output is a simple text format listing start/end byte pairs, which can be consumed by the downstream removal step (either finish_single_file.py for flat files or finish_dedup_wiki40b.py for TFDS datasets).

Usage

Use this technique after running either self-similar or across-similar duplicate detection. It is a required intermediate step that transforms raw duplicate pointers into a clean removal specification. The output is deterministic and produces the minimal set of byte ranges that cover all detected duplicates.

Theoretical Basis

The algorithm operates in two phases:

# Abstract algorithm (NOT real implementation)
def collect_ranges(data_file, cache_dir, length_threshold):
    # Phase 1: Build a bitvector marking every duplicate byte
    bitvec = [False] * len(data_file)

    for dup_file, size_file in read_cache_files(cache_dir):
        for position, length in zip(dup_file, size_file):
            if length >= length_threshold:
                for byte_idx in range(position, position + length):
                    bitvec[byte_idx] = True

    # Phase 2: Extract contiguous ranges from bitvector
    ranges = []
    in_range = False
    for i, is_dup in enumerate(bitvec):
        if is_dup and not in_range:
            start = i
            in_range = True
        elif not is_dup and in_range:
            ranges.append((start, i))
            in_range = False

    return ranges  # List of (start_byte, end_byte) pairs

Key properties:

Idempotent merging: Overlapping duplicates from different threads are automatically merged via the bitvector.
Length re-filtering: The length threshold is re-applied during collection, allowing different thresholds to be used without re-running detection.
Output format: A text stream with an out header line followed by start end pairs, one per line, in ascending order.

Related Pages

Implemented By

Implementation:Google_research_Deduplicate_text_datasets_Cmd_Collect

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment