Principle:Google research Deduplicate text datasets Byte Range Removal

Knowledge Sources	Deduplicating Training Data Makes Language Models Better deduplicate-text-datasets
Domains	Text_Deduplication, Data_Processing
Last Updated	2026-02-14 21:00 GMT

Overview

A data transformation technique that excises specified byte ranges from a flat binary file to produce a deduplicated output, preserving all non-duplicate content in its original order.

Description

Byte range removal is the final step in the flat-file deduplication pipeline. Given an original binary file and a list of byte ranges to remove (produced by the collect step), it outputs a new file containing only the non-duplicate portions of the original. The algorithm reads the removal file in reverse order (highest offsets first), then processes the original file sequentially, copying non-removed regions and skipping removed ones.

This approach operates at the raw byte level, making it format-agnostic. It works on any flat binary file regardless of the text encoding, tokenization, or separator scheme used during serialization. The only constraint is that the removal file must contain valid, non-overlapping byte ranges in ascending order (as produced by the collect step).

Usage

Use this technique as the final step in single-file deduplication or cross-dataset deduplication when working with flat binary files. For TFDS-structured datasets that need to maintain their example-level schema, use the TFDS deduplication application technique instead.

Theoretical Basis

The algorithm is a simple linear scan with skip regions:

# Abstract algorithm (NOT real implementation)
def remove_byte_ranges(original_file, ranges, output_file):
    # ranges = [(start1, end1), (start2, end2), ...] in ascending order
    # Process from end to start (stack order) for efficient popping
    ranges = reversed(ranges)

    current_pos = 0
    for start, end in ranges:
        # Copy everything from current position up to the start of this range
        output_file.write(original_file.read(start - current_pos))
        # Skip the duplicate range
        original_file.seek(end)
        current_pos = end

    # Copy remaining content after the last range
    output_file.write(original_file.read())

Key properties:

O(N) time complexity: The file is read exactly once in a single pass.
O(1) memory: Only the removal list and current read position need to be in memory; data is streamed.
Order preserving: Non-duplicate content appears in the output in the same order as the original.
Format agnostic: Works on any binary content without needing to understand the data format.

Related Pages

Implemented By

Implementation:Google_research_Deduplicate_text_datasets_Finish_Single_File

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment