Principle:Google research Deduplicate text datasets TFDS Deduplication Application

Knowledge Sources	Deduplicating Training Data Makes Language Models Better TensorFlow Datasets deduplicate-text-datasets
Domains	Text_Deduplication, Data_Processing, NLP
Last Updated	2026-02-14 21:00 GMT

Overview

A data transformation technique that maps byte-level duplicate ranges back to structured dataset examples, applies substring removal within each example, and rebuilds a clean TensorFlow Dataset.

Description

TFDS deduplication application solves the problem of applying byte-level deduplication results back to a structured dataset that has an example-level schema. While the suffix array operates on a flat binary representation, the original dataset consists of discrete examples with metadata fields (text, URL, timestamp, etc.). This technique bridges the gap by:

Using the .size offset file (from the serialization step) to map each byte range to the example it belongs to.
Computing per-example sub-ranges by adjusting byte offsets relative to each example's start position.
Applying substring removal within each example's text field while preserving all other metadata.
Rebuilding a valid TFDS dataset using a custom GeneratorBasedBuilder that produces cleaned examples.
Restructuring the output directory to match the expected TFDS layout (e.g., wiki40b/en/1.3.0/).

This is more complex than flat-file byte removal because the output must be a valid TFDS dataset that can be loaded with tfds.load().

Usage

Use this technique as the final step when deduplicating a TFDS dataset (specifically Wiki40B as demonstrated in the paper). For flat binary files that do not need to maintain a TFDS schema, use byte range removal instead.

Theoretical Basis

The mapping from byte ranges to per-example ranges uses the cumulative size array:

# Abstract algorithm (NOT real implementation)
def map_byte_ranges_to_examples(byte_ranges, sizes):
    """Map global byte ranges to per-example (start, end) pairs."""
    example_ranges = defaultdict(list)

    ptr = 0  # pointer into byte_ranges
    for i, byte_start in enumerate(sizes[:-1]):
        byte_end = sizes[i + 1]

        while ptr < len(byte_ranges) and byte_start <= byte_ranges[ptr][0] < byte_end:
            # Adjust to example-relative offsets, accounting for 6-byte separator
            local_start = max(byte_ranges[ptr][0] - byte_start - 6, 0)
            local_end = min(byte_ranges[ptr][1] - byte_start, byte_end - byte_start)
            example_ranges[i].append((local_start, local_end))
            ptr += 1

    return example_ranges

def apply_dedup_to_example(text, ranges):
    """Remove duplicate substrings from a single example."""
    for start, end in reversed(ranges):
        text = text[:start] + text[end:]
    return text

Key details:

The 6-byte separator offset (\xff\xff + 4-byte UID) must be subtracted when mapping global byte positions to example-local positions.
Ranges are applied in reverse order within each example to avoid invalidating later offsets.
The algorithm uses multiprocessing for parallel example processing.
The output is restructured into the standard TFDS directory layout with merged dataset_info.json.

Related Pages

Implemented By

Implementation:Google_research_Deduplicate_text_datasets_Finish_Dedup_Wiki40b

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment