Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Principle:Google research Deduplicate text datasets TFDS Deduplication Application

From Leeroopedia
Knowledge Sources
Domains Text_Deduplication, Data_Processing, NLP
Last Updated 2026-02-14 21:00 GMT

Overview

A data transformation technique that maps byte-level duplicate ranges back to structured dataset examples, applies substring removal within each example, and rebuilds a clean TensorFlow Dataset.

Description

TFDS deduplication application solves the problem of applying byte-level deduplication results back to a structured dataset that has an example-level schema. While the suffix array operates on a flat binary representation, the original dataset consists of discrete examples with metadata fields (text, URL, timestamp, etc.). This technique bridges the gap by:

  1. Using the .size offset file (from the serialization step) to map each byte range to the example it belongs to.
  2. Computing per-example sub-ranges by adjusting byte offsets relative to each example's start position.
  3. Applying substring removal within each example's text field while preserving all other metadata.
  4. Rebuilding a valid TFDS dataset using a custom GeneratorBasedBuilder that produces cleaned examples.
  5. Restructuring the output directory to match the expected TFDS layout (e.g., wiki40b/en/1.3.0/).

This is more complex than flat-file byte removal because the output must be a valid TFDS dataset that can be loaded with tfds.load().

Usage

Use this technique as the final step when deduplicating a TFDS dataset (specifically Wiki40B as demonstrated in the paper). For flat binary files that do not need to maintain a TFDS schema, use byte range removal instead.

Theoretical Basis

The mapping from byte ranges to per-example ranges uses the cumulative size array:

# Abstract algorithm (NOT real implementation)
def map_byte_ranges_to_examples(byte_ranges, sizes):
    """Map global byte ranges to per-example (start, end) pairs."""
    example_ranges = defaultdict(list)

    ptr = 0  # pointer into byte_ranges
    for i, byte_start in enumerate(sizes[:-1]):
        byte_end = sizes[i + 1]

        while ptr < len(byte_ranges) and byte_start <= byte_ranges[ptr][0] < byte_end:
            # Adjust to example-relative offsets, accounting for 6-byte separator
            local_start = max(byte_ranges[ptr][0] - byte_start - 6, 0)
            local_end = min(byte_ranges[ptr][1] - byte_start, byte_end - byte_start)
            example_ranges[i].append((local_start, local_end))
            ptr += 1

    return example_ranges

def apply_dedup_to_example(text, ranges):
    """Remove duplicate substrings from a single example."""
    for start, end in reversed(ranges):
        text = text[:start] + text[end:]
    return text

Key details:

  • The 6-byte separator offset (\xff\xff + 4-byte UID) must be subtracted when mapping global byte positions to example-local positions.
  • Ranges are applied in reverse order within each example to avoid invalidating later offsets.
  • The algorithm uses multiprocessing for parallel example processing.
  • The output is restructured into the standard TFDS directory layout with merged dataset_info.json.

Related Pages

Implemented By

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment