Principle:Google research Deduplicate text datasets Dataset Serialization TFDS

Knowledge Sources	Deduplicating Training Data Makes Language Models Better; TensorFlow Datasets; deduplicate-text-datasets
Domains	Data_Processing, NLP, Text_Deduplication
Last Updated	2026-02-14 21:00 GMT

Overview

A data transformation technique that converts a structured dataset into a flat binary file with separator-delimited examples, enabling byte-level suffix array construction and deduplication.

Description

Dataset serialization resolves the format mismatch between structured dataset formats (such as TensorFlow Datasets) and byte-level suffix array algorithms. Suffix arrays operate on raw byte sequences, so a dataset must first be serialized into a single contiguous binary file.

The serialization process concatenates all text examples into one flat binary file, inserting unique separator tokens between examples. Each separator consists of a configurable prefix (default \xff\xff) followed by a 4-byte incrementing UID. This separator scheme serves two purposes: (1) it prevents the suffix array from finding spurious matches that span example boundaries, and (2) it enables reverse-mapping byte positions back to their originating examples during the deduplication application step.

Alongside the binary file, a size file is written containing cumulative byte offsets for each example boundary, stored as a uint64 array. This size file is essential for the TFDS deduplication application step, which uses it to map byte-range removals back to individual examples.
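
A minimal sketch of writing and reading such a size file with the Python standard library. The helper names, the file path, and the little-endian byte order are assumptions for illustration; the description above only specifies a uint64 array of cumulative offsets.

```python
import struct

def write_size_file(path, offsets):
    """Write cumulative byte offsets as a flat array of little-endian uint64."""
    with open(path, "wb") as f:
        f.write(struct.pack(f"<{len(offsets)}Q", *offsets))

def read_size_file(path):
    """Read the offsets back; each uint64 value occupies 8 bytes."""
    with open(path, "rb") as f:
        raw = f.read()
    return list(struct.unpack(f"<{len(raw) // 8}Q", raw))
```

A dataset with N examples yields N + 1 offsets, so the size file is 8 * (N + 1) bytes long.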

Optional tokenization support converts text to GPT-2 or T5 token IDs (stored as uint16 packed into bytes), enabling token-level rather than character-level deduplication.
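
The uint16 packing can be sketched as follows. This is an illustration only: no real tokenizer is called, the token IDs are made up, and little-endian byte order is an assumption.

```python
import struct

def pack_token_ids(token_ids):
    """Pack token IDs as little-endian uint16, two bytes per token."""
    if any(not 0 <= t <= 0xFFFF for t in token_ids):
        raise ValueError("token ID does not fit in uint16")
    return struct.pack(f"<{len(token_ids)}H", *token_ids)

def unpack_token_ids(data):
    """Inverse of pack_token_ids."""
    return list(struct.unpack(f"<{len(data) // 2}H", data))

ids = [15496, 995, 50256]  # hypothetical token IDs, not produced by a real tokenizer
packed = pack_token_ids(ids)
```

Each token occupies exactly two bytes, so a byte offset in the serialized file corresponds to a token offset of half that value within an example's content.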

Usage

Use this technique as the first step when deduplicating a TensorFlow Dataset (TFDS). It is required before building a suffix array. For HuggingFace datasets, use the HuggingFace dataset serialization variant instead.

Theoretical Basis

The serialization maps the text of the structured dataset to a flat byte sequence, and the recorded offsets make the mapping reversible:

# Abstract serialization scheme (a sketch, NOT the real implementation)
import struct

def serialize_dataset(dataset, separator_prefix=b"\xff\xff"):
    output = bytearray()
    offsets = [0]  # cumulative byte offsets; offsets[i] is where example i starts
    uid = 0

    for example in dataset:
        uid += 1
        # Unique separator: prefix followed by a 4-byte little-endian UID
        separator = separator_prefix + struct.pack("<I", uid)
        content = example["text"].encode("utf-8")  # or packed token IDs
        output.extend(separator + content)
        offsets.append(len(output))

    return output, offsets
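
To illustrate that the offsets make the serialization reversible, the sketch below recovers each example from a flat buffer. It assumes the layout of the abstract scheme above (a 2-byte prefix plus a 4-byte little-endian UID before each example, UTF-8 content); `deserialize` is a hypothetical helper, not part of the repository.

```python
import struct

SEP_LEN = 6  # assumed: 2-byte prefix + 4-byte UID

def deserialize(data, offsets):
    """Yield (uid, text) for each example delimited by consecutive offsets."""
    for start, end in zip(offsets, offsets[1:]):
        segment = data[start:end]
        (uid,) = struct.unpack("<I", segment[2:SEP_LEN])
        yield uid, segment[SEP_LEN:].decode("utf-8")

# Hand-built buffer holding two examples, "hi" and "yo".
data = (b"\xff\xff" + struct.pack("<I", 1) + b"hi"
        + b"\xff\xff" + struct.pack("<I", 2) + b"yo")
offsets = [0, 8, 16]
examples = list(deserialize(data, offsets))
```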

Separator design rationale:

\xff\xff is chosen because the byte 0xFF never occurs in valid UTF-8, so the prefix cannot appear anywhere in well-formed UTF-8 text.
The 4-byte UID ensures each separator is unique, preventing the suffix array from matching separators themselves as duplicates.
The combination creates a 6-byte unique marker that reliably delimits examples without false matches.
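
The UTF-8 property behind the prefix choice can be checked directly. This is a small illustrative check over a handful of sample strings chosen here, not an exhaustive proof.

```python
prefix = b"\xff\xff"

def is_valid_utf8(data):
    """Return True if the bytes decode as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# Sample strings covering 1-, 2-, 3-, and 4-byte UTF-8 encodings.
samples = ["plain ascii", "na\u00efve caf\u00e9", "\u65e5\u672c\u8a9e", "\U0001F600", "\u00ff"]

# True if the separator prefix shows up inside any encoded sample.
prefix_in_text = any(prefix in s.encode("utf-8") for s in samples)
```

Even the character U+00FF encodes to the two bytes 0xC3 0xBF, so the raw byte 0xFF does not appear in its encoding.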

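Size file format: A flat array of uint64 values where sizes[i] is the cumulative byte offset at the start of example i. The byte range of example i is an O(1) lookup (sizes[i] to sizes[i+1]), and finding which example contains a given byte position takes an O(log n) binary search over the array.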
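
A minimal sketch of mapping a byte position back to its example using the cumulative offsets. It relies on binary search from the standard library's `bisect` module; the function name and the sample offsets are assumptions for illustration.

```python
from bisect import bisect_right

def example_index(offsets, position):
    """Return i such that offsets[i] <= position < offsets[i + 1]."""
    if not 0 <= position < offsets[-1]:
        raise IndexError("position outside the serialized file")
    return bisect_right(offsets, position) - 1

offsets = [0, 8, 16, 30]  # hypothetical cumulative offsets for three examples
```

For instance, with these offsets, byte position 8 is the first byte of the second example, so `example_index(offsets, 8)` returns 1.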

Related Pages

Implemented By

Implementation:Google_research_Deduplicate_text_datasets_Load_Dataset_TFDS
