Principle:Google research Deduplicate text datasets Dataset Serialization HF

Knowledge Sources	Deduplicating Training Data Makes Language Models Better, HuggingFace Datasets, deduplicate-text-datasets
Domains	Data_Processing, NLP, Text_Deduplication
Last Updated	2026-02-14 21:00 GMT

Overview

A data transformation technique that converts a HuggingFace dataset into a flat binary file with separator-delimited examples, enabling byte-level suffix array construction and deduplication.

Description

HuggingFace dataset serialization serves the same role as the TFDS serialization variant but targets datasets from the HuggingFace Hub ecosystem. It uses the datasets library to load a dataset (including local text, JSON, and CSV files) and serializes it into the same flat binary format, in which each example is preceded by a separator made of the bytes \xff\xff followed by a unique ID (UID).

The key differences from the TFDS variant are the data-loading backend and some additional flexibility: a configurable text feature key (default "text"), local file loading via a file-extension mapping (text→.txt, json→.jsonl, csv→.csv), subset selection, and a configurable number of parallel tokenization workers. When tokenization is enabled, only the GPT-2 tokenizer is used, applied through a batched dataset.map() call for efficiency.

Usage

Use this technique as the first step when deduplicating a HuggingFace Hub dataset or local text/JSON/CSV files. It is an alternative to the TFDS variant for datasets not available through TensorFlow Datasets.

Theoretical Basis

The serialization follows the same separator-based scheme as the TFDS variant:

# Sketch of the serialization scheme (NOT the real implementation)
import struct

def serialize_hf_dataset(dataset, text_key="text"):
    output = bytearray()
    offsets = [0]  # cumulative byte offsets; offsets[i] is where example i starts
    uid = 0

    for example in dataset:
        uid += 1
        # Separator: two 0xff bytes followed by the UID as a little-endian uint32
        separator = b"\xff\xff" + struct.pack("<I", uid)
        content = example[text_key].encode("utf-8")
        output.extend(separator + content)
        offsets.append(len(output))

    return output, offsets
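
As a small illustration of how the recorded offsets recover individual examples from the flat buffer, the sketch below builds a two-example buffer following the scheme above and reads one example back. The helper names (serialize, read_example) are hypothetical and are not taken from the actual script; the 4-byte little-endian UID follows the pseudocode's uint32 packing.

```python
import struct

SEPARATOR_PREFIX = b"\xff\xff"           # two-byte marker preceding each UID
HEADER_LEN = len(SEPARATOR_PREFIX) + 4   # marker + 4-byte little-endian UID


def serialize(texts):
    """Build the flat buffer and cumulative offsets for a list of strings."""
    output = bytearray()
    offsets = [0]
    for uid, text in enumerate(texts, start=1):
        output += SEPARATOR_PREFIX + struct.pack("<I", uid) + text.encode("utf-8")
        offsets.append(len(output))
    return bytes(output), offsets


def read_example(buffer, offsets, index):
    """Return (uid, text) for the example stored at position `index`."""
    segment = buffer[offsets[index]:offsets[index + 1]]
    if segment[:2] != SEPARATOR_PREFIX:
        raise ValueError("segment does not start with the separator marker")
    (uid,) = struct.unpack("<I", segment[2:HEADER_LEN])
    return uid, segment[HEADER_LEN:].decode("utf-8")


buffer, offsets = serialize(["hello", "wörld"])
print(offsets)                            # [0, 11, 23]
print(read_example(buffer, offsets, 1))   # (2, 'wörld')
```

Each segment costs six bytes of overhead (two marker bytes plus the UID), and the offsets count bytes rather than characters, which is why the second example occupies 12 bytes.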

Local file support: When the dataset name is one of the local file types (text, json, csv), the script loads local files from the specified data directory using a glob pattern built from the corresponding extension (e.g., /data/*.txt). This enables deduplication of arbitrary text corpora without uploading them to the Hub.
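
The local-file branch could be sketched roughly as follows. The extension mapping comes from the description above; the function names (local_data_files, load) and their parameters are hypothetical, and the actual script may structure this differently. The load_dataset call uses the datasets library's data_files argument, which accepts glob patterns.

```python
import os

# Loader name -> file extension, as described in this page.
EXTENSIONS = {"text": ".txt", "json": ".jsonl", "csv": ".csv"}


def local_data_files(name, data_dir):
    """Return a glob pattern for local files, or None if `name` is not a local type."""
    ext = EXTENSIONS.get(name)
    if ext is None:
        return None
    return os.path.join(data_dir, "*" + ext)


def load(name, data_dir=None, split="train", subset=None):
    """Hypothetical wrapper around load_dataset; requires the `datasets` package."""
    # Imported lazily so the helper above works without the dependency.
    from datasets import load_dataset

    pattern = local_data_files(name, data_dir) if data_dir else None
    if pattern is not None:
        # Local files: the generic text/json/csv loaders read from the glob.
        return load_dataset(name, data_files=pattern, split=split)
    # Hub dataset: `subset` selects the dataset configuration, if any.
    return load_dataset(name, subset, split=split)
```

For example, local_data_files("text", "/data") yields the pattern /data/*.txt on a POSIX system, while a Hub dataset name returns None and falls through to the Hub branch.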

Batched tokenization: When tokenization is enabled, the script uses dataset.map() with batched=True and configurable num_proc workers, which is more efficient than per-example tokenization.
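
A minimal sketch of the batched pattern is shown below. It assumes that token IDs are stored as little-endian uint16 values (GPT-2's vocabulary of 50,257 tokens fits in 16 bits); that storage layout, the "bytes" output column, and the helper names are assumptions made for illustration, not details confirmed by this page.

```python
import struct


def make_batch_encoder(encode_batch, text_key="text"):
    """Wrap a batch tokenizer for use with dataset.map(batched=True).

    `encode_batch` takes a list of strings and returns a list of token-ID lists.
    """
    def encode(batch):
        ids_per_example = encode_batch(batch[text_key])
        # Assumption: pack every token ID as a little-endian uint16.
        packed = [struct.pack(f"<{len(ids)}H", *ids) for ids in ids_per_example]
        return {"bytes": packed}

    return encode


def tokenize_dataset(ds, text_key="text", num_proc=None):
    """Hypothetical driver; requires the `datasets` and `transformers` packages."""
    from transformers import GPT2TokenizerFast

    tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

    def encode_batch(texts):
        # Calling the tokenizer on a list of strings tokenizes the whole batch.
        return tokenizer(texts)["input_ids"]

    encoder = make_batch_encoder(encode_batch, text_key)
    # batched=True passes many examples per call; num_proc sets the worker count.
    return ds.map(encoder, batched=True, num_proc=num_proc)
```

Separating the byte packing from the tokenizer makes the packing step easy to check on its own with a stand-in tokenizer, without downloading GPT-2.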

Related Pages

Implemented By

Implementation:Google_research_Deduplicate_text_datasets_Load_Dataset_HF
