Principle:Google research Deduplicate text datasets Dataset Serialization HF

From Leeroopedia
Revision as of 17:33, 16 February 2026 by Admin (talk | contribs) (Auto-imported from principles/Google_research_Deduplicate_text_datasets_Dataset_Serialization_HF.md)
Knowledge Sources
Domains Data_Processing, NLP, Text_Deduplication
Last Updated 2026-02-14 21:00 GMT

Overview

A data transformation technique that converts a HuggingFace dataset into a flat binary file with separator-delimited examples, enabling byte-level suffix array construction and deduplication.

Description

HuggingFace dataset serialization serves the same role as the TFDS serialization variant but targets datasets from the HuggingFace Hub ecosystem. It uses the datasets library to load datasets (including local text, JSON, and CSV files) and serializes them into the same flat binary format, in which each example is preceded by a separator consisting of the bytes \xff\xff followed by a unique ID (UID).

The key differences from the TFDS variant are the data loading backend and some additional flexibility: a configurable text feature key (default "text"), local file loading via a mapping from loader name to file extension (text → .txt, json → .jsonl, csv → .csv), subset selection, and a configurable number of parallel tokenization workers. When tokenization is enabled, only the GPT-2 tokenizer is supported; it is applied through a batched dataset.map() call for efficiency.

Usage

Use this technique as the first step when deduplicating a HuggingFace Hub dataset or local text/JSON/CSV files. It is an alternative to the TFDS variant for datasets not available through TensorFlow Datasets.

Theoretical Basis

The serialization follows the same separator-based scheme as the TFDS variant:

# Abstract serialization sketch (NOT the real implementation)
import struct

def serialize_hf_dataset(dataset, text_key="text"):
    output = bytearray()
    offsets = [0]  # offsets[i] is the byte position where segment i starts
    uid = 0

    for example in dataset:
        uid += 1
        # Separator: two 0xff bytes plus the UID as a little-endian uint32
        separator = b"\xff\xff" + struct.pack("<I", uid)
        content = example[text_key].encode("utf-8")
        output.extend(separator + content)
        offsets.append(len(output))

    return output, offsets
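
To show how the offsets make the flat buffer navigable, the following minimal sketch reads segments back out of a buffer laid out as described above. It assumes the UID is a 4-byte little-endian integer, as in the abstract serialization; the names `split_segments`, `SEP`, and `HEADER_LEN` are hypothetical and not taken from the actual script.

```python
import struct

SEP = b"\xff\xff"          # two-byte separator prefix described above
HEADER_LEN = len(SEP) + 4  # prefix plus an assumed 4-byte little-endian UID


def split_segments(data, offsets):
    """Yield (uid, text) for each segment delimited by consecutive offsets."""
    for start, end in zip(offsets, offsets[1:]):
        segment = bytes(data[start:end])
        if segment[:2] != SEP:
            raise ValueError(f"segment at byte {start} lacks the separator prefix")
        (uid,) = struct.unpack("<I", segment[2:HEADER_LEN])
        yield uid, segment[HEADER_LEN:].decode("utf-8")


# Build a tiny two-example buffer by hand, then read it back.
texts = ["hello world", "naïve café"]
buffer = bytearray()
offsets = [0]
for uid, text in enumerate(texts, start=1):
    buffer += SEP + struct.pack("<I", uid) + text.encode("utf-8")
    offsets.append(len(buffer))

recovered = list(split_segments(buffer, offsets))
```

Valid UTF-8 never contains the byte 0xff, so the two-byte prefix cannot occur inside encoded text. The UID bytes themselves may contain 0xff, which is one reason this sketch relies on the recorded offsets rather than scanning for the prefix.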

Local file support: When the dataset name matches a known file extension type (text, json, csv), the script loads local files from the specified data directory using glob patterns (e.g., /data/*.txt). This enables deduplication of arbitrary text corpora without uploading to the Hub.
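
The local file lookup can be sketched roughly as follows. This is an illustration under the assumption that the loader name maps to an extension as described above; `EXTENSIONS` and `resolve_local_files` are hypothetical names, not the script's own.

```python
import glob
import os

# Assumed mapping from loader name to file extension, following the description above.
EXTENSIONS = {"text": "txt", "json": "jsonl", "csv": "csv"}


def resolve_local_files(name, data_dir):
    """Return sorted local files for a known loader name, or None otherwise."""
    ext = EXTENSIONS.get(name)
    if ext is None:
        return None  # not a local loader; treat the name as a Hub dataset
    pattern = os.path.join(data_dir, f"*.{ext}")
    return sorted(glob.glob(pattern))
```

A non-empty result could then be passed to the datasets library, for example `datasets.load_dataset(name, data_files=files)`, while a `None` result signals that the name should be resolved on the Hub instead.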

Batched tokenization: When tokenization is enabled, the script uses dataset.map() with batched=True and configurable num_proc workers, which is more efficient than per-example tokenization.
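
A batched map function receives a dictionary of columns (each a list) and returns a dictionary of new columns. The sketch below shows that contract, assuming the transformers package supplies the GPT-2 tokenizer; `tokenize_batch`, `tokenize_dataset`, and the `input_ids` column name are hypothetical choices for illustration, not the script's actual code.

```python
def tokenize_batch(batch, encode, text_key="text"):
    """Batched map function: takes a dict of columns, returns a dict of columns."""
    return {"input_ids": [encode(text) for text in batch[text_key]]}


def tokenize_dataset(dataset, text_key="text", num_proc=4):
    """Sketch of GPT-2 tokenization through a batched, parallel dataset.map()."""
    # Imported lazily so the helper above stays usable without transformers.
    from transformers import GPT2TokenizerFast

    tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

    def fn(batch):
        return tokenize_batch(batch, tokenizer.encode, text_key)

    # batched=True hands fn many rows per call; num_proc splits work across processes.
    return dataset.map(fn, batched=True, num_proc=num_proc)
```

Keeping the per-batch logic in a plain function makes it easy to check with a stand-in encoder before running it over a full dataset.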

Related Pages

Implemented By
