Principle:Google research Deduplicate text datasets Dataset Serialization HF
| Knowledge Sources | |
|---|---|
| Domains | Data_Processing, NLP, Text_Deduplication |
| Last Updated | 2026-02-14 21:00 GMT |
Overview
A data transformation technique that converts a HuggingFace dataset into a flat binary file with separator-delimited examples, enabling byte-level suffix array construction and deduplication.
Description
HuggingFace dataset serialization serves the same role as the TFDS serialization variant but targets datasets from the HuggingFace Hub ecosystem. It uses the datasets library to load datasets (including local text, JSON, and CSV files) and serializes them into the same flat binary format with \xff\xff + UID separators.
The key difference from the TFDS variant is the data loading backend and additional flexibility: it supports a configurable text feature key (default "text"), local file loading via a file extension mapping (text→.txt, json→.jsonl, csv→.csv), subset selection, and configurable parallel tokenization workers. Tokenization uses only the GPT-2 tokenizer via a batched dataset.map() call for efficiency.
Usage
Use this technique as the first step when deduplicating a HuggingFace Hub dataset or local text/JSON/CSV files. It is an alternative to the TFDS variant for datasets not available through TensorFlow Datasets.
Theoretical Basis
The serialization follows the same separator-based scheme as the TFDS variant:
# Abstract serialization (NOT real implementation)
def serialize_hf_dataset(dataset, text_key="text"):
output = bytearray()
offsets = [0]
uid = 0
for example in dataset:
uid += 1
separator = b"\xff\xff" + pack_uint32_le(uid)
content = example[text_key].encode("utf-8")
segment = separator + content
output.extend(segment)
offsets.append(len(output))
return output, offsets
Local file support: When the dataset name matches a known file extension type (text, json, csv), the script loads local files from the specified data directory using glob patterns (e.g., /data/*.txt). This enables deduplication of arbitrary text corpora without uploading to the Hub.
Batched tokenization: When tokenization is enabled, the script uses dataset.map() with batched=True and configurable num_proc workers, which is more efficient than per-example tokenization.