Principle:Google research Deduplicate text datasets Dataset Serialization TFDS
| Knowledge Sources | |
|---|---|
| Domains | Data_Processing, NLP, Text_Deduplication |
| Last Updated | 2026-02-14 21:00 GMT |
Overview
A data transformation technique that converts a structured dataset into a flat binary file with separator-delimited examples, enabling byte-level suffix array construction and deduplication.
Description
Dataset serialization solves the fundamental format mismatch between structured dataset formats (such as TensorFlow Datasets) and the byte-level suffix array algorithms. Suffix arrays operate on raw byte sequences, so datasets must first be serialized into a single contiguous binary file.
The serialization process concatenates all text examples into one flat binary file, inserting unique separator tokens between examples. Each separator consists of a configurable prefix (default \xff\xff) followed by a 4-byte incrementing UID. This separator scheme serves two purposes: (1) it prevents the suffix array from finding spurious matches that span example boundaries, and (2) it enables reverse-mapping byte positions back to their originating examples during the deduplication application step.
Alongside the binary file, a size file is written containing cumulative byte offsets for each example boundary, stored as a uint64 array. This size file is essential for the TFDS deduplication application step, which uses it to map byte-range removals back to individual examples.
Optional tokenization support converts text to GPT-2 or T5 token IDs (stored as uint16 packed into bytes), enabling token-level rather than character-level deduplication.
Usage
Use this technique as the first step when deduplicating a TensorFlow Dataset (TFDS). It is required before building a suffix array. For HuggingFace datasets, use the HuggingFace dataset serialization variant instead.
Theoretical Basis
The serialization creates a bijection between the structured dataset and a flat byte sequence:
# Abstract serialization scheme (NOT real implementation)
def serialize_dataset(dataset, separator_prefix=b"\xff\xff"):
output = bytearray()
offsets = [0]
uid = 0
for example in dataset:
uid += 1
separator = separator_prefix + pack_uint32_le(uid)
content = encode(example["text"]) # UTF-8 or tokenized
segment = separator + content
output.extend(segment)
offsets.append(len(output))
return output, offsets
Separator design rationale:
\xff\xffis chosen because it is an invalid UTF-8 sequence, so it cannot appear in well-formed text data.- The 4-byte UID ensures each separator is unique, preventing the suffix array from matching separators themselves as duplicates.
- The combination creates a 6-byte unique marker that reliably delimits examples without false matches.
Size file format: A flat array of uint64 values where sizes[i] is the cumulative byte offset at the start of example i. This enables O(1) lookup of which example a given byte position belongs to.