Workflow:Google research Deduplicate text datasets Wiki40B TFDS deduplication
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, NLP, Deduplication |
| Last Updated | 2026-02-14 21:00 GMT |
Overview
End-to-end process for deduplicating a TensorFlow Datasets (TFDS) dataset such as Wiki40B, from loading through suffix array construction to producing a clean deduplicated TFDS output.
Description
This workflow implements the canonical deduplication pipeline described in the paper. It loads a dataset from TensorFlow Datasets, serializes all examples into a single flat binary file with separator tokens and unique IDs, builds a suffix array, identifies all repeated substrings, collects the duplicate byte ranges, and then maps those byte ranges back to individual examples to produce a new deduplicated TFDS dataset. The separator tokens prevent false cross-example matches in the suffix array.
Key characteristics:
- Full pipeline from TFDS input to TFDS output
- Separator-based serialization prevents cross-example suffix array matches
- Size file tracks byte offsets for mapping back to individual examples
- Output is a valid TFDS dataset that can be loaded normally
- Optional GPT-2 or T5 tokenization to reduce dataset size by ~2x
- Supports iterative deduplication (run twice to catch newly-created duplicates)
Usage
Execute this workflow when you have a TensorFlow Datasets dataset (particularly Wiki40B) and need to produce a deduplicated version as a valid TFDS dataset. This is the primary workflow demonstrated in the paper and is suitable for datasets that are loaded via the TFDS API. For HuggingFace datasets, substitute the loading step with the HuggingFace loader. For arbitrary single files, use the Single File Deduplication workflow instead.
Execution Steps
Step 1: Load and serialize dataset
Load the dataset from TensorFlow Datasets and serialize all text examples into a single flat binary file. Each example is prefixed with a two-byte separator marker (0xFF 0xFF) followed by a 4-byte unique identifier to prevent the suffix array from finding spurious cross-example matches. A companion size file records the byte offset where each example begins.
Key considerations:
- The separator/UID scheme ensures the suffix array never matches across example boundaries
- Optional tokenization (GPT-2 or T5) can halve the file size, but the length threshold must be doubled to compensate
- The size file is essential for the final step to map byte ranges back to individual examples
- Uses multiprocessing for parallel tokenization when enabled
Step 2: Build suffix array
Construct a suffix array for the serialized binary file using parallel chunked construction. The file is split into overlapping chunks, suffix arrays are built independently for each chunk, and the partial arrays are merged into a single complete suffix array.
Key considerations:
- The number of parallel jobs scales automatically with file size
- Chunks overlap by 100,000 bytes to handle matches spanning chunk boundaries
- For large files, increase the open file limit with ulimit before running
- Variable-width pointers minimize disk usage based on file size
Step 3: Find self-similar duplicates
Scan the suffix array to identify all substrings of at least the configured length threshold that appear more than once anywhere in the dataset. The algorithm streams through the suffix array in parallel, comparing adjacent entries to find clusters of identical substrings.
Key considerations:
- The paper uses a length threshold of 100 bytes (50 tokens) for untokenized text
- Output is written to a cache directory as dups (byte offsets) and sizes (cluster sizes) files
- The number of threads should match available CPU cores for optimal performance
Step 4: Collect duplicate ranges
Merge the individual duplicate pointers into consolidated byte ranges [a, b) representing contiguous regions where every sub-window of threshold length is duplicated. Overlapping and adjacent regions are combined.
Key considerations:
- Output is a text file of (start_byte, end_byte) pairs, one per line
- This step reduces millions of individual duplicate pointers to a manageable list of byte ranges
Step 5: Apply deduplication to TFDS dataset
Map the byte ranges back to individual dataset examples using the size file from Step 1. For each affected example, excise the duplicate byte ranges from its text content. Rebuild the dataset as a valid TFDS dataset with the same schema as the original, ready for normal use.
Key considerations:
- The size file maps global byte offsets back to per-example local offsets
- A 6-byte adjustment accounts for the separator/UID prefix per example
- The output TFDS dataset preserves the original schema (text, version_id, wikidata_id, etc.)
- Currently only Wiki40B output formatting is implemented; other datasets require custom finish scripts
- Running the full pipeline again on the output typically reduces remaining duplicates to near zero