Implementation:Google research Deduplicate text datasets Finish Dedup Wiki40b
| Knowledge Sources | |
|---|---|
| Domains | Text_Deduplication, Data_Processing, NLP |
| Last Updated | 2026-02-14 21:00 GMT |
Overview
A concrete tool, provided by the deduplicate-text-datasets repository, for applying byte-level deduplication results back to a Wiki40B TensorFlow Datasets (TFDS) dataset.
Description
The finish_dedup_wiki40b.py script performs the final step of the Wiki40B deduplication pipeline. It maps global byte-range removals to per-example substring ranges using the .size offset file, applies those removals to each example's text field, and rebuilds a valid TFDS dataset using a custom MyDataset(tfds.core.GeneratorBasedBuilder). After generation, it restructures the output directory to match Wiki40B's expected TFDS layout (wiki40b/en/1.3.0/) and merges dataset_info.json metadata across splits.
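The range-mapping step described above can be illustrated with a minimal sketch. It is not the script's actual code: the function names `map_ranges_to_examples` and `remove_ranges` are hypothetical, the offsets are assumed to be cumulative start positions with one trailing entry for the total length, and removal ranges are assumed to be half-open `[start, end)` pairs. Any separator or prefix bytes that serialization may place between examples are ignored here.

```python
from bisect import bisect_right
from collections import defaultdict


def map_ranges_to_examples(offsets, ranges):
    """Convert global [start, end) byte ranges into per-example local ranges.

    offsets: cumulative start offsets; offsets[-1] is the total byte length.
    ranges:  iterable of (start, end) pairs in global byte coordinates.
    (Hypothetical helper; layout of the real .size file is an assumption.)
    """
    per_example = defaultdict(list)
    for start, end in ranges:
        # Index of the example whose byte span contains `start`.
        i = bisect_right(offsets, start) - 1
        # Walk forward in case the range spills into later examples.
        while i < len(offsets) - 1 and offsets[i] < end:
            lo = max(start, offsets[i]) - offsets[i]
            hi = min(end, offsets[i + 1]) - offsets[i]
            if lo < hi:
                per_example[i].append((lo, hi))
            i += 1
    return dict(per_example)


def remove_ranges(data, local_ranges):
    """Return `data` (bytes) with every local [start, end) range cut out."""
    kept, cursor = [], 0
    for start, end in sorted(local_ranges):
        kept.append(data[cursor:start])  # empty if ranges overlap
        cursor = max(cursor, end)
    kept.append(data[cursor:])
    return b"".join(kept)


# Two toy "examples" laid out back to back, 11 bytes each.
examples = [b"hello world", b"foo bar baz"]
offsets = [0, 11, 22]
local = map_ranges_to_examples(offsets, [(6, 11), (15, 19)])
cleaned = [remove_ranges(ex, local.get(i, [])) for i, ex in enumerate(examples)]
```

Working in bytes rather than decoded strings keeps the local ranges aligned with the byte offsets that the suffix-array tools report.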
The script uses multiprocessing to parallelize the per-example deduplication and processes the original dataset in batches of 2^16 (65,536) examples.
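One way to picture the batching and parallelism is the sketch below. It assumes each unit of work is a `(bytes, ranges)` pair; the names `batched`, `cut`, and `dedup_in_batches` are hypothetical and the worker count is arbitrary. The real script reads its batches from TFDS rather than from a Python iterable.

```python
import multiprocessing as mp
from itertools import islice

BATCH_SIZE = 2 ** 16  # 65,536 examples per batch, matching the description


def batched(iterable, size=BATCH_SIZE):
    """Yield successive lists of at most `size` items."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def cut(job):
    """Worker: drop the given local [start, end) ranges from one example."""
    data, ranges = job
    out, cursor = [], 0
    for start, end in sorted(ranges):
        out.append(data[cursor:start])
        cursor = max(cursor, end)
    out.append(data[cursor:])
    return b"".join(out)


def dedup_in_batches(jobs, processes=4, size=BATCH_SIZE):
    """Run `cut` over all jobs one batch at a time, preserving input order."""
    with mp.Pool(processes) as pool:
        for batch in batched(jobs, size):
            yield from pool.map(cut, batch)
```

`dedup_in_batches` is a generator, so the pool is created on first iteration and closed once the last batch has been consumed. Call it from under an `if __name__ == "__main__":` guard on platforms that spawn worker processes.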
Usage
Use this script as the final step in the Wiki40B TFDS deduplication workflow. It requires three inputs: the original TFDS dataset, the .size file produced by the serialization step, and the byte-range removal file produced by the collect step. Note: the script is currently hardcoded for the Wiki40B dataset schema; other TFDS datasets would require modifying it.
Code Reference
Source Location
- Repository: deduplicate-text-datasets
- File: scripts/finish_dedup_wiki40b.py (L1-199)
Signature
python3 scripts/finish_dedup_wiki40b.py \
--data_dir <original_tfds_directory> \
--save_dir <output_directory> \
--name <dataset_name> \
--split <split_name> \
--suffixarray_dir <directory_containing_size_file> \
--remove <byte_range_removal_file>
Import
# CLI script, not importable as a library.
python3 scripts/finish_dedup_wiki40b.py --data_dir <dir> --save_dir <dir> --name <name> --split <split> --suffixarray_dir <dir> --remove <file>
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| --data_dir | str | Yes | Original TFDS data directory (for loading the source dataset) |
| --save_dir | str | Yes | Output directory for the deduplicated TFDS dataset |
| --name | str | Yes | Dataset name (e.g., "wiki40b") |
| --split | str | Yes | Split name (e.g., "test", "train") |
| --suffixarray_dir | str | Yes | Directory containing the .size file from the serialization step |
| --remove | str | Yes | Path to the byte-range removal file (output of collect step) |
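The table above can be mirrored as an `argparse` parser. This is a sketch built only from the table, not a copy of the script's own parser: the `build_parser` name, the help strings, and the use of `required=True` are assumptions.

```python
import argparse


def build_parser():
    """Hypothetical parser reflecting the Inputs table (all flags are str)."""
    parser = argparse.ArgumentParser(
        description="Apply byte-range removals to a TFDS split")
    flags = [
        ("--data_dir", "Original TFDS data directory"),
        ("--save_dir", "Output directory for the deduplicated dataset"),
        ("--name", "Dataset name, e.g. wiki40b"),
        ("--split", "Split name, e.g. test or train"),
        ("--suffixarray_dir", "Directory containing the .size file"),
        ("--remove", "Path to the byte-range removal file"),
    ]
    for flag, help_text in flags:
        # Marked required here because the table lists every input as required.
        parser.add_argument(flag, type=str, required=True, help=help_text)
    return parser


# Parse an explicit argument list instead of sys.argv for illustration.
args = build_parser().parse_args([
    "--data_dir", "/data/tensorflow_datasets",
    "--save_dir", "/data/output",
    "--name", "wiki40b",
    "--split", "test",
    "--suffixarray_dir", "/data/sa",
    "--remove", "/tmp/wiki40b.test.remove.byterange",
])
```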
Outputs
| Name | Type | Description |
|---|---|---|
| <save_dir>_dedup/<name>/<lang>/<version>/ | TFDS directory | Deduplicated TFDS dataset with cleaned examples, compatible with tfds.load() |
| dataset_info.json | JSON file | Merged metadata across splits within the output directory |
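The output path pattern in the table appends `_dedup` to the `--save_dir` value itself rather than creating a subdirectory. A small sketch of that composition follows; the helper name `dedup_output_dir` is hypothetical, and stripping a trailing slash is a choice made for this sketch that the script may not share.

```python
from pathlib import PurePosixPath


def dedup_output_dir(save_dir, name, lang, version):
    """Compose <save_dir>_dedup/<name>/<lang>/<version> (hypothetical helper)."""
    # Strip a trailing slash so "/data/output/" does not become "/data/output/_dedup".
    base = save_dir.rstrip("/") + "_dedup"
    return PurePosixPath(base) / name / lang / version


out_dir = dedup_output_dir("/data/output", "wiki40b", "en", "1.3.0")
```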
Usage Examples
Deduplicate Wiki40B Test Split
# Prerequisites:
# 1. Wiki40B serialized: /data/sa/wiki40b.test and /data/sa/wiki40b.test.size
# 2. Byte ranges collected: /tmp/wiki40b.test.remove.byterange
python3 scripts/finish_dedup_wiki40b.py \
--data_dir /data/tensorflow_datasets \
--save_dir /data/output \
--name wiki40b \
--split test \
--suffixarray_dir /data/sa \
--remove /tmp/wiki40b.test.remove.byterange
# Output: /data/output_dedup/wiki40b/en/1.3.0/
# Load with: tfds.load("wiki40b", data_dir="/data/output_dedup")
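Before launching the command above, it can help to confirm that the prerequisite files are in place. The sketch below assumes the `<name>.<split>` and `<name>.<split>.size` naming shown in the prerequisites comment; `missing_prerequisites` is a hypothetical helper, not part of the repository.

```python
import os


def missing_prerequisites(data_dir, suffixarray_dir, name, split, remove_path):
    """Return the required paths that do not exist yet (hypothetical helper)."""
    stem = os.path.join(suffixarray_dir, f"{name}.{split}")
    required = [
        data_dir,        # original TFDS directory
        stem,            # serialized dataset from the serialization step
        stem + ".size",  # per-example offset file
        remove_path,     # byte ranges from the collect step
    ]
    return [path for path in required if not os.path.exists(path)]


if __name__ == "__main__":
    missing = missing_prerequisites(
        "/data/tensorflow_datasets", "/data/sa", "wiki40b", "test",
        "/tmp/wiki40b.test.remove.byterange")
    for path in missing:
        print("missing:", path)
```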
Full Wiki40B Pipeline
# Step 1: Serialize
python3 scripts/load_dataset.py \
--data_dir /data/tensorflow_datasets \
--save_dir /data/sa \
--name wiki40b \
--split test
# Step 2: Build suffix array
python3 scripts/make_suffix_array.py /data/sa/wiki40b.test
# Step 3: Find self-similar duplicates
./target/debug/dedup_dataset self-similar \
--data-file /data/sa/wiki40b.test \
--length-threshold 100 \
--cache-dir /tmp/cache \
--num-threads 8
# Step 4: Collect byte ranges
./target/debug/dedup_dataset collect \
--data-file /data/sa/wiki40b.test \
--cache-dir /tmp/cache \
--length-threshold 100 > /tmp/wiki40b.test.remove.byterange
# Step 5: Apply deduplication to TFDS
python3 scripts/finish_dedup_wiki40b.py \
--data_dir /data/tensorflow_datasets \
--save_dir /data/output \
--name wiki40b \
--split test \
--suffixarray_dir /data/sa \
--remove /tmp/wiki40b.test.remove.byterange
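The five steps above can also be assembled programmatically, which keeps the shared paths consistent between steps. The sketch below only builds the same argument lists as the shell commands; `pipeline_commands` and `run_pipeline` are hypothetical names, and the defaults for the threshold and thread count are taken from the example rather than from the tools.

```python
import subprocess


def pipeline_commands(data_dir, sa_dir, save_dir, name, split,
                      cache_dir, remove_file, threshold=100, threads=8):
    """Return [(argv, stdout_path_or_None), ...] for the five steps."""
    data_file = f"{sa_dir}/{name}.{split}"
    dedup = "./target/debug/dedup_dataset"
    return [
        (["python3", "scripts/load_dataset.py", "--data_dir", data_dir,
          "--save_dir", sa_dir, "--name", name, "--split", split], None),
        (["python3", "scripts/make_suffix_array.py", data_file], None),
        ([dedup, "self-similar", "--data-file", data_file,
          "--length-threshold", str(threshold), "--cache-dir", cache_dir,
          "--num-threads", str(threads)], None),
        # Step 4 writes its byte ranges to stdout, so capture it in a file.
        ([dedup, "collect", "--data-file", data_file,
          "--cache-dir", cache_dir,
          "--length-threshold", str(threshold)], remove_file),
        (["python3", "scripts/finish_dedup_wiki40b.py", "--data_dir", data_dir,
          "--save_dir", save_dir, "--name", name, "--split", split,
          "--suffixarray_dir", sa_dir, "--remove", remove_file], None),
    ]


def run_pipeline(steps):
    """Run each step in order, stopping at the first failure."""
    for argv, stdout_path in steps:
        if stdout_path is None:
            subprocess.run(argv, check=True)
        else:
            with open(stdout_path, "w") as out:
                subprocess.run(argv, check=True, stdout=out)


steps = pipeline_commands(
    "/data/tensorflow_datasets", "/data/sa", "/data/output",
    "wiki40b", "test", "/tmp/cache", "/tmp/wiki40b.test.remove.byterange")
```

Calling `run_pipeline(steps)` from the repository root would execute the steps sequentially; it is left uncalled here because it needs the built binary and the dataset on disk.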