

Implementation:Google research Deduplicate text datasets Finish Dedup Wiki40b

From Leeroopedia
Knowledge Sources
Domains: Text_Deduplication, Data_Processing, NLP
Last Updated: 2026-02-14 21:00 GMT

Overview

Concrete tool, provided by the deduplicate-text-datasets repository, for applying byte-level deduplication results back to a Wiki40B TensorFlow Datasets (TFDS) dataset.

Description

The finish_dedup_wiki40b.py script performs the final step of the Wiki40B deduplication pipeline. It maps global byte-range removals to per-example substring ranges using the .size offset file, applies those removals to each example's text field, and rebuilds a valid TFDS dataset using a custom MyDataset(tfds.core.GeneratorBasedBuilder). After generation, it restructures the output directory to match Wiki40B's expected TFDS layout (wiki40b/en/1.3.0/) and merges dataset_info.json metadata across splits.
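The mapping from global byte ranges to per-example ranges can be sketched as follows. This is a minimal illustration, not the script's actual code: the function names, the half-open (start, end) range convention, and the header_len parameter (bytes of per-example prefix written by the serialization step, if any) are assumptions made for the example.

```python
from bisect import bisect_right


def map_ranges_to_examples(offsets, ranges, header_len=0):
    """Convert global byte ranges into per-example local ranges.

    offsets: cumulative byte offsets, one per example plus a final end offset
             (assumed layout of the .size data).
    ranges: iterable of half-open (start, end) global byte ranges.
    header_len: hypothetical number of prefix bytes stored before each example.
    """
    per_example = {}
    for start, end in ranges:
        # Find the example whose byte span contains `start`.
        idx = bisect_right(offsets, start) - 1
        if idx < 0 or idx >= len(offsets) - 1:
            continue  # range starts outside the serialized data
        ex_start = offsets[idx]
        ex_len = offsets[idx + 1] - ex_start - header_len
        # Shift to example-local coordinates and clip to the example body,
        # so a range that crosses a boundary only affects this example.
        local_start = max(start - ex_start - header_len, 0)
        local_end = min(end - ex_start - header_len, ex_len)
        if local_start < local_end:
            per_example.setdefault(idx, []).append((local_start, local_end))
    return per_example


def remove_ranges(data, local_ranges):
    """Return `data` (bytes) with the given local byte ranges cut out."""
    out = bytearray()
    prev = 0
    for start, end in sorted(local_ranges):
        out += data[prev:start]  # empty when ranges overlap
        prev = max(prev, end)
    out += data[prev:]
    return bytes(out)
```

For two serialized examples b"hello" and b"goodbye" with offsets [0, 5, 12], the global ranges (1, 3) and (7, 10) map to local range (1, 3) in the first example and (2, 5) in the second.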

The script uses multiprocessing to parallelize the per-example deduplication and processes the original dataset in batches of 2^16 examples.
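The batching pattern described above might look roughly like this. The helper names (batched, process_in_batches) and the worker argument are hypothetical; only the batch size of 2^16 and the use of multiprocessing come from the description.

```python
from itertools import islice
from multiprocessing import Pool

BATCH_SIZE = 2 ** 16  # 65,536 examples per batch, as stated above


def batched(iterable, size=BATCH_SIZE):
    """Yield successive lists of at most `size` items from `iterable`."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def process_in_batches(examples, worker, size=BATCH_SIZE, processes=None):
    """Apply `worker` to every example, one batch at a time, in a process pool.

    `worker` must be a picklable top-level function (hypothetical here);
    results are yielded in the original example order.
    """
    with Pool(processes) as pool:
        for batch in batched(examples, size):
            yield from pool.map(worker, batch)
```

Processing one batch at a time keeps only a bounded number of examples in memory while still letting the pool spread the per-example work across cores.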

Usage

Use this script as the final step in the Wiki40B TFDS deduplication workflow. It requires the original TFDS dataset, the .size file from the serialization step, and the byte-range removal file from the collect step. Note: this script is currently hardcoded for the Wiki40B dataset schema; other TFDS datasets would require modification.
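To inspect the two auxiliary inputs before running the script, a reader along these lines may help. It is a sketch under stated assumptions: that the .size file is a flat array of unsigned 64-bit little-endian integers holding cumulative byte offsets, and that the removal file contains one "start end" pair of decimal byte offsets per line, possibly preceded by non-numeric header lines. Both function names are hypothetical.

```python
import struct


def read_size_file(path):
    """Read cumulative byte offsets from a .size file (assumed uint64, little-endian)."""
    with open(path, "rb") as f:
        raw = f.read()
    count = len(raw) // 8  # 8 bytes per unsigned 64-bit integer
    return list(struct.unpack(f"<{count}Q", raw[: count * 8]))


def read_remove_file(path):
    """Read (start, end) byte ranges, skipping any line that is not two integers."""
    ranges = []
    with open(path) as f:
        for line in f:
            parts = line.split()
            if len(parts) == 2 and all(p.isdigit() for p in parts):
                ranges.append((int(parts[0]), int(parts[1])))
    return ranges
```

If the offsets are not non-decreasing, or a range end exceeds the last offset, the files most likely belong to different splits or datasets.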

Code Reference

Source Location

Signature

python3 scripts/finish_dedup_wiki40b.py \
    --data_dir <original_tfds_directory> \
    --save_dir <output_directory> \
    --name <dataset_name> \
    --split <split_name> \
    --suffixarray_dir <directory_containing_size_file> \
    --remove <byte_range_removal_file>

Import

# CLI script, not importable as a library.
python3 scripts/finish_dedup_wiki40b.py --data_dir <dir> --save_dir <dir> --name <name> --split <split> --suffixarray_dir <dir> --remove <file>

I/O Contract

Inputs

Name | Type | Required | Description
--data_dir | str | Yes | Original TFDS data directory (for loading the source dataset)
--save_dir | str | Yes | Output directory for the deduplicated TFDS dataset
--name | str | Yes | Dataset name (e.g., "wiki40b")
--split | str | Yes | Split name (e.g., "test", "train")
--suffixarray_dir | str | Yes | Directory containing the .size file from the serialization step
--remove | str | Yes | Path to the byte-range removal file (output of the collect step)
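A command-line interface with these six flags could be declared with argparse as sketched below. The flag names and their required status are taken from the table; the build_parser helper and the help strings are illustrative, and the script's actual argument definitions may differ.

```python
import argparse


def build_parser():
    """Build a parser for the six flags listed in the Inputs table."""
    parser = argparse.ArgumentParser(
        description="Apply byte-range removals to a Wiki40B TFDS dataset."
    )
    flags = {
        "--data_dir": "Original TFDS data directory",
        "--save_dir": "Output directory for the deduplicated dataset",
        "--name": 'Dataset name, e.g. "wiki40b"',
        "--split": 'Split name, e.g. "test" or "train"',
        "--suffixarray_dir": "Directory containing the .size file",
        "--remove": "Path to the byte-range removal file",
    }
    for flag, help_text in flags.items():
        parser.add_argument(flag, type=str, required=True, help=help_text)
    return parser
```

Calling build_parser().parse_args() with the arguments from the Signature section yields a namespace with attributes data_dir, save_dir, name, split, suffixarray_dir, and remove.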

Outputs

Name | Type | Description
<save_dir>_dedup/<name>/<lang>/<version>/ | TFDS directory | Deduplicated TFDS dataset with cleaned examples, compatible with tfds.load()
dataset_info.json | JSON file | Merged metadata across splits within the output directory
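One way to picture the metadata merge is the sketch below. It assumes that dataset_info.json holds a top-level "splits" list whose entries each carry a "name" key, and that merging means keeping one entry per split name; the merge_dataset_info helper is hypothetical and the script's real merge logic may handle more fields.

```python
import json


def merge_dataset_info(existing, new):
    """Merge two dataset_info dictionaries, keeping one entry per split name.

    Entries from `new` replace entries of the same name in `existing`;
    all other top-level fields are taken from `existing`.
    """
    merged = dict(existing)
    by_name = {split["name"]: split for split in existing.get("splits", [])}
    for split in new.get("splits", []):
        by_name[split["name"]] = split
    merged["splits"] = [by_name[name] for name in sorted(by_name)]
    return merged


def merge_dataset_info_files(existing_path, new_path, out_path):
    """Load two dataset_info.json files, merge them, and write the result."""
    with open(existing_path) as f:
        existing = json.load(f)
    with open(new_path) as f:
        new = json.load(f)
    with open(out_path, "w") as f:
        json.dump(merge_dataset_info(existing, new), f, indent=2)
```

Running the script once per split and merging in this way would leave a single dataset_info.json that lists every processed split.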

Usage Examples

Deduplicate Wiki40B Test Split

# Prerequisites:
# 1. Wiki40B serialized: /data/sa/wiki40b.test and /data/sa/wiki40b.test.size
# 2. Byte ranges collected: /tmp/wiki40b.test.remove.byterange

python3 scripts/finish_dedup_wiki40b.py \
    --data_dir /data/tensorflow_datasets \
    --save_dir /data/output \
    --name wiki40b \
    --split test \
    --suffixarray_dir /data/sa \
    --remove /tmp/wiki40b.test.remove.byterange

# Output: /data/output_dedup/wiki40b/en/1.3.0/
# Load with: tfds.load("wiki40b", data_dir="/data/output_dedup")

Full Wiki40B Pipeline

# Step 1: Serialize
python3 scripts/load_dataset.py \
    --data_dir /data/tensorflow_datasets \
    --save_dir /data/sa \
    --name wiki40b \
    --split test

# Step 2: Build suffix array
python3 scripts/make_suffix_array.py /data/sa/wiki40b.test

# Step 3: Find self-similar duplicates
./target/debug/dedup_dataset self-similar \
    --data-file /data/sa/wiki40b.test \
    --length-threshold 100 \
    --cache-dir /tmp/cache \
    --num-threads 8

# Step 4: Collect byte ranges
./target/debug/dedup_dataset collect \
    --data-file /data/sa/wiki40b.test \
    --cache-dir /tmp/cache \
    --length-threshold 100 > /tmp/wiki40b.test.remove.byterange

# Step 5: Apply deduplication to TFDS
python3 scripts/finish_dedup_wiki40b.py \
    --data_dir /data/tensorflow_datasets \
    --save_dir /data/output \
    --name wiki40b \
    --split test \
    --suffixarray_dir /data/sa \
    --remove /tmp/wiki40b.test.remove.byterange
