Implementation:Google research Deduplicate text datasets Finish Dedup Wiki40b

Knowledge Sources	deduplicate-text-datasets, TensorFlow Datasets, Deduplicating Training Data Makes Language Models Better
Domains	Text_Deduplication, Data_Processing, NLP
Last Updated	2026-02-14 21:00 GMT

Overview

Concrete tool, provided by the deduplicate-text-datasets repository, for applying byte-level deduplication results back to a Wiki40B TensorFlow Dataset.

Description

The finish_dedup_wiki40b.py script performs the final step of the Wiki40B deduplication pipeline. Using the .size offset file, it maps global byte-range removals to per-example substring ranges, applies those removals to each example's text field, and rebuilds a valid TFDS dataset with a custom builder, MyDataset, that subclasses tfds.core.GeneratorBasedBuilder. After generation, it restructures the output directory to match Wiki40B's expected TFDS layout (wiki40b/en/1.3.0/) and merges dataset_info.json metadata across splits.
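
The range-mapping step can be illustrated with a minimal sketch. It is not the script's actual code: the function names map_ranges_to_examples and remove_ranges are hypothetical, the offsets are assumed to be cumulative byte positions (one more entry than there are examples), the removal ranges are assumed sorted and non-overlapping, and any separator bytes the serializer may place between examples are ignored for simplicity.

```python
from collections import defaultdict


def map_ranges_to_examples(offsets, global_ranges):
    """Translate global (start, end) byte ranges into per-example local ranges.

    offsets: cumulative byte offsets, len(offsets) == number of examples + 1
             (assumed layout of the offset data; hypothetical here).
    global_ranges: sorted, non-overlapping half-open (start, end) pairs.
    """
    per_example = defaultdict(list)
    ptr = 0
    for i in range(len(offsets) - 1):
        ex_start, ex_end = offsets[i], offsets[i + 1]
        while ptr < len(global_ranges) and global_ranges[ptr][0] < ex_end:
            g_start, g_end = global_ranges[ptr]
            # Clip the global range to this example and shift to local coordinates.
            lo = max(g_start, ex_start) - ex_start
            hi = min(g_end, ex_end) - ex_start
            if lo < hi:
                per_example[i].append((lo, hi))
            if g_end > ex_end:
                break  # the range continues into the next example
            ptr += 1
    return dict(per_example)


def remove_ranges(data, ranges):
    """Return data with every half-open (lo, hi) byte range cut out."""
    kept = []
    prev = 0
    for lo, hi in ranges:
        kept.append(data[prev:lo])
        prev = hi
    kept.append(data[prev:])
    return b"".join(kept)


offsets = [0, 10, 25, 30]                 # three examples of 10, 15 and 5 bytes
ranges = [(2, 5), (8, 12), (26, 30)]      # global byte ranges to remove
local = map_ranges_to_examples(offsets, ranges)
cleaned = remove_ranges(b"0123456789", local[0])
```

In this toy input the range (8, 12) straddles the boundary between the first two examples, so it is split into (8, 10) for the first example and (0, 2) for the second.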

The script uses multiprocessing to parallelize the per-example deduplication and processes the original dataset in batches of 2^16 (65,536) examples.
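
A rough sketch of this batching pattern follows. The helper names (dedup_example, batched, dedup_all) and the processes parameter are assumptions made for illustration; the script's own worker function and pool size are not shown here.

```python
import multiprocessing as mp
from itertools import islice

BATCH_SIZE = 2 ** 16  # batch size stated in the description above


def dedup_example(args):
    """Cut the given half-open byte ranges out of one example's text."""
    text, ranges = args
    kept = []
    prev = 0
    for lo, hi in ranges:
        kept.append(text[prev:lo])
        prev = hi
    kept.append(text[prev:])
    return b"".join(kept)


def batched(iterable, size):
    """Yield successive lists of at most `size` items."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def dedup_all(examples, processes=4, batch_size=BATCH_SIZE):
    """Process (text, ranges) pairs batch by batch across a worker pool."""
    with mp.Pool(processes) as pool:
        for batch in batched(examples, batch_size):
            # pool.map preserves input order, so examples stay aligned.
            yield from pool.map(dedup_example, batch)
```

Because the workers are separate processes, dedup_all should be driven from under an `if __name__ == "__main__":` guard, for example `list(dedup_all(pairs, processes=2))`. Working one batch at a time bounds memory use while still keeping every worker busy.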

Usage

Use this script as the final step in the Wiki40B TFDS deduplication workflow. It requires the original TFDS dataset, the .size file from the serialization step, and the byte-range removal file from the collect step. Note: this script is currently hardcoded for the Wiki40B dataset schema; other TFDS datasets would require modification.

Code Reference

Source Location

Repository: deduplicate-text-datasets
File: scripts/finish_dedup_wiki40b.py (L1-199)

Signature

python3 scripts/finish_dedup_wiki40b.py \
    --data_dir <original_tfds_directory> \
    --save_dir <output_directory> \
    --name <dataset_name> \
    --split <split_name> \
    --suffixarray_dir <directory_containing_size_file> \
    --remove <byte_range_removal_file>

Import

# CLI script, not importable as a library.
python3 scripts/finish_dedup_wiki40b.py --data_dir <dir> --save_dir <dir> --name <name> --split <split> --suffixarray_dir <dir> --remove <file>

I/O Contract

Inputs

Name	Type	Required	Description
--data_dir	str	Yes	Original TFDS data directory (for loading the source dataset)
--save_dir	str	Yes	Output directory for the deduplicated TFDS dataset
--name	str	Yes	Dataset name (e.g., "wiki40b")
--split	str	Yes	Split name (e.g., "test", "train")
--suffixarray_dir	str	Yes	Directory containing the .size file from the serialization step
--remove	str	Yes	Path to the byte-range removal file (output of collect step)

Outputs

Name	Type	Description
<save_dir>_dedup/<name>/<lang>/<version>/	TFDS directory	Deduplicated TFDS dataset with cleaned examples, compatible with tfds.load()
dataset_info.json	JSON file	Merged metadata across splits within the output directory
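
The metadata merge can be pictured with the sketch below. It assumes that each dataset_info.json carries a top-level "splits" list whose entries have a "name" key; that layout, the function name merge_dataset_info, and the sample values are illustrative assumptions rather than the script's actual logic.

```python
def merge_dataset_info(base, other):
    """Combine the split entries of two dataset_info dictionaries.

    Entries from `other` replace entries in `base` that share the same split
    name. All other top-level fields are taken from `base`.
    """
    merged = dict(base)  # shallow copy; base["splits"] itself is not mutated
    by_name = {s["name"]: s for s in base.get("splits", [])}
    for split in other.get("splits", []):
        by_name[split["name"]] = split
    merged["splits"] = [by_name[name] for name in sorted(by_name)]
    return merged


# Hypothetical metadata for two separately generated splits.
info_test = {"name": "wiki40b", "splits": [{"name": "test", "shardLengths": ["10"]}]}
info_train = {"name": "wiki40b", "splits": [{"name": "train", "shardLengths": ["90"]}]}
merged = merge_dataset_info(info_test, info_train)
```

In practice the two dictionaries would be read with json.load from each split's output directory and the merged result written back with json.dump.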

Usage Examples

Deduplicate Wiki40B Test Split

# Prerequisites:
# 1. Wiki40B serialized: /data/sa/wiki40b.test and /data/sa/wiki40b.test.size
# 2. Byte ranges collected: /tmp/wiki40b.test.remove.byterange

python3 scripts/finish_dedup_wiki40b.py \
    --data_dir /data/tensorflow_datasets \
    --save_dir /data/output \
    --name wiki40b \
    --split test \
    --suffixarray_dir /data/sa \
    --remove /tmp/wiki40b.test.remove.byterange

# Output: /data/output_dedup/wiki40b/en/1.3.0/
# Load with: tfds.load("wiki40b", data_dir="/data/output_dedup")

Full Wiki40B Pipeline

# Step 1: Serialize
python3 scripts/load_dataset.py \
    --data_dir /data/tensorflow_datasets \
    --save_dir /data/sa \
    --name wiki40b \
    --split test

# Step 2: Build suffix array
python3 scripts/make_suffix_array.py /data/sa/wiki40b.test

# Step 3: Find self-similar duplicates
./target/debug/dedup_dataset self-similar \
    --data-file /data/sa/wiki40b.test \
    --length-threshold 100 \
    --cache-dir /tmp/cache \
    --num-threads 8

# Step 4: Collect byte ranges
./target/debug/dedup_dataset collect \
    --data-file /data/sa/wiki40b.test \
    --cache-dir /tmp/cache \
    --length-threshold 100 > /tmp/wiki40b.test.remove.byterange

# Step 5: Apply deduplication to TFDS
python3 scripts/finish_dedup_wiki40b.py \
    --data_dir /data/tensorflow_datasets \
    --save_dir /data/output \
    --name wiki40b \
    --split test \
    --suffixarray_dir /data/sa \
    --remove /tmp/wiki40b.test.remove.byterange

Related Pages

Implements Principle

Principle:Google_research_Deduplicate_text_datasets_TFDS_Deduplication_Application

Requires Environment

Environment:Google_research_Deduplicate_text_datasets_Python_TFDS_Environment
