Implementation:Google research Deduplicate text datasets Cmd Across Similar

Knowledge Sources	deduplicate-text-datasets, Deduplicating Training Data Makes Language Models Better
Domains	Text_Deduplication, String_Algorithms, NLP
Last Updated	2026-02-14 21:00 GMT

Overview

Concrete tool, provided by the deduplicate-text-datasets Rust CLI, for finding substrings shared between two datasets.

Description

The across-similar subcommand of dedup_dataset takes two data files (each with a precomputed suffix array) and finds all substrings of at least the threshold length that appear in both datasets. It performs a coordinated merge-walk over both suffix arrays, streaming them from disk with TableStream, and parallelizes the work across a configurable number of threads using crossbeam. Output is bidirectional: dups_S1_*/sizes_S1_* record duplicates in dataset 1 that also occur in dataset 2, and dups_S2_*/sizes_S2_* record the reverse.
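The merge-walk idea can be illustrated with a toy, in-memory sketch. This is not the Rust implementation: it builds naive suffix arrays, runs single-threaded, and does not stream from disk. The function names (suffix_array, across_similar) are hypothetical and chosen only for this illustration.

```python
def suffix_array(text: bytes) -> list[int]:
    """Naive suffix array: start offsets ordered by the suffix they begin."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def across_similar(text1: bytes, text2: bytes, threshold: int):
    """Return (offsets in text1, offsets in text2) whose next `threshold`
    bytes also occur somewhere in the other text."""
    sa1, sa2 = suffix_array(text1), suffix_array(text2)
    hits1, hits2 = [], []
    i = j = 0
    while i < len(sa1) and j < len(sa2):
        suf1, suf2 = text1[sa1[i]:], text2[sa2[j]:]
        key = suf1[:threshold]
        if len(key) == threshold and suf2[:threshold] == key:
            # Suffixes sharing the same prefix are contiguous in a suffix
            # array, so consume the whole run on both sides.
            while i < len(sa1) and text1[sa1[i]:sa1[i] + threshold] == key:
                hits1.append(sa1[i])
                i += 1
            while j < len(sa2) and text2[sa2[j]:sa2[j] + threshold] == key:
                hits2.append(sa2[j])
                j += 1
        elif suf1 < suf2:
            # No match: advance whichever suffix sorts first, as in a merge.
            i += 1
        else:
            j += 1
    return sorted(hits1), sorted(hits2)


if __name__ == "__main__":
    in_first, in_second = across_similar(
        b"the cat sat on the mat", b"a cat sat down", 5
    )
    print(in_first, in_second)
```

The two returned lists play the role of the S1 and S2 outputs: one set of positions per dataset, each describing where shared content begins.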

Usage

Use this subcommand to detect content shared between two datasets (e.g., train/test overlap). Both datasets must already be serialized to flat binary files, and a suffix array must already be built for each. To find duplicates within a single file, use self-similar instead.

Code Reference

Source Location

Repository: deduplicate-text-datasets
File: src/main.rs (L137-148 for CLI args, L985-1211 for cmd_across_similar)

Signature

dedup_dataset across-similar \
    --data-file-1 <path1> \
    --data-file-2 <path2> \
    --length-threshold <n> \
    --cache-dir <dir> \
    [--num-threads <n>]

// Internal Rust function signature
fn cmd_across_similar(
    data_file_1: String,
    data_file_2: String,
    cache_dir: String,
    length_threshold: usize,
    num_threads: i64,  // default: 8
)

Import

# CLI tool, not importable. Requires prior build:
cargo build
# Then invoke:
./target/debug/dedup_dataset across-similar --data-file-1 <path1> --data-file-2 <path2> ...
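When the command is driven from a script, a thin wrapper can assemble the argument list. The sketch below uses only the flags listed in the signature above; the helper names (across_similar_argv, run_across_similar) and the default binary path are assumptions made for this example, not part of the repository.

```python
import subprocess


def across_similar_argv(data_file_1, data_file_2, length_threshold, cache_dir,
                        num_threads=None,
                        binary="./target/debug/dedup_dataset"):
    """Build the argv list for `dedup_dataset across-similar`."""
    argv = [
        binary, "across-similar",
        "--data-file-1", str(data_file_1),
        "--data-file-2", str(data_file_2),
        "--length-threshold", str(length_threshold),
        "--cache-dir", str(cache_dir),
    ]
    # --num-threads is optional; omit it to keep the CLI default.
    if num_threads is not None:
        argv += ["--num-threads", str(num_threads)]
    return argv


def run_across_similar(*args, **kwargs):
    """Run the command and raise CalledProcessError on a non-zero exit."""
    subprocess.run(across_similar_argv(*args, **kwargs), check=True)
```

Passing a list rather than a shell string avoids quoting problems when dataset paths contain spaces.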

I/O Contract

Inputs

Name	Type	Required	Description
data_file_1	String	Yes	Path to the first dataset (must have .table.bin alongside)
data_file_2	String	Yes	Path to the second dataset (must have .table.bin alongside)
length_threshold	usize	Yes	Minimum shared substring length in bytes
cache_dir	String	Yes	Directory for output duplicate/sizes files
num_threads	i64	No	Thread count (default 8)
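Because both data files must have a .table.bin file alongside them, a quick pre-flight check can catch a missing suffix array before the command is launched. This is a minimal sketch; missing_inputs is a hypothetical helper, and it assumes the suffix array path is the data file path with .table.bin appended, as in the table above.

```python
from pathlib import Path


def missing_inputs(data_file_1, data_file_2):
    """Return the required input paths that do not exist on disk."""
    missing = []
    for data_file in (data_file_1, data_file_2):
        data_path = Path(data_file)
        # Assumed layout: <data_file>.table.bin sits next to <data_file>.
        table_path = Path(str(data_path) + ".table.bin")
        for path in (data_path, table_path):
            if not path.is_file():
                missing.append(str(path))
    return missing
```

An empty list means both datasets and both suffix arrays were found.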

Outputs

Name	Type	Description
dups_S1_*	Binary files (zstd)	Positions in dataset 1 of substrings found in dataset 2
sizes_S1_*	Binary files (zstd)	Lengths of dataset-1 duplicates
dups_S2_*	Binary files (zstd)	Positions in dataset 2 of substrings found in dataset 1
sizes_S2_*	Binary files (zstd)	Lengths of dataset-2 duplicates

Usage Examples

Find Train/Test Overlap

# Prerequisites: both datasets serialized and suffix arrays built
# /data/train.bin + /data/train.bin.table.bin
# /data/test.bin  + /data/test.bin.table.bin

# Find substrings >= 50 bytes shared between train and test
./target/debug/dedup_dataset across-similar \
    --data-file-1 /data/train.bin \
    --data-file-2 /data/test.bin \
    --length-threshold 50 \
    --cache-dir /tmp/cross_cache \
    --num-threads 16

# Output:
# /tmp/cross_cache/dups_S1_train.bin_*   (train dups found in test)
# /tmp/cross_cache/dups_S2_test.bin_*    (test dups found in train)
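After a run, it can help to confirm that all four output groups were written. The sketch below only lists file names by prefix and does not parse their contents. It assumes the dups_S1_, sizes_S1_, dups_S2_ and sizes_S2_ prefixes described on this page; group_outputs is a hypothetical helper.

```python
from pathlib import Path

# Prefixes as described on this page; adjust if the cache dir differs.
PREFIXES = ("dups_S1_", "sizes_S1_", "dups_S2_", "sizes_S2_")


def group_outputs(cache_dir):
    """Map each output prefix to the sorted file names found in cache_dir."""
    cache = Path(cache_dir)
    return {
        prefix: sorted(p.name for p in cache.glob(prefix + "*"))
        for prefix in PREFIXES
    }
```

An empty list for any prefix suggests the run did not finish or wrote to a different cache directory.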

Full Cross-Dataset Pipeline

# 1. Serialize both datasets
python3 scripts/load_dataset_hf.py --save_dir /data/sa --name c4 --split train --subset en
python3 scripts/load_dataset_hf.py --save_dir /data/sa --name c4 --split validation --subset en

# 2. Build suffix arrays for both
python3 scripts/make_suffix_array.py /data/sa/c4.train
python3 scripts/make_suffix_array.py /data/sa/c4.validation

# 3. Find across-similar duplicates
./target/debug/dedup_dataset across-similar \
    --data-file-1 /data/sa/c4.train \
    --data-file-2 /data/sa/c4.validation \
    --length-threshold 100 \
    --cache-dir /tmp/cross_cache \
    --num-threads 8

# 4. Collect ranges (from the dataset you want to clean)
./target/debug/dedup_dataset collect \
    --data-file /data/sa/c4.train \
    --cache-dir /tmp/cross_cache \
    --length-threshold 100 > /tmp/remove_train.byterange

# 5. Remove duplicates
python3 scripts/finish_single_file.py \
    /data/sa/c4.train \
    /tmp/remove_train.byterange \
    /data/sa/c4.train.deduped
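Conceptually, the final step drops a set of byte ranges from the data file. The sketch below shows that idea on in-memory bytes with half-open (start, end) ranges. It is an illustration under those assumptions, not a description of what finish_single_file.py does internally or of the .byterange file format; remove_ranges is a hypothetical name.

```python
def remove_ranges(data: bytes, ranges):
    """Return `data` with every half-open (start, end) byte range removed."""
    # Merge overlapping or touching ranges so each byte is dropped once.
    merged = []
    for start, end in sorted(ranges):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])

    # Copy the gaps between the merged ranges.
    kept = bytearray()
    cursor = 0
    for start, end in merged:
        kept += data[cursor:start]
        cursor = end
    kept += data[cursor:]
    return bytes(kept)


if __name__ == "__main__":
    print(remove_ranges(b"0123456789", [(2, 4), (3, 6), (8, 9)]))
```

Merging first matters because duplicate ranges reported for neighbouring positions often overlap.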

Related Pages

Implements Principle

Principle:Google_research_Deduplicate_text_datasets_Cross_Dataset_Duplicate_Detection

Requires Environment

Environment:Google_research_Deduplicate_text_datasets_Rust_Cargo_Build_Environment
