Implementation:Google research Deduplicate text datasets Cmd Across Similar
| Knowledge Sources | |
|---|---|
| Domains | Text_Deduplication, String_Algorithms, NLP |
| Last Updated | 2026-02-14 21:00 GMT |
Overview
A concrete tool, provided by the deduplicate-text-datasets Rust CLI, for finding substrings shared between two datasets.
Description
The across-similar subcommand of dedup_dataset takes two data files (each with a precomputed suffix array) and finds every substring that appears in both datasets and is at least as long as the length threshold. It performs a coordinated merge-walk over both suffix arrays, streaming them from disk with TableStream, and parallelizes the work across a configurable number of threads using crossbeam. Output is bidirectional: dups_S1_* and sizes_S1_* describe duplicates in dataset 1 that also occur in dataset 2, and dups_S2_* and sizes_S2_* describe the reverse.
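The merge-walk described above can be illustrated with a small in-memory sketch. This is not the Rust implementation: it builds naive suffix arrays instead of streaming precomputed ones, runs on a single thread, and returns only start positions (no sizes). The function names `suffix_array` and `across_similar` are chosen here for illustration.

```python
def suffix_array(data: bytes) -> list[int]:
    """Naive suffix array: start offsets sorted by the suffix they begin."""
    return sorted(range(len(data)), key=lambda i: data[i:])


def across_similar(text1: bytes, text2: bytes, threshold: int):
    """Return start offsets in each text whose next `threshold` bytes
    also occur somewhere in the other text."""
    sa1, sa2 = suffix_array(text1), suffix_array(text2)
    dups1, dups2 = [], []
    i = j = 0
    while i < len(sa1) and j < len(sa2):
        suf1 = text1[sa1[i]:]
        suf2 = text2[sa2[j]:]
        key = suf1[:threshold]
        if len(key) == threshold and suf2[:threshold] == key:
            # Suffixes sharing the same `threshold`-byte prefix are contiguous
            # in a sorted suffix array, so consume the whole run on both sides.
            while i < len(sa1) and text1[sa1[i]:sa1[i] + threshold] == key:
                dups1.append(sa1[i])
                i += 1
            while j < len(sa2) and text2[sa2[j]:sa2[j] + threshold] == key:
                dups2.append(sa2[j])
                j += 1
        elif suf1 < suf2:
            i += 1  # the smaller suffix cannot match anything further on
        else:
            j += 1
    return sorted(dups1), sorted(dups2)


if __name__ == "__main__":
    print(across_similar(b"the quick brown fox", b"a quick brown dog", 6))
```

Advancing whichever suffix is lexicographically smaller keeps the two cursors aligned in sorted order, which is why one pass over each array suffices.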
Usage
Use this subcommand to detect content shared between two datasets (e.g., train/test overlap). Both datasets must already be serialized to flat binary files, and a suffix array must already be built for each. To find duplicates within a single file, use self-similar instead.
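Because the subcommand expects a suffix array next to each data file, a quick pre-flight check can catch a missing prerequisite before a long run starts. The helper below is a hypothetical convenience, not part of the repository. It assumes only the `.table.bin` suffix convention listed in the Inputs table.

```python
import os


def missing_prerequisites(*data_files: str) -> list[str]:
    """Return every expected path (data file or its suffix array) that is absent."""
    missing = []
    for data_file in data_files:
        # Each dataset needs the flat binary file and its suffix array alongside.
        for path in (data_file, data_file + ".table.bin"):
            if not os.path.isfile(path):
                missing.append(path)
    return missing
```

An empty list means both files are in place. Otherwise the returned paths show which serialization or suffix-array step still has to run.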
Code Reference
Source Location
- Repository: deduplicate-text-datasets
- File: src/main.rs (L137-148 for CLI args, L985-1211 for cmd_across_similar)
Signature
dedup_dataset across-similar \
--data-file-1 <path1> \
--data-file-2 <path2> \
--length-threshold <n> \
--cache-dir <dir> \
[--num-threads <n>]
// Internal Rust function signature
fn cmd_across_similar(
data_file_1: String,
data_file_2: String,
cache_dir: String,
length_threshold: usize,
num_threads: i64, // default: 8
)
Import
# CLI tool, not importable. Requires prior build:
cargo build
# Then invoke:
./target/debug/dedup_dataset across-similar --data-file-1 <path1> --data-file-2 <path2> ...
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| data_file_1 | String | Yes | Path to the first dataset (must have .table.bin alongside) |
| data_file_2 | String | Yes | Path to the second dataset (must have .table.bin alongside) |
| length_threshold | usize | Yes | Minimum shared substring length in bytes |
| cache_dir | String | Yes | Directory for output duplicate/sizes files |
| num_threads | i64 | No | Thread count (default 8) |
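When the tool is driven from a script, the inputs above map directly onto command-line flags. The following sketch assembles the argument list. The function name `across_similar_argv` and the default binary path are assumptions for illustration, while the flag names are the ones shown in the Signature section.

```python
import shlex


def across_similar_argv(
    data_file_1: str,
    data_file_2: str,
    length_threshold: int,
    cache_dir: str,
    num_threads: int = 8,  # mirrors the documented default
    binary: str = "./target/debug/dedup_dataset",  # assumed debug-build location
) -> list[str]:
    """Build the argv list for an across-similar invocation."""
    return [
        binary, "across-similar",
        "--data-file-1", data_file_1,
        "--data-file-2", data_file_2,
        "--length-threshold", str(length_threshold),
        "--cache-dir", cache_dir,
        "--num-threads", str(num_threads),
    ]


if __name__ == "__main__":
    argv = across_similar_argv("/data/train.bin", "/data/test.bin", 50, "/tmp/cross_cache")
    print(shlex.join(argv))
```

Passing the list to `subprocess.run(argv, check=True)` would launch the tool without going through a shell, which avoids quoting problems in paths.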
Outputs
| Name | Type | Description |
|---|---|---|
| dups_S1_* | Binary files (zstd) | Positions in dataset 1 of substrings found in dataset 2 |
| sizes_S1_* | Binary files (zstd) | Lengths of dataset-1 duplicates |
| dups_S2_* | Binary files (zstd) | Positions in dataset 2 of substrings found in dataset 1 |
| sizes_S2_* | Binary files (zstd) | Lengths of dataset-2 duplicates |
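After a run, it can be handy to see which output files landed in the cache directory for each direction. The sketch below groups file names by the four prefixes listed above. It does not read the files, and `group_outputs` is a hypothetical helper name.

```python
from pathlib import Path

# Prefixes taken from the Outputs table; everything after them is left opaque.
PREFIXES = ("dups_S1_", "sizes_S1_", "dups_S2_", "sizes_S2_")


def group_outputs(cache_dir: str) -> dict[str, list[str]]:
    """Map each output prefix to the sorted file names that start with it."""
    groups: dict[str, list[str]] = {prefix: [] for prefix in PREFIXES}
    for path in sorted(Path(cache_dir).iterdir()):
        for prefix in PREFIXES:
            if path.name.startswith(prefix):
                groups[prefix].append(path.name)
    return groups
```

An empty list for one direction after a successful run simply means no file with that prefix was written to the directory.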
Usage Examples
Find Train/Test Overlap
# Prerequisites: both datasets serialized and suffix arrays built
# /data/train.bin + /data/train.bin.table.bin
# /data/test.bin + /data/test.bin.table.bin
# Find substrings >= 50 bytes shared between train and test
./target/debug/dedup_dataset across-similar \
--data-file-1 /data/train.bin \
--data-file-2 /data/test.bin \
--length-threshold 50 \
--cache-dir /tmp/cross_cache \
--num-threads 16
# Output (each dups_* file has a matching sizes_* file):
# /tmp/cross_cache/dups_S1_train.bin_* (train dups found in test)
# /tmp/cross_cache/dups_S2_test.bin_* (test dups found in train)
Full Cross-Dataset Pipeline
# 1. Serialize both datasets
python3 scripts/load_dataset_hf.py --save_dir /data/sa --name c4 --split train --subset en
python3 scripts/load_dataset_hf.py --save_dir /data/sa --name c4 --split validation --subset en
# 2. Build suffix arrays for both
python3 scripts/make_suffix_array.py /data/sa/c4.train
python3 scripts/make_suffix_array.py /data/sa/c4.validation
# 3. Find across-similar duplicates
./target/debug/dedup_dataset across-similar \
--data-file-1 /data/sa/c4.train \
--data-file-2 /data/sa/c4.validation \
--length-threshold 100 \
--cache-dir /tmp/cross_cache \
--num-threads 8
# 4. Collect ranges (from the dataset you want to clean)
./target/debug/dedup_dataset collect \
--data-file /data/sa/c4.train \
--cache-dir /tmp/cross_cache \
--length-threshold 100 > /tmp/remove_train.byterange
# 5. Remove duplicates
python3 scripts/finish_single_file.py \
/data/sa/c4.train \
/tmp/remove_train.byterange \
/data/sa/c4.train.deduped
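Conceptually, the final step drops the collected byte ranges from the data file. The sketch below shows that idea on an in-memory byte string. It assumes half-open `(start, end)` ranges and is not the logic of finish_single_file.py. `merge_ranges` and `remove_ranges` are illustrative names.

```python
def merge_ranges(ranges):
    """Coalesce overlapping or touching half-open (start, end) ranges."""
    merged = []
    for start, end in sorted(ranges):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(r) for r in merged]


def remove_ranges(data: bytes, ranges) -> bytes:
    """Return `data` with every byte inside any of the ranges removed."""
    out = bytearray()
    cursor = 0
    for start, end in merge_ranges(ranges):
        out += data[cursor:start]  # keep the gap before this range
        cursor = end               # skip the duplicated span
    out += data[cursor:]           # keep the tail after the last range
    return bytes(out)


if __name__ == "__main__":
    print(remove_ranges(b"0123456789", [(2, 4), (3, 6), (8, 9)]))
```

Merging first matters because duplicate spans often overlap. Without it, the same bytes could be skipped twice or a kept region emitted out of order.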