Implementation:Google research Deduplicate text datasets Cmd Self Similar
| Knowledge Sources | |
|---|---|
| Domains | Text_Deduplication, String_Algorithms, NLP |
| Last Updated | 2026-02-14 21:00 GMT |
Overview
Concrete tool, provided by the deduplicate-text-datasets Rust CLI, for finding all repeated substrings within a single dataset.
Description
The self-similar subcommand of dedup_dataset scans a suffix array to find all pairs of positions whose suffixes share a common prefix of at least the length threshold. It streams the suffix array from disk (via TableStream with buffered memory-mapped I/O), holds the data file in memory, and parallelizes the scan across a configurable number of threads using crossbeam scoped threads. Each thread writes its results to zstd-compressed output files containing duplicate positions (dups_*) and their lengths (sizes_*).
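The scan described above can be illustrated with a minimal Python sketch. It is not the tool's Rust code: it builds a naive in-memory suffix array (the real tool streams a prebuilt one from disk) and compares each suffix with its predecessor in sorted order. The function names are hypothetical, and treating the threshold as inclusive is an assumption.

```python
def build_suffix_array(data: bytes) -> list[int]:
    """Naive suffix array: start offsets sorted by the suffix they begin."""
    return sorted(range(len(data)), key=lambda i: data[i:])


def common_prefix_len(data: bytes, a: int, b: int) -> int:
    """Length of the longest common prefix of the suffixes at a and b."""
    n = 0
    while a + n < len(data) and b + n < len(data) and data[a + n] == data[b + n]:
        n += 1
    return n


def find_adjacent_duplicates(data: bytes, length_threshold: int):
    """Return (pos_a, pos_b, shared_len) for neighbouring suffixes that
    share at least length_threshold bytes (inclusive threshold assumed)."""
    sa = build_suffix_array(data)
    hits = []
    for prev, cur in zip(sa, sa[1:]):
        shared = common_prefix_len(data, prev, cur)
        if shared >= length_threshold:
            hits.append((prev, cur, shared))
    return hits


print(find_adjacent_duplicates(b"banana", 2))  # [(3, 1, 3), (4, 2, 2)]
```

Only neighbouring entries need to be compared because, in sorted order, the suffix sharing the longest prefix with a given suffix is always one of its immediate neighbours.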
Usage
Use this subcommand after building a suffix array with make_suffix_array.py. It is the standard duplicate-finding step for single-file deduplication workflows. For cross-dataset deduplication, use across-similar instead.
Code Reference
Source Location
- Repository: deduplicate-text-datasets
- File: src/main.rs (L122-135 for CLI args, L849-960 for cmd_self_similar)
Signature
dedup_dataset self-similar \
--data-file <path> \
--length-threshold <n> \
--cache-dir <dir> \
[--num-threads <n>] \
[--frequency-threshold <n>] \
[--only-save-one]
// Internal Rust function signature
fn cmd_self_similar(
data_file: String,
length_threshold: usize,
frequency_threshold: usize, // default: 0
only_save_one: bool, // default: false
cache_dir: String,
num_threads: i64, // default: 8
)
Import
# CLI tool, not importable. Requires prior build:
cargo build
# Then invoke:
./target/debug/dedup_dataset self-similar --data-file <path> ...
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| data_file | String | Yes | Path to the flat binary data file (must have <data_file>.table.bin alongside) |
| length_threshold | usize | Yes | Minimum duplicate substring length in bytes |
| cache_dir | String | Yes | Directory for output duplicate/sizes files |
| num_threads | i64 | No | Thread count (default 8) |
| frequency_threshold | usize | No | Minimum frequency for a duplicate to be reported (default 0) |
| only_save_one | bool | No | Save only one copy of each duplicate (default false) |
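The input contract above can be sketched as a small command builder, for example when driving the CLI from a Python script. The helper names are hypothetical; the flags and the default binary path are the ones listed on this page, and optional flags are omitted unless explicitly set so that the CLI defaults apply.

```python
from typing import List, Optional


def suffix_array_path(data_file: str) -> str:
    """Path where the suffix array is expected, per the data_file row."""
    return data_file + ".table.bin"


def build_self_similar_command(
    data_file: str,
    length_threshold: int,
    cache_dir: str,
    num_threads: Optional[int] = None,
    frequency_threshold: Optional[int] = None,
    only_save_one: bool = False,
    binary: str = "./target/debug/dedup_dataset",
) -> List[str]:
    """Assemble the argv list for the self-similar subcommand."""
    cmd = [
        binary, "self-similar",
        "--data-file", data_file,
        "--length-threshold", str(length_threshold),
        "--cache-dir", cache_dir,
    ]
    if num_threads is not None:
        cmd += ["--num-threads", str(num_threads)]
    if frequency_threshold is not None:
        cmd += ["--frequency-threshold", str(frequency_threshold)]
    if only_save_one:
        cmd.append("--only-save-one")
    return cmd


print(" ".join(build_self_similar_command("/data/wiki40b.test", 100, "/tmp/cache", num_threads=16)))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`. Checking that `suffix_array_path(data_file)` exists beforehand is a cheap way to fail early if the suffix array has not been built yet.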
Outputs
| Name | Type | Description |
|---|---|---|
| dups_<name>_<i>-<j> | Binary files (zstd) | Duplicate byte positions per thread chunk, as compressed u64 arrays (e.g. dups_wiki40b.test_0-N) |
| sizes_<name>_<i>-<j> | Binary files (zstd) | Duplicate lengths per thread chunk, as compressed u64 arrays (e.g. sizes_wiki40b.test_0-N) |
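As a rough illustration of the output layout, the sketch below decodes a buffer of packed u64 values and pairs positions with lengths. It makes several assumptions that this page does not confirm: the bytes have already been decompressed, the integers are little-endian, and the i-th position in a dups file corresponds to the i-th length in the matching sizes file. The function names are hypothetical.

```python
import struct


def decode_u64_array(raw: bytes) -> list:
    """Decode a buffer of packed 8-byte unsigned integers.

    Little-endian byte order is an assumption, not a documented fact.
    """
    if len(raw) % 8 != 0:
        raise ValueError("buffer length is not a multiple of 8 bytes")
    count = len(raw) // 8
    return list(struct.unpack(f"<{count}Q", raw))


def pair_positions_with_lengths(dups_raw: bytes, sizes_raw: bytes) -> list:
    """Zip decoded positions with decoded lengths (index-wise pairing assumed)."""
    positions = decode_u64_array(dups_raw)
    lengths = decode_u64_array(sizes_raw)
    if len(positions) != len(lengths):
        raise ValueError("dups and sizes arrays differ in length")
    return list(zip(positions, lengths))


# Synthetic stand-ins for one decompressed dups/sizes pair.
dups_raw = struct.pack("<2Q", 1024, 90000)
sizes_raw = struct.pack("<2Q", 150, 212)
print(pair_positions_with_lengths(dups_raw, sizes_raw))  # [(1024, 150), (90000, 212)]
```

In the documented workflow the collect subcommand consumes these files directly, so manual decoding like this is mainly useful for inspection and debugging.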
Usage Examples
Find Duplicates in a Single File
# Prerequisites: data file and suffix array exist
# e.g., /data/wiki40b.test and /data/wiki40b.test.table.bin
# Find all repeated substrings >= 100 bytes
./target/debug/dedup_dataset self-similar \
--data-file /data/wiki40b.test \
--length-threshold 100 \
--cache-dir /tmp/cache \
--num-threads 16
# Output: /tmp/cache/dups_wiki40b.test_0-N, /tmp/cache/sizes_wiki40b.test_0-N, etc.
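The `0-N` suffix in the output names suggests that each thread handles one contiguous range of the input. The sketch below shows one way such ranges and file names could be derived. The even split with ceiling division, the half-open ranges, and the helper names are all assumptions made for illustration; the boundaries the tool actually chooses are not documented on this page.

```python
import os


def chunk_ranges(total: int, num_threads: int) -> list:
    """Split [0, total) into at most num_threads contiguous half-open ranges."""
    if total <= 0 or num_threads <= 0:
        return []
    step = -(-total // num_threads)  # ceiling division
    return [(start, min(start + step, total)) for start in range(0, total, step)]


def output_names(data_file: str, cache_dir: str, total: int, num_threads: int) -> list:
    """Hypothetical (dups, sizes) file names, one pair per chunk."""
    name = os.path.basename(data_file)
    return [
        (f"{cache_dir}/dups_{name}_{a}-{b}", f"{cache_dir}/sizes_{name}_{a}-{b}")
        for a, b in chunk_ranges(total, num_threads)
    ]


for dups, sizes in output_names("/data/wiki40b.test", "/tmp/cache", 10, 4):
    print(dups, sizes)
```

Whatever the exact boundaries, downstream steps should treat every dups_*/sizes_* pair in the cache directory as part of the result rather than relying on a fixed number of files.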
Full Single-File Deduplication Pipeline
# 1. Build suffix array
python3 scripts/make_suffix_array.py /data/wiki40b.test
# 2. Find self-similar duplicates
./target/debug/dedup_dataset self-similar \
--data-file /data/wiki40b.test \
--length-threshold 100 \
--cache-dir /tmp/cache \
--num-threads 8
# 3. Collect ranges (next step in pipeline)
./target/debug/dedup_dataset collect \
--data-file /data/wiki40b.test \
--cache-dir /tmp/cache \
--length-threshold 100 > /tmp/remove.byterange