Implementation:Google research Deduplicate text datasets Cmd Self Similar

From Leeroopedia
Knowledge Sources
Domains: Text_Deduplication, String_Algorithms, NLP
Last Updated: 2026-02-14 21:00 GMT

Overview

Concrete tool, provided by the deduplicate-text-datasets Rust CLI, for finding all repeated substrings within a single dataset.

Description

The self-similar subcommand of dedup_dataset scans a suffix array to find all pairs of positions whose suffixes share a common prefix of at least length-threshold bytes. It streams the suffix array from disk (via TableStream with buffered memory-mapped I/O), holds the data file in memory, and parallelizes the scan across a configurable number of threads using crossbeam scoped threads. Each thread writes its results to zstd-compressed output files: duplicate positions go to dups_* files and their lengths go to sizes_* files.
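The scan described above can be illustrated with a minimal Python sketch. It is not the tool's Rust implementation: the suffix array is built naively in memory, the chunks are processed sequentially rather than on threads, and the function names (`common_prefix_len`, `chunk_ranges`, `scan_chunk`, `find_self_similar`) and the `(pos_a, pos_b, length)` result tuples are assumptions made for illustration only.

```python
from typing import List, Tuple


def common_prefix_len(data: bytes, i: int, j: int) -> int:
    """Length of the common prefix of the suffixes starting at i and j."""
    n = 0
    while i + n < len(data) and j + n < len(data) and data[i + n] == data[j + n]:
        n += 1
    return n


def chunk_ranges(n_entries: int, num_threads: int) -> List[Tuple[int, int]]:
    """Split suffix-array indices [0, n_entries) into contiguous chunks."""
    step = max(1, -(-n_entries // num_threads))  # ceiling division, never 0
    return [(s, min(s + step, n_entries)) for s in range(0, n_entries, step)]


def scan_chunk(data: bytes, sa: List[int], start: int, end: int,
               length_threshold: int) -> List[Tuple[int, int, int]]:
    """Compare each suffix-array entry in [start, end) with its successor."""
    hits = []
    # The last entry of the whole array has no successor to compare against.
    for k in range(start, min(end, len(sa) - 1)):
        a, b = sa[k], sa[k + 1]
        n = common_prefix_len(data, a, b)
        if n >= length_threshold:
            hits.append((a, b, n))
    return hits


def find_self_similar(data: bytes, length_threshold: int,
                      num_threads: int = 8) -> List[Tuple[int, int, int]]:
    # Naive suffix array for illustration; the real pipeline builds it on disk.
    sa = sorted(range(len(data)), key=lambda i: data[i:])
    hits = []
    for start, end in chunk_ranges(len(sa), num_threads):
        # Sequential here; each chunk is independent and could run on a thread.
        hits.extend(scan_chunk(data, sa, start, end, length_threshold))
    return hits


if __name__ == "__main__":
    print(find_self_similar(b"abcXabcYabc", 3, num_threads=4))
```

Because sorting places suffixes with a shared prefix next to each other, comparing each entry only with its successor is enough to surface every repeated substring that meets the threshold.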

Usage

Use this subcommand after building a suffix array with make_suffix_array.py. It is the standard duplicate-finding step for single-file deduplication workflows. For cross-dataset deduplication, use across-similar instead.

Code Reference

Source Location

Signature

dedup_dataset self-similar \
    --data-file <path> \
    --length-threshold <n> \
    --cache-dir <dir> \
    [--num-threads <n>] \
    [--frequency-threshold <n>] \
    [--only-save-one]

// Internal Rust function signature
fn cmd_self_similar(
    data_file: String,
    length_threshold: usize,
    frequency_threshold: usize,  // default: 0
    only_save_one: bool,         // default: false
    cache_dir: String,
    num_threads: i64,            // default: 8
)

Import

# CLI tool, not importable. Requires prior build:
cargo build
# Then invoke:
./target/debug/dedup_dataset self-similar --data-file <path> ...

I/O Contract

Inputs

Name | Type | Required | Description
data_file | String | Yes | Path to the flat binary data file (must have <data_file>.table.bin alongside)
length_threshold | usize | Yes | Minimum duplicate substring length in bytes
cache_dir | String | Yes | Directory for output duplicate/sizes files
num_threads | i64 | No | Thread count (default 8)
frequency_threshold | usize | No | Minimum frequency for a duplicate to be reported (default 0)
only_save_one | bool | No | Save only one copy of each duplicate (default false)
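As a convenience when scripting the step, the inputs above can be assembled into a command line from Python. The sketch below uses only the flags listed in the Signature section. The helper names `missing_inputs` and `build_self_similar_cmd` are hypothetical, and the default binary path assumes the debug build shown under Import.

```python
from pathlib import Path
from typing import List, Optional


def missing_inputs(data_file: str) -> List[str]:
    """Return the required input paths that do not exist yet."""
    # The subcommand needs the data file and its suffix array side by side.
    required = [data_file, data_file + ".table.bin"]
    return [p for p in required if not Path(p).is_file()]


def build_self_similar_cmd(
    data_file: str,
    length_threshold: int,
    cache_dir: str,
    num_threads: Optional[int] = None,
    frequency_threshold: Optional[int] = None,
    only_save_one: bool = False,
    binary: str = "./target/debug/dedup_dataset",  # assumed build location
) -> List[str]:
    """Build the argv list for the self-similar subcommand."""
    cmd = [
        binary, "self-similar",
        "--data-file", data_file,
        "--length-threshold", str(length_threshold),
        "--cache-dir", cache_dir,
    ]
    # Optional flags are omitted so the tool's own defaults apply.
    if num_threads is not None:
        cmd += ["--num-threads", str(num_threads)]
    if frequency_threshold is not None:
        cmd += ["--frequency-threshold", str(frequency_threshold)]
    if only_save_one:
        cmd.append("--only-save-one")
    return cmd
```

The returned list can be passed to `subprocess.run(cmd, check=True)` once `missing_inputs(data_file)` returns an empty list.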

Outputs

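Name | Type | Description
dups_<name>_<i>-<j> | Binary files (zstd) | Duplicate byte positions per thread chunk, as compressed u64 arrays
sizes_<name>_<i>-<j> | Binary files (zstd) | Duplicate lengths per thread chunk, as compressed u64 arrays

Here <name> is the data file's base name and <i>-<j> identifies the thread's chunk, e.g. dups_wiki40b.test_0-N.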
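For inspecting these outputs, a small decoder for a packed u64 array might look like the sketch below. It rests on two assumptions that this page does not confirm: the bytes have already been zstd-decompressed (with whichever zstd binding is available), and the integers are stored little-endian. The function name `decode_u64_array` is hypothetical.

```python
import struct
from typing import List


def decode_u64_array(raw: bytes) -> List[int]:
    """Decode already-decompressed bytes as a packed array of u64 values.

    Assumption: little-endian byte order ("<"); adjust if the files differ.
    """
    if len(raw) % 8 != 0:
        raise ValueError("byte length is not a multiple of 8")
    count = len(raw) // 8
    return list(struct.unpack("<%dQ" % count, raw))


if __name__ == "__main__":
    # Round-trip a few sample values to show the expected layout.
    sample = struct.pack("<3Q", 0, 100, 2 ** 40)
    print(decode_u64_array(sample))
```

Rejecting inputs whose length is not a multiple of 8 catches the common mistake of decoding a file that is still compressed.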

Usage Examples

Find Duplicates in a Single File

# Prerequisites: data file and suffix array exist
# e.g., /data/wiki40b.test and /data/wiki40b.test.table.bin

# Find all repeated substrings >= 100 bytes
./target/debug/dedup_dataset self-similar \
    --data-file /data/wiki40b.test \
    --length-threshold 100 \
    --cache-dir /tmp/cache \
    --num-threads 16

# Output: /tmp/cache/dups_wiki40b.test_0-N, /tmp/cache/sizes_wiki40b.test_0-N, etc.

Full Single-File Deduplication Pipeline

# 1. Build suffix array
python3 scripts/make_suffix_array.py /data/wiki40b.test

# 2. Find self-similar duplicates
./target/debug/dedup_dataset self-similar \
    --data-file /data/wiki40b.test \
    --length-threshold 100 \
    --cache-dir /tmp/cache \
    --num-threads 8

# 3. Collect ranges (next step in pipeline)
./target/debug/dedup_dataset collect \
    --data-file /data/wiki40b.test \
    --cache-dir /tmp/cache \
    --length-threshold 100 > /tmp/remove.byterange

Related Pages

Implements Principle

Requires Environment
