Implementation:Google research Deduplicate text datasets Cmd Self Similar

Knowledge Sources	deduplicate-text-datasets, Deduplicating Training Data Makes Language Models Better
Domains	Text_Deduplication, String_Algorithms, NLP
Last Updated	2026-02-14 21:00 GMT

Overview

Concrete tool, provided by the deduplicate-text-datasets Rust CLI, for finding all repeated substrings within a single dataset.

Description

The self-similar subcommand of dedup_dataset scans a suffix array to find positions in the data whose suffixes share a common prefix at least as long as the length threshold. It streams the suffix array from disk (via TableStream with buffered memory-mapped I/O), holds the data file in memory, and parallelizes the scan across a configurable number of threads using crossbeam scoped threads. Each thread writes its results to zstd-compressed output files: the duplicate positions go to dups_* files and their lengths to sizes_* files.
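The core idea behind the scan can be illustrated on a toy input. The following is a minimal sketch, not the tool's implementation: it builds the suffix array naively by sorting, runs in a single thread, and the function names (common_prefix_len, find_repeats) are hypothetical. It relies only on the fact that repeated substrings appear as neighbouring entries once all suffixes are sorted.

```python
def common_prefix_len(data: bytes, i: int, j: int) -> int:
    """Length of the longest common prefix of data[i:] and data[j:]."""
    n = 0
    while i + n < len(data) and j + n < len(data) and data[i + n] == data[j + n]:
        n += 1
    return n


def find_repeats(data: bytes, length_threshold: int):
    """Report adjacent suffix-array entries sharing >= length_threshold bytes.

    Returns a list of (position, other_position, shared_length) tuples.
    """
    # Naive suffix array: acceptable for a toy input, far too slow for a
    # real dataset (the real tool reads a prebuilt table from disk instead).
    suffix_array = sorted(range(len(data)), key=lambda i: data[i:])
    hits = []
    for a, b in zip(suffix_array, suffix_array[1:]):
        shared = common_prefix_len(data, a, b)
        if shared >= length_threshold:
            hits.append((a, b, shared))
    return hits


if __name__ == "__main__":
    # "hello " (6 bytes) occurs at offsets 0 and 13.
    print(find_repeats(b"hello world, hello there", 6))  # [(13, 0, 6)]
```

Because sorting places suffixes with a long shared prefix next to each other, one linear pass over neighbouring entries is enough, and that pass can be split into independent chunks, which is what makes a per-thread scan possible.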

Usage

Use this subcommand after building a suffix array with make_suffix_array.py. It is the standard duplicate-finding step for single-file deduplication workflows. For cross-dataset deduplication, use across-similar instead.

Code Reference

Source Location

Repository: deduplicate-text-datasets
File: src/main.rs (L122-135 for CLI args, L849-960 for cmd_self_similar)

Signature

dedup_dataset self-similar \
    --data-file <path> \
    --length-threshold <n> \
    --cache-dir <dir> \
    [--num-threads <n>] \
    [--frequency-threshold <n>] \
    [--only-save-one]

// Internal Rust function signature
fn cmd_self_similar(
    data_file: String,
    length_threshold: usize,
    frequency_threshold: usize,  // default: 0
    only_save_one: bool,         // default: false
    cache_dir: String,
    num_threads: i64,            // default: 8
)

Import

# CLI tool, not importable. Requires prior build:
cargo build
# Then invoke:
./target/debug/dedup_dataset self-similar --data-file <path> ...

I/O Contract

Inputs

Name	Type	Required	Description
data_file	String	Yes	Path to the flat binary data file (must have <data_file>.table.bin alongside)
length_threshold	usize	Yes	Minimum duplicate substring length in bytes
cache_dir	String	Yes	Directory for output duplicate/sizes files
num_threads	i64	No	Thread count (default 8)
frequency_threshold	usize	No	Minimum frequency for a duplicate to be reported (default 0)
only_save_one	bool	No	Save only one copy of each duplicate (default false)
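The input requirements above can be checked before launching the tool. The following is a hedged sketch of a small Python wrapper: the helper name build_self_similar_argv and the default binary path are assumptions for illustration, while the flags are the ones listed in the Signature section. It only builds the argument list and does not run anything.

```python
from pathlib import Path


def build_self_similar_argv(data_file, length_threshold, cache_dir,
                            num_threads=8,
                            binary="./target/debug/dedup_dataset"):
    """Validate inputs and return an argv list for `self-similar`.

    `binary` is an assumed location (a debug build in the repo root).
    """
    data_path = Path(data_file)
    # The suffix array must sit next to the data file as <data_file>.table.bin.
    table_path = Path(str(data_path) + ".table.bin")
    if not data_path.is_file():
        raise FileNotFoundError(f"data file not found: {data_path}")
    if not table_path.is_file():
        raise FileNotFoundError(
            f"suffix array not found: {table_path} (build it first)")
    if length_threshold <= 0:
        raise ValueError("length_threshold must be a positive byte count")
    return [
        binary, "self-similar",
        "--data-file", str(data_path),
        "--length-threshold", str(length_threshold),
        "--cache-dir", str(cache_dir),
        "--num-threads", str(num_threads),
    ]
```

The returned list can be passed to subprocess.run(argv, check=True). Failing early on a missing .table.bin file gives a clearer error than letting the CLI fail on open.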

Outputs

Name	Type	Description
dups_<data_file_name>_<i>-<j>	Binary files (zstd)	Duplicate byte positions per thread chunk, as compressed u64 arrays
sizes_<data_file_name>_<i>-<j>	Binary files (zstd)	Duplicate lengths per thread chunk, as compressed u64 arrays
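To make the output layout concrete, the sketch below decodes two such arrays and pairs them into byte ranges. It rests on several assumptions not stated on this page: the bytes have already been zstd-decompressed by some other means, the u64 values are little-endian, and the i-th entry of a sizes file describes the i-th entry of the matching dups file. The function names are hypothetical.

```python
import struct


def decode_u64s(raw: bytes, byteorder: str = "<"):
    """Decode a flat array of u64 values (little-endian is an assumption)."""
    if len(raw) % 8 != 0:
        raise ValueError("buffer length is not a multiple of 8 bytes")
    count = len(raw) // 8
    return list(struct.unpack(f"{byteorder}{count}Q", raw))


def pair_positions_with_lengths(dups_raw: bytes, sizes_raw: bytes):
    """Return (start, end) byte ranges, assuming entries align one-to-one."""
    positions = decode_u64s(dups_raw)
    lengths = decode_u64s(sizes_raw)
    if len(positions) != len(lengths):
        raise ValueError("dups and sizes arrays differ in length")
    return [(pos, pos + n) for pos, n in zip(positions, lengths)]
```

In the documented pipeline the collect subcommand consumes these files directly, so a decoder like this is mainly useful for inspecting or debugging intermediate results.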

Usage Examples

Find Duplicates in a Single File

# Prerequisites: data file and suffix array exist
# e.g., /data/wiki40b.test and /data/wiki40b.test.table.bin

# Find all repeated substrings >= 100 bytes
./target/debug/dedup_dataset self-similar \
    --data-file /data/wiki40b.test \
    --length-threshold 100 \
    --cache-dir /tmp/cache \
    --num-threads 16

# Output: /tmp/cache/dups_wiki40b.test_0-N, /tmp/cache/sizes_wiki40b.test_0-N, etc.

Full Single-File Deduplication Pipeline

# 1. Build suffix array
python3 scripts/make_suffix_array.py /data/wiki40b.test

# 2. Find self-similar duplicates
./target/debug/dedup_dataset self-similar \
    --data-file /data/wiki40b.test \
    --length-threshold 100 \
    --cache-dir /tmp/cache \
    --num-threads 8

# 3. Collect ranges (next step in pipeline)
./target/debug/dedup_dataset collect \
    --data-file /data/wiki40b.test \
    --cache-dir /tmp/cache \
    --length-threshold 100 > /tmp/remove.byterange

Related Pages

Implements Principle

Principle:Google_research_Deduplicate_text_datasets_Self_Similar_Duplicate_Detection

Requires Environment

Environment:Google_research_Deduplicate_text_datasets_Rust_Cargo_Build_Environment
