Implementation:Google research Deduplicate text datasets Cmd Collect

Knowledge Sources	deduplicate-text-datasets, Deduplicating Training Data Makes Language Models Better
Domains	Text_Deduplication, Data_Processing
Last Updated	2026-02-14 21:00 GMT

Overview

A concrete tool, provided by the deduplicate-text-datasets Rust CLI, that merges raw duplicate markers into contiguous byte-range removal lists.

Description

The collect subcommand of dedup_dataset reads all dups_* and sizes_* files from the cache directory produced by the self-similar or across-similar step. It allocates a bit vector (bitvec) with one bit per byte of the data file, sets the bit for every byte that belongs to a duplicate satisfying the length threshold, and then scans the bit vector to extract contiguous byte ranges. The result is written to stdout as text: an out header line followed by one start end pair per line.
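
The mark-then-scan idea described above can be sketched in a few lines of Python. This is a minimal illustration, not the Rust code in cmd_collect. The function name collect_ranges, the half-open (start, end) span input, and the use of >= for the threshold comparison are all assumptions made for this example. The on-disk layout of the dups_* and sizes_* files is not modeled, and a bytearray stands in for a true bit vector.

```python
from typing import Iterable, List, Tuple


def collect_ranges(
    data_len: int,
    spans: Iterable[Tuple[int, int]],
    length_threshold: int,
) -> List[Tuple[int, int]]:
    """Merge half-open (start, end) spans into sorted, disjoint byte ranges."""
    # One flag per byte of the data file (hypothetical stand-in for a bit vector).
    marked = bytearray(data_len)
    for start, end in spans:
        if end - start < length_threshold:
            continue  # assumption: spans shorter than the threshold are ignored
        start = max(start, 0)
        end = min(end, data_len)
        if start < end:
            marked[start:end] = b"\x01" * (end - start)

    # Single left-to-right scan: open a range where flags turn on,
    # close it where they turn off again.
    ranges: List[Tuple[int, int]] = []
    range_start = None
    for pos, flag in enumerate(marked):
        if flag and range_start is None:
            range_start = pos
        elif not flag and range_start is not None:
            ranges.append((range_start, pos))
            range_start = None
    if range_start is not None:
        ranges.append((range_start, data_len))
    return ranges
```

Because overlapping and adjacent spans set the same or neighbouring flags, the scan merges them without any explicit sorting, and the ranges come out in ascending order.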

Usage

Use this subcommand after running self-similar or across-similar to convert raw duplicate markers into an actionable byte-range removal file. Redirect stdout to a file for use by the downstream removal step.

Code Reference

Source Location

Repository: deduplicate-text-datasets
File: src/main.rs (L159-166 for CLI args, L1430-1526 for cmd_collect)

Signature

dedup_dataset collect \
    --data-file <path> \
    --cache-dir <dir> \
    --length-threshold <n>

// Internal Rust function signature
fn cmd_collect(
    data_file: String,
    cache_dir: String,
    length_threshold: u64,
)

Import

# CLI tool, not importable. Requires prior build:
cargo build
# Then invoke:
./target/debug/dedup_dataset collect --data-file <path> ...

I/O Contract

Inputs

Name	Type	Required	Description
data_file	String	Yes	Path to the flat binary data file (must have .table.bin alongside)
cache_dir	String	Yes	Directory containing dups_* and sizes_* files from self-similar or across-similar
length_threshold	u64	Yes	Minimum duplicate length to include in output ranges

Outputs

Name	Type	Description
stdout	Text stream	An "out" header line, followed by one "start end" byte-range pair per line in ascending order
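
As an illustration of the output contract above, the following sketch reads the text stream back into a list of integer pairs. The function name parse_byteranges is hypothetical. Skipping any lines that precede the out header is a defensive assumption and not documented behaviour of the tool.

```python
from typing import List, Tuple


def parse_byteranges(text: str) -> List[Tuple[int, int]]:
    """Parse an 'out' header followed by 'start end' lines into integer pairs."""
    lines = iter(text.splitlines())
    # Advance to the header; anything before it is ignored (assumption).
    for line in lines:
        if line.strip() == "out":
            break
    else:
        raise ValueError("missing 'out' header line")

    ranges: List[Tuple[int, int]] = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # tolerate blank lines
        if len(fields) != 2:
            raise ValueError(f"expected 'start end', got: {line!r}")
        start, end = int(fields[0]), int(fields[1])
        if start > end:
            raise ValueError(f"start after end: {line!r}")
        if ranges and start < ranges[-1][0]:
            raise ValueError("ranges are not in ascending order")
        ranges.append((start, end))
    return ranges
```

Validating the ordering while parsing lets a downstream step fail early on a truncated or hand-edited range file.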

Usage Examples

Collect and Save Byte Ranges

# After running self-similar, collect ranges and redirect to file
./target/debug/dedup_dataset collect \
    --data-file /data/wiki40b.test \
    --cache-dir /tmp/cache \
    --length-threshold 100 > /tmp/wiki40b.test.remove.byterange

# Output format:
# out
# 1234 5678
# 9012 9500
# ...

Use in a Pipeline

# Full pipeline: build SA -> find dups -> collect -> remove
python3 scripts/make_suffix_array.py /data/wiki40b.test

./target/debug/dedup_dataset self-similar \
    --data-file /data/wiki40b.test \
    --length-threshold 100 \
    --cache-dir /tmp/cache \
    --num-threads 8

./target/debug/dedup_dataset collect \
    --data-file /data/wiki40b.test \
    --cache-dir /tmp/cache \
    --length-threshold 100 > /tmp/remove.byterange

python3 scripts/finish_single_file.py /data/wiki40b.test /tmp/remove.byterange /data/wiki40b.test.deduped
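
To make the final removal step concrete, here is a minimal sketch of applying collected ranges to an in-memory byte string. It is not the implementation of scripts/finish_single_file.py. The function name remove_ranges is hypothetical, and the sketch assumes the ranges are sorted, non-overlapping, and half-open (end exclusive), which this page does not state.

```python
from typing import Iterable, Tuple


def remove_ranges(data: bytes, ranges: Iterable[Tuple[int, int]]) -> bytes:
    """Return data with every [start, end) range cut out."""
    kept = []
    cursor = 0  # first byte not yet copied or skipped
    for start, end in ranges:
        kept.append(data[cursor:start])  # keep the gap before this range
        cursor = end                     # skip over the duplicate bytes
    kept.append(data[cursor:])           # keep whatever follows the last range
    return b"".join(kept)
```

For example, under these assumptions remove_ranges(b"abcdefghij", [(2, 4), (7, 9)]) drops "cd" and "hi" and keeps the rest in order.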

Related Pages

Implements Principle

Principle:Google_research_Deduplicate_text_datasets_Duplicate_Range_Collection

Requires Environment

Environment:Google_research_Deduplicate_text_datasets_Rust_Cargo_Build_Environment
