Implementation:Google research Deduplicate text datasets Cmd Collect

From Leeroopedia
Knowledge Sources
Domains: Text_Deduplication, Data_Processing
Last Updated: 2026-02-14 21:00 GMT

Overview

Concrete tool, provided by the deduplicate-text-datasets Rust CLI, for merging raw duplicate markers into contiguous byte-range removal lists.

Description

The collect subcommand of dedup_dataset reads all dups_* and sizes_* files from the cache directory produced by the self-similar or across-similar step. It builds a bit vector (bitvec) with one bit per byte of the data file, sets the bit for every byte that belongs to a duplicate exceeding the length threshold, and then scans the vector to extract contiguous byte ranges. The result is written to stdout as text: an out header line followed by one start end pair per range.
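
The mark-then-scan idea described above can be sketched in Python. This is only an illustration, not the tool's Rust code. The function name collect_ranges, the (start, length) input pairs, and the half-open [start, end) range convention are assumptions made for the example. A bytearray with one flag per byte stands in for the bit vector, and treating the threshold as an inclusive minimum is also an assumption.

```python
def collect_ranges(data_size, duplicates, length_threshold):
    """Merge (start, length) duplicate markers into contiguous byte ranges.

    Hypothetical helper: the real tool reads its markers from the
    dups_* and sizes_* cache files, whose layout is not modelled here.
    """
    # One flag per byte of the data file (stand-in for a bit vector).
    marked = bytearray(data_size)
    for start, length in duplicates:
        if length < length_threshold:  # assumed: threshold is an inclusive minimum
            continue
        end = min(start + length, data_size)
        marked[start:end] = b"\x01" * (end - start)

    # Scan the flags and emit each maximal run of marked bytes.
    ranges = []
    run_start = None
    for pos, flag in enumerate(marked):
        if flag and run_start is None:
            run_start = pos
        elif not flag and run_start is not None:
            ranges.append((run_start, pos))
            run_start = None
    if run_start is not None:  # a run that reaches the end of the data
        ranges.append((run_start, data_size))
    return ranges


ranges = collect_ranges(
    data_size=40,
    duplicates=[(2, 10), (8, 10), (25, 3), (30, 10)],
    length_threshold=5,
)
print(ranges)  # [(2, 18), (30, 40)]
```

The two overlapping markers at offsets 2 and 8 merge into a single range, and the 3-byte marker is dropped because it falls below the threshold.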

Usage

Use this subcommand after running self-similar or across-similar to convert raw duplicate markers into an actionable byte-range removal file. Redirect stdout to a file for use by the downstream removal step.
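
When the step is driven from a script rather than a shell, the stdout redirection can be done with Python's subprocess module. The sketch below is a minimal wrapper under stated assumptions: build_collect_command and run_collect are hypothetical helper names, and the binary path is the debug build location used in the examples on this page. Only the three flags documented here are passed.

```python
import subprocess


def build_collect_command(binary, data_file, cache_dir, length_threshold):
    """Assemble the argv list for the collect subcommand (hypothetical helper)."""
    return [
        binary, "collect",
        "--data-file", data_file,
        "--cache-dir", cache_dir,
        "--length-threshold", str(length_threshold),
    ]


def run_collect(binary, data_file, cache_dir, length_threshold, out_path):
    """Run collect and write its stdout to out_path (hypothetical helper)."""
    cmd = build_collect_command(binary, data_file, cache_dir, length_threshold)
    with open(out_path, "wb") as out:
        # check=True raises CalledProcessError if the tool exits non-zero.
        subprocess.run(cmd, stdout=out, check=True)


cmd = build_collect_command(
    "./target/debug/dedup_dataset", "/data/wiki40b.test", "/tmp/cache", 100
)
print(" ".join(cmd))
```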

Code Reference

Source Location

Signature

dedup_dataset collect \
    --data-file <path> \
    --cache-dir <dir> \
    --length-threshold <n>

// Internal Rust function signature
fn cmd_collect(
    data_file: String,
    cache_dir: String,
    length_threshold: u64,
)

Import

# CLI tool, not importable. Requires prior build:
cargo build
# Then invoke:
./target/debug/dedup_dataset collect --data-file <path> ...

I/O Contract

Inputs

Name | Type | Required | Description
data_file | String | Yes | Path to the flat binary data file (must have .table.bin alongside)
cache_dir | String | Yes | Directory containing dups_* and sizes_* files from self-similar or across-similar
length_threshold | u64 | Yes | Minimum duplicate length to include in output ranges

Outputs

Name | Type | Description
stdout | Text stream | An "out" header line, followed by one "start end" byte-range pair per line, in ascending order

Usage Examples

Collect and Save Byte Ranges

# After running self-similar, collect ranges and redirect to file
./target/debug/dedup_dataset collect \
    --data-file /data/wiki40b.test \
    --cache-dir /tmp/cache \
    --length-threshold 100 > /tmp/wiki40b.test.remove.byterange

# Output format:
# out
# 1234 5678
# 9012 9500
# ...
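
A downstream consumer needs to read this format back. The following is a minimal parsing sketch based only on the format shown above: it skips everything up to the out header and then reads one start end pair per line. The name parse_byteranges is hypothetical, and skipping lines that do not have exactly two fields is an assumption.

```python
def parse_byteranges(text):
    """Parse collect output into a list of (start, end) integer pairs."""
    lines = iter(text.splitlines())

    # Skip anything printed before the "out" header line.
    for line in lines:
        if line.strip() == "out":
            break

    ranges = []
    for line in lines:
        fields = line.split()
        if len(fields) != 2:  # assumed: ignore blank or unexpected lines
            continue
        ranges.append((int(fields[0]), int(fields[1])))
    return ranges


sample = "out\n1234 5678\n9012 9500\n"
parsed = parse_byteranges(sample)
print(parsed)  # [(1234, 5678), (9012, 9500)]
```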

Use in a Pipeline

# Full pipeline: build SA -> find dups -> collect -> remove
python3 scripts/make_suffix_array.py /data/wiki40b.test

./target/debug/dedup_dataset self-similar \
    --data-file /data/wiki40b.test \
    --length-threshold 100 \
    --cache-dir /tmp/cache \
    --num-threads 8

./target/debug/dedup_dataset collect \
    --data-file /data/wiki40b.test \
    --cache-dir /tmp/cache \
    --length-threshold 100 > /tmp/remove.byterange

python3 scripts/finish_single_file.py /data/wiki40b.test /tmp/remove.byterange /data/wiki40b.test.deduped
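
The final removal step can be pictured as copying every byte that lies outside the collected ranges. The sketch below shows that idea on an in-memory byte string. It is not the contents of scripts/finish_single_file.py: the name remove_ranges is hypothetical, and treating each pair as a half-open [start, end) range is an assumption.

```python
def remove_ranges(data, ranges):
    """Return data with every [start, end) byte range cut out."""
    kept = bytearray()
    cursor = 0  # first byte not yet copied or skipped
    for start, end in sorted(ranges):
        if start > cursor:
            kept += data[cursor:start]  # keep the gap before this range
        cursor = max(cursor, end)       # tolerate overlapping ranges
    kept += data[cursor:]               # keep the tail after the last range
    return bytes(kept)


data = b"abcdefghijklmnopqrstuvwxyz"
deduped = remove_ranges(data, [(2, 5), (10, 20)])
print(deduped)  # b'abfghijuvwxyz'
```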
