Implementation:Google research Deduplicate text datasets Cmd Collect
| Knowledge Sources | |
|---|---|
| Domains | Text_Deduplication, Data_Processing |
| Last Updated | 2026-02-14 21:00 GMT |
Overview
Concrete tool, provided by the deduplicate-text-datasets Rust CLI, for merging raw duplicate markers into contiguous byte-range removal lists.
Description
The collect subcommand of dedup_dataset reads all dups_* and sizes_* files from the cache directory produced by the self-similar or across-similar step. It constructs a bitvec (bit-level vector) with one bit per byte of the data file, sets the bit for every byte that participates in a duplicate exceeding the length threshold, then scans the bitvector to extract contiguous runs of set bits as byte ranges. The result is written to stdout as plain text: an out header line followed by one start end pair per line.
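The mark-then-scan idea described above can be sketched in a few lines of Python. This is an illustration only, not a port of cmd_collect: the function name collect_ranges, the (start, length) marker shape, the half-open [start, end) ranges, and the inclusive threshold comparison are all assumptions made for the example.

```python
from typing import Iterable, List, Tuple


def collect_ranges(
    data_size: int,
    duplicates: Iterable[Tuple[int, int]],
    length_threshold: int,
) -> List[Tuple[int, int]]:
    """Merge (start, length) duplicate markers into contiguous byte ranges."""
    # One flag per byte of the data file; a bytearray stands in for the bit vector.
    marked = bytearray(data_size)
    for start, length in duplicates:
        # Assumption: markers shorter than the threshold are ignored.
        if length < length_threshold:
            continue
        end = min(start + length, data_size)
        marked[start:end] = b"\x01" * max(end - start, 0)

    # Scan the flags once and emit each maximal run of marked bytes.
    ranges: List[Tuple[int, int]] = []
    range_start = None
    for offset, flag in enumerate(marked):
        if flag and range_start is None:
            range_start = offset
        elif not flag and range_start is not None:
            ranges.append((range_start, offset))
            range_start = None
    if range_start is not None:
        # The last run reaches the end of the data.
        ranges.append((range_start, data_size))
    return ranges
```

Because overlapping or adjacent markers set the same flags, they collapse into a single range without any explicit interval-merging logic, and the ranges come out in ascending order as a side effect of the scan.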
Usage
Use this subcommand after running self-similar or across-similar to convert raw duplicate markers into an actionable byte-range removal file. Redirect stdout to a file for use by the downstream removal step.
Code Reference
Source Location
- Repository: deduplicate-text-datasets
- File: src/main.rs (L159-166 for CLI args, L1430-1526 for cmd_collect)
Signature
dedup_dataset collect \
--data-file <path> \
--cache-dir <dir> \
--length-threshold <n>
// Internal Rust function signature
fn cmd_collect(
data_file: String,
cache_dir: String,
length_threshold: u64,
)
Import
# CLI tool, not importable. Requires prior build:
cargo build
# Then invoke:
./target/debug/dedup_dataset collect --data-file <path> ...
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| data_file | String | Yes | Path to the flat binary data file (its suffix array file, .table.bin, must exist alongside it) |
| cache_dir | String | Yes | Directory containing dups_* and sizes_* files from self-similar or across-similar |
| length_threshold | u64 | Yes | Minimum duplicate length to include in output ranges |
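Before invoking collect, it can help to confirm that the inputs listed above are in place. The following is a hypothetical pre-flight helper, not part of the repository; it assumes that "alongside" means the suffix array lives at the data file's path with .table.bin appended, and it only checks that at least one file with each prefix exists in the cache directory.

```python
import glob
import os
from typing import List


def missing_collect_inputs(data_file: str, cache_dir: str) -> List[str]:
    """Return a list of human-readable problems; an empty list means all inputs were found."""
    problems: List[str] = []
    if not os.path.isfile(data_file):
        problems.append(f"data file not found: {data_file}")
    # Assumption: the suffix array is stored as <data_file>.table.bin.
    table = data_file + ".table.bin"
    if not os.path.isfile(table):
        problems.append(f"suffix array not found: {table}")
    # The cache directory should hold dups_* and sizes_* files from the previous step.
    for prefix in ("dups_", "sizes_"):
        if not glob.glob(os.path.join(cache_dir, prefix + "*")):
            problems.append(f"no {prefix}* files in {cache_dir}")
    return problems
```

An empty return value only shows that the expected files exist; it does not verify that they were produced for the same data file.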
Outputs
| Name | Type | Description |
|---|---|---|
| stdout | Text stream | A single "out" header line, followed by one "start end" byte-range pair per line, in ascending order |
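A minimal sketch of reading this output format back into memory, assuming the layout in the table above (an out line, then whitespace-separated integer pairs). The function name parse_byterange is hypothetical, and skipping any lines that precede the header is a tolerance added for the example rather than documented behavior.

```python
from typing import Iterable, List, Tuple


def parse_byterange(lines: Iterable[str]) -> List[Tuple[int, int]]:
    """Parse collect output: find the `out` header, then read `start end` pairs."""
    ranges: List[Tuple[int, int]] = []
    seen_header = False
    for line in lines:
        line = line.strip()
        if not seen_header:
            # Assumption: anything before the header is not part of the range list.
            seen_header = line == "out"
            continue
        if not line:
            continue  # tolerate blank lines, e.g. a trailing newline
        start, end = map(int, line.split())
        ranges.append((start, end))
    if not seen_header:
        raise ValueError("missing 'out' header")
    return ranges
```

It accepts any iterable of lines, so an open file object for the redirected output can be passed directly.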
Usage Examples
Collect and Save Byte Ranges
# After running self-similar, collect ranges and redirect to file
./target/debug/dedup_dataset collect \
--data-file /data/wiki40b.test \
--cache-dir /tmp/cache \
--length-threshold 100 > /tmp/wiki40b.test.remove.byterange
# Output format:
# out
# 1234 5678
# 9012 9500
# ...
Use in a Pipeline
# Full pipeline: build suffix array -> find duplicates -> collect -> remove
python3 scripts/make_suffix_array.py /data/wiki40b.test
./target/debug/dedup_dataset self-similar \
--data-file /data/wiki40b.test \
--length-threshold 100 \
--cache-dir /tmp/cache \
--num-threads 8
./target/debug/dedup_dataset collect \
--data-file /data/wiki40b.test \
--cache-dir /tmp/cache \
--length-threshold 100 > /tmp/remove.byterange
python3 scripts/finish_single_file.py /data/wiki40b.test /tmp/remove.byterange /data/wiki40b.test.deduped
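To make the final removal step concrete, here is a small sketch of what applying a byte-range list to a buffer looks like. It is not the repository's finish_single_file.py and may differ from it in boundary handling: the function name remove_ranges is hypothetical, and the ranges are assumed to be half-open [start, end) offsets into the data.

```python
from typing import Iterable, Tuple


def remove_ranges(data: bytes, ranges: Iterable[Tuple[int, int]]) -> bytes:
    """Return `data` with every [start, end) range cut out."""
    kept = []
    cursor = 0  # first byte not yet copied or skipped
    for start, end in sorted(ranges):
        if start > cursor:
            kept.append(data[cursor:start])  # keep the gap before this range
        cursor = max(cursor, end)  # max() keeps overlapping ranges from moving backwards
    kept.append(data[cursor:])  # keep everything after the last range
    return b"".join(kept)
```

For instance, remove_ranges(b"abcdefghij", [(2, 4), (6, 8)]) yields b"abefij". For files too large to hold in memory, the same cursor logic could be applied while streaming from one file to another.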