Implementation:Google research Deduplicate text datasets Finish Single File
| Knowledge Sources | |
|---|---|
| Domains | Text_Deduplication, Data_Processing |
| Last Updated | 2026-02-14 21:00 GMT |
Overview
Concrete tool, provided by the deduplicate-text-datasets repository, for excising duplicate byte ranges from a flat binary file.
Description
The finish_single_file.py script reads a removal file (the output of collect) and the original binary data file, then writes a new file with all specified byte ranges removed. It parses the removal file by skipping lines until it finds the out marker, then reads every subsequent line as a start end pair. The range list is reversed and then consumed with a stack-like pop, so the ranges are applied in ascending offset order and the original file is read in a single forward pass: the script copies the bytes before each range, seeks past the range, and finally copies whatever remains.
This is a pure-Python script that imports only the standard-library sys module and has no third-party dependencies.
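The steps described above can be sketched in a few lines of Python. This is a minimal illustration, not the script's actual source: the function names read_ranges and excise are hypothetical, and the sketch assumes that each range is half-open (start inclusive, end exclusive) and that the removal file lists ranges in ascending, non-overlapping order.

```python
def read_ranges(remove_path):
    """Skip lines up to and including the one containing 'out',
    then read every following 'start end' pair."""
    ranges = []
    with open(remove_path) as fin:
        for line in fin:
            if "out" in line:
                break
        # The file iterator resumes right after the marker line.
        for line in fin:
            parts = line.split()
            if len(parts) == 2:
                ranges.append((int(parts[0]), int(parts[1])))
    return ranges


def excise(original_path, remove_path, output_path):
    """Copy original_path to output_path, omitting the listed byte ranges.

    Assumption: ranges are ascending, non-overlapping, and half-open.
    """
    # Reverse once so that pop() yields the lowest offset first.
    stack = read_ranges(remove_path)[::-1]
    with open(original_path, "rb") as src, open(output_path, "wb") as dst:
        pos = 0  # current read offset in the original file
        while stack:
            start, end = stack.pop()
            dst.write(src.read(start - pos))  # keep bytes before the range
            src.seek(end)                     # skip the duplicate range
            pos = end
        dst.write(src.read())                 # keep the tail of the file
```

Under these assumptions, a 10-byte file containing 0123456789 with ranges 2 4 and 6 8 would come out as 014589. Reversing the list and popping from the end keeps each removal cheap while still visiting ranges from the lowest offset upward.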
Usage
Use this script as the final step in the single-file deduplication workflow or in cross-dataset deduplication when working with flat binary files. For TFDS-structured datasets, use finish_dedup_wiki40b.py instead.
Code Reference
Source Location
- Repository: deduplicate-text-datasets
- File: scripts/finish_single_file.py (L1-38)
Signature
python3 scripts/finish_single_file.py <original_file> <remove_file> <output_file>
# Script entry point (not importable as a function)
# sys.argv[1] = original file path
# sys.argv[2] = removal file path (output of collect)
# sys.argv[3] = deduplicated output file path
Import
# CLI script, not importable as a library.
python3 scripts/finish_single_file.py <original> <remove_file> <output>
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| original | str (positional) | Yes | Path to the original flat binary data file |
| remove_file | str (positional) | Yes | Path to the byte-range removal file (from the collect step); format: a line containing "out", followed by one "start end" byte-offset pair per line |
| deduped | str (positional) | Yes | Path for the deduplicated output file (shown as <output_file> in the signature) |
Outputs
| Name | Type | Description |
|---|---|---|
| output_file | Binary file | Deduplicated flat binary file with all specified byte ranges excised |
Usage Examples
Basic Deduplication
# Given: /data/wiki40b.test (original) and /tmp/remove.byterange (from collect)
python3 scripts/finish_single_file.py \
/data/wiki40b.test \
/tmp/remove.byterange \
/data/wiki40b.test.deduped
# Verify the output is smaller than the original
ls -la /data/wiki40b.test /data/wiki40b.test.deduped
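A stricter check than comparing sizes by eye is to compute the size the output should have. The sketch below is an illustration only: expected_size is a hypothetical helper, and it assumes the removal file lists half-open, ascending, non-overlapping ranges that lie within the original file, so that the removed byte count is simply the sum of end - start.

```python
import os


def expected_size(original_path, remove_path):
    """Return the size the deduplicated file should have, in bytes.

    Assumption: ranges after the 'out' marker are half-open, ascending,
    and non-overlapping; a ValueError is raised if they are not.
    """
    size = os.path.getsize(original_path)
    removed = 0
    prev_end = 0
    with open(remove_path) as fin:
        for line in fin:
            if "out" in line:
                break
        for line in fin:
            parts = line.split()
            if len(parts) != 2:
                continue  # ignore blank or malformed lines
            start, end = int(parts[0]), int(parts[1])
            if start < prev_end or end < start or end > size:
                raise ValueError(f"unexpected range: {start} {end}")
            removed += end - start
            prev_end = end
    return size - removed
```

Comparing this value with os.path.getsize on the deduplicated file gives a quick pass/fail signal; a mismatch suggests the ranges overlap or the wrong removal file was used.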
End-to-End Single File Pipeline
# Complete pipeline from raw file to deduplicated output
DATA=/data/my_corpus.bin
# Step 1: Build suffix array
python3 scripts/make_suffix_array.py $DATA
# Step 2: Find self-similar duplicates
./target/debug/dedup_dataset self-similar \
--data-file $DATA \
--length-threshold 100 \
--cache-dir /tmp/cache \
--num-threads 8
# Step 3: Collect byte ranges
./target/debug/dedup_dataset collect \
--data-file $DATA \
--cache-dir /tmp/cache \
--length-threshold 100 > /tmp/remove.byterange
# Step 4: Apply removal
python3 scripts/finish_single_file.py $DATA /tmp/remove.byterange ${DATA}.deduped