Implementation:Google research Deduplicate text datasets Finish Single File

From Leeroopedia
Knowledge Sources
Domains Text_Deduplication, Data_Processing
Last Updated 2026-02-14 21:00 GMT

Overview

A concrete tool, provided by the deduplicate-text-datasets repository, for excising duplicate byte ranges from a flat binary file.

Description

The finish_single_file.py script reads a removal file (the output of collect) and the original binary data file, then writes a new file with every listed byte range removed. It parses the removal file by skipping lines until it reaches the out marker, then reads each subsequent line as a start end pair. The list of ranges is reversed and then consumed with a stack-like pop, so ranges are applied in ascending offset order and the original file is read in a single forward pass.

This is a pure-Python script with no third-party dependencies; the only module it imports is sys from the standard library.
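The single-pass excision described above can be sketched as follows. This is a minimal illustration rather than the script's actual source: the function name excise_ranges and the demo helper are hypothetical, and the sketch assumes each start end pair denotes a half-open byte range [start, end).

```python
import io


def excise_ranges(src, dst, ranges):
    """Copy src to dst, skipping every (start, end) byte range.

    Assumption: ranges are half-open, i.e. bytes start..end-1 are dropped.
    src must be a seekable binary stream; dst a writable binary stream.
    """
    # Sort descending so that pop() hands back ranges in ascending order.
    stack = sorted(ranges, reverse=True)
    pos = 0  # current read offset in src
    while stack:
        start, end = stack.pop()
        if end <= pos:
            continue  # range already covered by an earlier one
        start = max(start, pos)  # tolerate overlapping ranges
        dst.write(src.read(start - pos))  # keep the bytes before the range
        src.seek(end)  # jump over the duplicate span
        pos = end
    dst.write(src.read())  # keep everything after the last range


def demo():
    """Remove bytes 2-3 and 6-7 from a ten-byte buffer."""
    src = io.BytesIO(b"abcdefghij")
    dst = io.BytesIO()
    excise_ranges(src, dst, [(6, 8), (2, 4)])
    return dst.getvalue()
```

Because the ranges are consumed in ascending order, the source stream only ever moves forward, which is what keeps the pass linear in the size of the file.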

Usage

Use this script as the final step in the single-file deduplication workflow or in cross-dataset deduplication when working with flat binary files. For TFDS-structured datasets, use finish_dedup_wiki40b.py instead.

Code Reference

Source Location

Signature

python3 scripts/finish_single_file.py <original_file> <remove_file> <output_file>
# Script entry point (not importable as a function)
# sys.argv[1] = original file path
# sys.argv[2] = removal file path (output of collect)
# sys.argv[3] = deduplicated output file path

Import

# CLI script, not importable as a library.
python3 scripts/finish_single_file.py <original> <remove_file> <output>

I/O Contract

Inputs

Name | Type | Required | Description
original | str (positional) | Yes | Path to the original flat binary data file
remove_file | str (positional) | Yes | Path to the byte-range removal file (from the collect step); format: an "out" header line followed by "start end" pairs
deduped | str (positional) | Yes | Path for the deduplicated output file
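The removal-file format in the table above can be parsed along these lines. This is a hedged sketch, not the script's own code: parse_removal_lines and the sample lines are hypothetical, and it assumes the out marker sits alone on its line and that each range line holds exactly two whitespace-separated integers.

```python
def parse_removal_lines(lines):
    """Return the (start, end) pairs that follow the 'out' marker line."""
    it = iter(lines)
    # Skip everything up to and including the marker line.
    for line in it:
        if line.strip() == "out":  # assumption: marker is alone on its line
            break
    ranges = []
    for line in it:
        fields = line.split()
        if len(fields) != 2:
            continue  # assumption: blank or malformed lines are ignored
        ranges.append((int(fields[0]), int(fields[1])))
    return ranges


# Hypothetical removal-file content, one string per line.
sample_lines = ["some preamble\n", "out\n", "10 25\n", "100 180\n"]
sample_ranges = parse_removal_lines(sample_lines)
```

Passing an open text file object instead of a list works the same way, since both yield one line per iteration.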

Outputs

Name | Type | Description
output_file | Binary file | Deduplicated flat binary file with all specified byte ranges excised

Usage Examples

Basic Deduplication

# Given: /data/wiki40b.test (original) and /tmp/remove.byterange (from collect)
python3 scripts/finish_single_file.py \
    /data/wiki40b.test \
    /tmp/remove.byterange \
    /data/wiki40b.test.deduped

# Verify the output is smaller than the original
ls -la /data/wiki40b.test /data/wiki40b.test.deduped
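A stricter check than comparing sizes by eye is to compute the size the output should have. The sketch below is illustrative only: expected_deduped_size is a hypothetical helper, and it assumes half-open [start, end) ranges, merging any overlaps and clamping ranges to the file length.

```python
def expected_deduped_size(original_size, ranges):
    """Return original_size minus the number of distinct bytes covered by ranges."""
    removed = 0
    pos = 0  # end of the region already counted as removed
    for start, end in sorted(ranges):
        start = max(start, pos)  # do not count overlapping bytes twice
        end = min(end, original_size)  # ignore anything past end of file
        if end > start:
            removed += end - start
            pos = end
    return original_size - removed
```

The result can be compared against os.path.getsize on the deduplicated file, with original_size taken from os.path.getsize on the original.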

End-to-End Single File Pipeline

# Complete pipeline from raw file to deduplicated output
DATA=/data/my_corpus.bin

# Step 1: Build suffix array
python3 scripts/make_suffix_array.py $DATA

# Step 2: Find self-similar duplicates
./target/debug/dedup_dataset self-similar \
    --data-file $DATA \
    --length-threshold 100 \
    --cache-dir /tmp/cache \
    --num-threads 8

# Step 3: Collect byte ranges
./target/debug/dedup_dataset collect \
    --data-file $DATA \
    --cache-dir /tmp/cache \
    --length-threshold 100 > /tmp/remove.byterange

# Step 4: Apply removal
python3 scripts/finish_single_file.py $DATA /tmp/remove.byterange ${DATA}.deduped

Related Pages

Implements Principle

Requires Environment
