Implementation:Google research Deduplicate text datasets Finish Single File

Knowledge Sources	deduplicate-text-datasets Deduplicating Training Data Makes Language Models Better
Domains	Text_Deduplication, Data_Processing
Last Updated	2026-02-14 21:00 GMT

Overview

Concrete tool, provided by the deduplicate-text-datasets repository, for excising duplicate byte ranges from a flat binary file.

Description

The finish_single_file.py script reads a removal file (the output of the collect step) and the original binary data file, then writes a new file with every listed byte range removed. It parses the removal file by skipping lines until it finds the out marker, then reads each subsequent line as a start end pair of byte offsets. The list of ranges is reversed once and then consumed with a stack-like pop, so the ranges are applied in ascending offset order and the original file is read in a single forward pass.

This is a pure-Python script with no external dependencies beyond sys.
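
The single-pass excision described above can be sketched as follows. This is a minimal illustration, not the script's actual source: the function name excise_ranges is hypothetical, in-memory buffers stand in for the real files, and each range is assumed to be half-open (start included, end excluded), sorted, and non-overlapping.

```python
import io


def excise_ranges(src, dst, ranges):
    """Copy src to dst, skipping each byte range [start, end).

    Assumes ranges are sorted by start offset and do not overlap.
    """
    # Reverse once so that pop() hands back ranges in ascending order.
    pending = list(ranges)[::-1]
    cursor = 0  # offset in src where the next kept bytes begin
    while pending:
        start, end = pending.pop()
        dst.write(src.read(start - cursor))  # keep bytes before the range
        src.seek(end)                        # skip the duplicate bytes
        cursor = end
    dst.write(src.read())                    # keep everything after the last range


src = io.BytesIO(b"abcdefghij")
dst = io.BytesIO()
excise_ranges(src, dst, [(2, 4), (7, 9)])
print(dst.getvalue())  # b'abefgj'
```

Because the read position only ever moves forward, the same loop works on files opened in "rb" and "wb" mode without loading the whole dataset into memory.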

Usage

Use this script as the final step in the single-file deduplication workflow or in cross-dataset deduplication when working with flat binary files. For TFDS-structured datasets, use finish_dedup_wiki40b.py instead.

Code Reference

Source Location

Repository: deduplicate-text-datasets
File: scripts/finish_single_file.py (L1-38)

Signature

python3 scripts/finish_single_file.py <original_file> <remove_file> <output_file>

# Script entry point (not importable as a function)
# sys.argv[1] = original file path
# sys.argv[2] = removal file path (output of collect)
# sys.argv[3] = deduplicated output file path

Import

# CLI script, not importable as a library.
python3 scripts/finish_single_file.py <original_file> <remove_file> <output_file>

I/O Contract

Inputs

Name	Type	Required	Description
original	str (positional)	Yes	Path to the original flat binary data file
remove_file	str (positional)	Yes	Path to the byte-range removal file (from the collect step); format: a line containing "out", followed by one "start end" pair of byte offsets per line
deduped	str (positional)	Yes	Path for the deduplicated output file (the <output_file> argument in the signature)
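
To make the removal-file format concrete, here is a hedged sketch of a parser for it. The function name read_removal_ranges and the sample contents (including the log line before the marker and the offsets themselves) are invented for illustration. Ignoring lines that do not hold exactly two fields is an extra safeguard in this sketch, not something the source describes.

```python
import io


def read_removal_ranges(lines):
    """Return (start, end) pairs that follow the first line containing 'out'."""
    it = iter(lines)
    # Skip everything up to and including the "out" marker line.
    for line in it:
        if "out" in line:
            break
    ranges = []
    for line in it:
        parts = line.split()
        if len(parts) == 2:  # ignore blank or malformed lines (assumption)
            ranges.append((int(parts[0]), int(parts[1])))
    return ranges


# Hypothetical removal file: one log line, the marker, then two ranges.
sample = io.StringIO("some log line\nout\n185 248\n320 412\n")
print(read_removal_ranges(sample))  # [(185, 248), (320, 412)]
```

The same function accepts an open text file handle, since it only needs an iterable of lines.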

Outputs

Name	Type	Description
output_file	Binary file	Deduplicated flat binary file with all specified byte ranges excised

Usage Examples

Basic Deduplication

# Given: /data/wiki40b.test (original) and /tmp/remove.byterange (from collect)
python3 scripts/finish_single_file.py \
    /data/wiki40b.test \
    /tmp/remove.byterange \
    /data/wiki40b.test.deduped

# Verify the output is smaller than the original
ls -la /data/wiki40b.test /data/wiki40b.test.deduped
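
A stricter check than comparing sizes by eye is to confirm that the output shrank by exactly the number of bytes listed for removal. The sketch below assumes the ranges are half-open, sorted, and non-overlapping; the helper names expected_deduped_size and sizes_match are hypothetical.

```python
import os


def expected_deduped_size(original_size, ranges):
    """Size left after removing non-overlapping [start, end) byte ranges."""
    return original_size - sum(end - start for start, end in ranges)


def sizes_match(original_path, deduped_path, ranges):
    """True if the deduplicated file has the size the ranges predict."""
    expected = expected_deduped_size(os.path.getsize(original_path), ranges)
    return os.path.getsize(deduped_path) == expected


# A 10-byte file with ranges (2, 4) and (7, 9) removed should hold 6 bytes.
print(expected_deduped_size(10, [(2, 4), (7, 9)]))  # 6
```

If sizes_match returns False, the ranges may overlap or the removal file may not correspond to the original data file.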

End-to-End Single File Pipeline

# Complete pipeline from raw file to deduplicated output
DATA=/data/my_corpus.bin

# Step 1: Build suffix array
python3 scripts/make_suffix_array.py $DATA

# Step 2: Find self-similar duplicates
./target/debug/dedup_dataset self-similar \
    --data-file $DATA \
    --length-threshold 100 \
    --cache-dir /tmp/cache \
    --num-threads 8

# Step 3: Collect byte ranges
./target/debug/dedup_dataset collect \
    --data-file $DATA \
    --cache-dir /tmp/cache \
    --length-threshold 100 > /tmp/remove.byterange

# Step 4: Apply removal
python3 scripts/finish_single_file.py $DATA /tmp/remove.byterange ${DATA}.deduped

Related Pages

Implements Principle

Principle:Google_research_Deduplicate_text_datasets_Byte_Range_Removal

Requires Environment

Environment:Google_research_Deduplicate_text_datasets_Python_TFDS_Environment
