Implementation:Google research Deduplicate text datasets Make Suffix Array

Knowledge Sources	deduplicate-text-datasets Deduplicating Training Data Makes Language Models Better
Domains	Data_Structures, String_Algorithms, Text_Deduplication
Last Updated	2026-02-14 21:00 GMT

Overview

Concrete tool for building suffix arrays over large binary files provided by the deduplicate-text-datasets repository.

Description

The suffix array construction is a two-layer system: a Python orchestrator (make_suffix_array.py) that partitions the input file, manages parallel jobs, and validates results, plus Rust CLI subcommands (make-part and merge) that perform the actual SA-IS construction and multi-way merge.

The Python script automatically determines parallelism based on file size (1 job for files under 10MB, 4 jobs for 10MB-1GB, 96 jobs for 1-10GB, 100 jobs for 10GB+). Each chunk overlaps by 100,000 bytes (the HACK constant) to handle cross-boundary suffix matches. After building partial suffix arrays, the script merges them using a multi-threaded merge operation and validates the final output file size.

Usage

Import this tool when you need to create a suffix array index for a flat binary data file. This is the required first step before running any deduplication (self-similar, across-similar) or query (count-occurrences) operation. Run it once per dataset; the resulting .table.bin file can be reused.

Code Reference

Source Location

Repository: deduplicate-text-datasets
File (orchestrator): scripts/make_suffix_array.py (L1-113)
File (make-part): src/main.rs (L588-621)
File (merge): src/main.rs (L1213-1406)
File (SA-IS core): src/table.rs (L110-117)

Signature

# Python orchestrator (CLI script, not importable)
python3 scripts/make_suffix_array.py <data_path>

# Underlying Rust subcommands called by the orchestrator:
dedup_dataset make-part \
    --data-file <path> \
    --start-byte <start> \
    --end-byte <end>

dedup_dataset merge \
    --output-file <output_path> \
    --suffix-path <part1> [--suffix-path <part2> ...] \
    --num-threads <n>

Import

# This is a CLI tool, not an importable library.
# Requires: cargo build (to produce ./target/debug/dedup_dataset)
# Then run:
python3 scripts/make_suffix_array.py <data_path>

I/O Contract

Inputs

Name	Type	Required	Description
data_path	str (positional)	Yes	Path to the flat binary data file (e.g., output from load_dataset.py)

Outputs

Name	Type	Description
<data_path>.table.bin	Binary file	Suffix array as variable-width packed integers; size = ceil(log2(file_size)/8) * file_size bytes
<data_path>.part.<start>-<end>	Temp files	Intermediate chunk files (cleaned up after merge)
<data_path>.part.<start>-<end>.table.bin	Temp files	Partial suffix arrays per chunk (cleaned up after merge)

Usage Examples

Build Suffix Array for a Dataset

# Step 1: Build the Rust binary
cargo build

# Step 2: Ensure ulimit is set for merge (many open files)
ulimit -Sn 100000

# Step 3: Build the suffix array
python3 scripts/make_suffix_array.py /data/wiki40b.test

# Output: /data/wiki40b.test.table.bin
# Verify: file size should be a multiple of the data file size

Verify Output

import os

data_path = "/data/wiki40b.test"
table_path = data_path + ".table.bin"

data_size = os.path.getsize(data_path)
table_size = os.path.getsize(table_path)

# The suffix array should be an exact multiple of the data size
assert table_size % data_size == 0, "Suffix array file size is invalid"

# The width factor tells us how many bytes per pointer
width = table_size // data_size
print(f"Pointer width: {width} bytes (addresses up to {256**width} bytes)")

Related Pages

Implements Principle

Principle:Google_research_Deduplicate_text_datasets_Suffix_Array_Construction

Requires Environment

Uses Heuristic

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment