Implementation:Google research Deduplicate text datasets Make Suffix Array
| Knowledge Sources | |
|---|---|
| Domains | Data_Structures, String_Algorithms, Text_Deduplication |
| Last Updated | 2026-02-14 21:00 GMT |
Overview
Concrete tool for building suffix arrays over large binary files provided by the deduplicate-text-datasets repository.
Description
The suffix array construction is a two-layer system: a Python orchestrator (make_suffix_array.py) that partitions the input file, manages parallel jobs, and validates results, plus Rust CLI subcommands (make-part and merge) that perform the actual SA-IS construction and multi-way merge.
The Python script automatically determines parallelism based on file size (1 job for files under 10MB, 4 jobs for 10MB-1GB, 96 jobs for 1-10GB, 100 jobs for 10GB+). Each chunk overlaps by 100,000 bytes (the HACK constant) to handle cross-boundary suffix matches. After building partial suffix arrays, the script merges them using a multi-threaded merge operation and validates the final output file size.
Usage
Import this tool when you need to create a suffix array index for a flat binary data file. This is the required first step before running any deduplication (self-similar, across-similar) or query (count-occurrences) operation. Run it once per dataset; the resulting .table.bin file can be reused.
Code Reference
Source Location
- Repository: deduplicate-text-datasets
- File (orchestrator): scripts/make_suffix_array.py (L1-113)
- File (make-part): src/main.rs (L588-621)
- File (merge): src/main.rs (L1213-1406)
- File (SA-IS core): src/table.rs (L110-117)
Signature
# Python orchestrator (CLI script, not importable)
python3 scripts/make_suffix_array.py <data_path>
# Underlying Rust subcommands called by the orchestrator:
dedup_dataset make-part \
--data-file <path> \
--start-byte <start> \
--end-byte <end>
dedup_dataset merge \
--output-file <output_path> \
--suffix-path <part1> [--suffix-path <part2> ...] \
--num-threads <n>
Import
# This is a CLI tool, not an importable library.
# Requires: cargo build (to produce ./target/debug/dedup_dataset)
# Then run:
python3 scripts/make_suffix_array.py <data_path>
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| data_path | str (positional) | Yes | Path to the flat binary data file (e.g., output from load_dataset.py) |
Outputs
| Name | Type | Description |
|---|---|---|
| <data_path>.table.bin | Binary file | Suffix array as variable-width packed integers; size = ceil(log2(file_size)/8) * file_size bytes |
| <data_path>.part.<start>-<end> | Temp files | Intermediate chunk files (cleaned up after merge) |
| <data_path>.part.<start>-<end>.table.bin | Temp files | Partial suffix arrays per chunk (cleaned up after merge) |
Usage Examples
Build Suffix Array for a Dataset
# Step 1: Build the Rust binary
cargo build
# Step 2: Ensure ulimit is set for merge (many open files)
ulimit -Sn 100000
# Step 3: Build the suffix array
python3 scripts/make_suffix_array.py /data/wiki40b.test
# Output: /data/wiki40b.test.table.bin
# Verify: file size should be a multiple of the data file size
Verify Output
import os
data_path = "/data/wiki40b.test"
table_path = data_path + ".table.bin"
data_size = os.path.getsize(data_path)
table_size = os.path.getsize(table_path)
# The suffix array should be an exact multiple of the data size
assert table_size % data_size == 0, "Suffix array file size is invalid"
# The width factor tells us how many bytes per pointer
width = table_size // data_size
print(f"Pointer width: {width} bytes (addresses up to {256**width} bytes)")
Related Pages
Implements Principle
Requires Environment
- Environment:Google_research_Deduplicate_text_datasets_Rust_Cargo_Build_Environment
- Environment:Google_research_Deduplicate_text_datasets_Python_TFDS_Environment
Uses Heuristic
- Heuristic:Google_research_Deduplicate_text_datasets_HACKSIZE_Overlap_Buffer
- Heuristic:Google_research_Deduplicate_text_datasets_Variable_Width_Pointer_Optimization
- Heuristic:Google_research_Deduplicate_text_datasets_Parallel_Job_Scaling_By_Data_Size
- Heuristic:Google_research_Deduplicate_text_datasets_Ulimit_File_Descriptors_For_Merge