Heuristic:Google research Deduplicate text datasets Parallel Job Scaling By Data Size

Knowledge Sources	deduplicate-text-datasets Internal
Domains	Optimization, Text_Deduplication
Last Updated	2026-02-14 21:00 GMT

Overview

Adaptive parallelism strategy that scales the number of suffix array construction jobs based on input data size, from 1 job for small files to 100 jobs for 10GB+ files.

Description

The `make_suffix_array.py` script dynamically adjusts the total number of parallel jobs and the concurrency level based on the size of the input data file. This prevents over-parallelization on small files (where overhead dominates) and under-parallelization on large files (where throughput matters). The scaling tiers are empirically tuned: very large files (>10GB) use 100 total jobs with 20 running concurrently, medium files (>1GB) use 96 jobs all at once, small files (>10MB) use 4 jobs, and tiny files use a single sequential job.

Usage

This heuristic is applied automatically when running `scripts/make_suffix_array.py`. Understanding the scaling tiers is useful when:

Tuning for specific hardware (e.g., machines with fewer cores)
Debugging why suffix array construction is slow on a particular file size
Adjusting for extremely large datasets where 100 jobs with 20 concurrent may not be optimal

The Insight (Rule of Thumb)

Action: Scale parallel job count based on data file size.
Value:
- >10GB: 100 total jobs, 20 concurrent
- >1GB: 96 total jobs, 96 concurrent
- >10MB: 4 total jobs, 4 concurrent
- <=10MB: 1 job (sequential)
Trade-off: For very large files (>10GB), concurrency is limited to 20 to avoid memory exhaustion since each job loads its chunk into memory. For medium files, full parallelism is used because per-job memory is manageable.
Note: The merge step always uses `mp.cpu_count()` threads regardless of this heuristic.

Reasoning

Each `make-part` job loads its chunk of the data file into memory and builds a suffix array for it. Memory consumption per job is approximately `chunk_size * constant` where the constant accounts for the SA-IS algorithm's working memory. For a 300GB file with 100 jobs, each chunk is ~3GB, and 20 concurrent jobs would use ~60GB of working memory plus the overhead. Running all 100 concurrently would require prohibitive memory.

For medium files (1-10GB), chunks are small enough (10-100MB each) that all 96 jobs can run simultaneously without memory issues, and the overhead of job management is amortized across many parallel jobs.

The README advises: "If you want to deduplicate small (<10GB) datasets, it should work on any modern machine with ~16GB of RAM and a few CPU cores. If you want to deduplicate something the size of C4 (~300GB) you will want a machine with as many cores as you can get (we used 96 cores) and >600GB of RAM."

Code Evidence

Job scaling tiers from `scripts/make_suffix_array.py:28-39`:

if data_size > 10e9:
    total_jobs = 100
    jobs_at_once = 20
elif data_size > 1e9:
    total_jobs = 96
    jobs_at_once = 96
elif data_size > 10e6:
    total_jobs = 4
    jobs_at_once = 4
else:
    total_jobs = 1
    jobs_at_once = 1

Merge step using all available CPU cores from `scripts/make_suffix_array.py:94-95`:

pipe = os.popen("./target/debug/dedup_dataset merge --output-file %s --suffix-path %s --num-threads %d"%("tmp/out.table.bin", torun, mp.cpu_count()))

Related Pages

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment