Heuristic:Google research Deduplicate text datasets Parallel Job Scaling By Data Size
| Knowledge Sources | |
|---|---|
| Domains | Optimization, Text_Deduplication |
| Last Updated | 2026-02-14 21:00 GMT |
Overview
Adaptive parallelism strategy that scales the number of suffix array construction jobs based on input data size, from 1 job for small files to 100 jobs for 10GB+ files.
Description
The `make_suffix_array.py` script dynamically adjusts the total number of parallel jobs and the concurrency level based on the size of the input data file. This prevents over-parallelization on small files (where overhead dominates) and under-parallelization on large files (where throughput matters). The scaling tiers are empirically tuned: very large files (>10GB) use 100 total jobs with 20 running concurrently, medium files (>1GB) use 96 jobs all at once, small files (>10MB) use 4 jobs, and tiny files use a single sequential job.
Usage
This heuristic is applied automatically when running `scripts/make_suffix_array.py`. Understanding the scaling tiers is useful when:
- Tuning for specific hardware (e.g., machines with fewer cores)
- Debugging why suffix array construction is slow on a particular file size
- Adjusting for extremely large datasets where 100 jobs with 20 concurrent may not be optimal
The Insight (Rule of Thumb)
- Action: Scale parallel job count based on data file size.
- Value:
- >10GB: 100 total jobs, 20 concurrent
- >1GB: 96 total jobs, 96 concurrent
- >10MB: 4 total jobs, 4 concurrent
- <=10MB: 1 job (sequential)
- Trade-off: For very large files (>10GB), concurrency is limited to 20 to avoid memory exhaustion since each job loads its chunk into memory. For medium files, full parallelism is used because per-job memory is manageable.
- Note: The merge step always uses `mp.cpu_count()` threads regardless of this heuristic.
Reasoning
Each `make-part` job loads its chunk of the data file into memory and builds a suffix array for it. Memory consumption per job is approximately `chunk_size * constant` where the constant accounts for the SA-IS algorithm's working memory. For a 300GB file with 100 jobs, each chunk is ~3GB, and 20 concurrent jobs would use ~60GB of working memory plus the overhead. Running all 100 concurrently would require prohibitive memory.
For medium files (1-10GB), chunks are small enough (10-100MB each) that all 96 jobs can run simultaneously without memory issues, and the overhead of job management is amortized across many parallel jobs.
The README advises: "If you want to deduplicate small (<10GB) datasets, it should work on any modern machine with ~16GB of RAM and a few CPU cores. If you want to deduplicate something the size of C4 (~300GB) you will want a machine with as many cores as you can get (we used 96 cores) and >600GB of RAM."
Code Evidence
Job scaling tiers from `scripts/make_suffix_array.py:28-39`:
if data_size > 10e9:
total_jobs = 100
jobs_at_once = 20
elif data_size > 1e9:
total_jobs = 96
jobs_at_once = 96
elif data_size > 10e6:
total_jobs = 4
jobs_at_once = 4
else:
total_jobs = 1
jobs_at_once = 1
Merge step using all available CPU cores from `scripts/make_suffix_array.py:94-95`:
pipe = os.popen("./target/debug/dedup_dataset merge --output-file %s --suffix-path %s --num-threads %d"%("tmp/out.table.bin", torun, mp.cpu_count()))