Heuristic:Google research Deduplicate text datasets HACKSIZE Overlap Buffer
| Knowledge Sources | |
|---|---|
| Domains | Optimization, Text_Deduplication |
| Last Updated | 2026-02-14 21:00 GMT |
Overview
Cross-boundary overlap buffer of 100,000 bytes (HACKSIZE) ensures correct suffix array merging across chunked partitions.
Description
When building suffix arrays in parallel, the input data file is split into contiguous chunks that are processed independently. The problem is that a duplicate substring might span the boundary between two chunks, causing it to be missed during the merge step. To solve this, each chunk overlaps with the next by HACKSIZE bytes (100,000). This means `S_{i}[-HACKSIZE:]` equals `S_{i+1}[:HACKSIZE]`. During the merge, the overlapping region is stripped away, but its presence ensures that any match shorter than HACKSIZE bytes is correctly captured even at chunk boundaries.
Usage
This heuristic applies whenever the parallel suffix array construction path is used (i.e., `make_suffix_array.py` with chunked `make-part` + `merge`). It is critical for correctness: if HACKSIZE is smaller than the longest potential duplicate, matches at chunk boundaries may be silently missed. The default value of 100,000 is sufficient for typical text deduplication where duplicates are usually much shorter.
The Insight (Rule of Thumb)
- Action: Set the overlap buffer (HACKSIZE) to at least the maximum expected duplicate length.
- Value: 100,000 bytes (default). This constant is declared in both `src/main.rs:1216` and `scripts/make_suffix_array.py:23` and must be kept in sync.
- Trade-off: Larger HACKSIZE means more redundant data per chunk (more disk I/O and memory), but ensures no cross-boundary matches are missed. Reducing it risks silent correctness failures.
- Warning: The code author explicitly notes: "I did call it hacksize after all..... In practice this works. It may not for your use case if there are long duplicates."
Reasoning
The parallel suffix array merge algorithm processes each chunk independently. Without overlap, a duplicate substring that starts in chunk N and ends in chunk N+1 would not appear in either chunk's suffix array as a complete entry. The 100,000-byte overlap guarantees that any substring of length up to 100,000 that crosses a boundary is fully contained in at least one chunk. For text deduplication, the typical `--length-threshold` is 50-200 bytes, so 100,000 provides a very large safety margin.
The constant must be identical in both the Python orchestrator (which determines chunk boundaries) and the Rust merge function (which strips the overlap). A mismatch would produce incorrect suffix arrays.
From the code comment in `src/main.rs:1208-1211`:
So in practice we make sure that S_{i}[-HACKSIZE:] === S_{i+1}[:HACKSIZE].
As long as HACKSIZE is longer than the longest potential match, everything
will work out correctly. (I did call it hacksize after all.....)
In practice this works. It may not for your use case if there are long duplicates.
Code Evidence
Rust constant from `src/main.rs:1216`:
const HACKSIZE:usize=100000;
Python constant from `scripts/make_suffix_array.py:23`:
HACK = 100000
Chunk boundary calculation with overlap from `scripts/make_suffix_array.py:47`:
s, e = i*S, min((i+1)*S+HACK, data_size)
Overlap stripping during merge from `src/main.rs:1232`:
let texts_len:Vec<usize> = texts.iter().enumerate().map(|(i,x)| x.len() - (if i+1 == texts.len() {0} else {HACKSIZE})).collect();