# Heuristic: Raise the File Descriptor Limit (`ulimit`) for the Merge Step in google-research/deduplicate-text-datasets
| Knowledge Sources | |
|---|---|
| Domains | Debugging, Infrastructure |
| Last Updated | 2026-02-14 21:00 GMT |
## Overview
The suffix array merge step requires raising the OS file descriptor limit to at least 100,000 via `ulimit -Sn 100000` to avoid "too many open files" failures.
## Description
The parallel suffix array merge step opens many file descriptors simultaneously: one for each partial suffix array chunk, plus output files, plus the data file itself. With 96-100 chunks (typical for large datasets), the default Linux file descriptor limit (often 1024) is quickly exceeded. The `make_suffix_array.py` script explicitly checks for merge failure and instructs the user to run `ulimit -Sn 100000`. The README also warns about this: "if you get an error that you have too many open files... You should run `ulimit -Sn 1000000` to fix the error."
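As a rough sanity check, the current soft limit can be compared against the number of files the merge is expected to hold open at once. The sketch below uses the standard `resource` module; the per-thread output count and overhead allowance are illustrative assumptions, not values taken from the script:

```python
import resource


def fd_headroom(num_chunks, thread_outputs=96, overhead=64):
    """Compare the soft FD limit against an estimated merge requirement.

    num_chunks:     partial suffix array chunks opened as merge inputs
    thread_outputs: output files, one per merge thread (assumed figure)
    overhead:       stdio, the data file, shared libraries, etc. (assumed)
    Returns (soft_limit, estimated_need, enough).
    """
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    needed = num_chunks + thread_outputs + overhead
    return soft, needed, soft >= needed


soft, needed, ok = fd_headroom(num_chunks=96)
```

With the default Linux soft limit of 1024, even this conservative estimate (96 + 96 + 64 = 256) leaves little margin once the chunk count grows, and a limit below the chunk count fails outright.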
## Usage
Apply this heuristic before running suffix array construction on any medium-to-large file. If the file descriptor limit is too low, the merge step crashes with a "too many open files" error only after the expensive earlier phases have finished. The README advises doing this preemptively: "You might want to do this preemptively before hitting this crash after hour ten of the job."
## The Insight (Rule of Thumb)
- Action: Run `ulimit -Sn 100000` (or higher) in your shell before starting the pipeline.
- Value: At least 100,000 file descriptors. The README suggests 1,000,000 for extra safety.
- Trade-off: No performance cost. The only risk is that the OS may reject very high values if the hard limit (`ulimit -Hn`) is not also raised (requires root).
- Timing: Must be set before the merge step begins. Can be set at the start of the session for safety.
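For pipelines driven from Python, the soft limit can also be raised in-process before the merge starts, which removes the dependency on the user's shell configuration. This helper is a sketch using the standard `resource` module and is not part of the repo's scripts:

```python
import resource


def raise_nofile_soft_limit(target=100_000):
    """Raise the soft FD limit toward `target`, capped by the hard limit.

    Mirrors `ulimit -Sn`: an unprivileged process may raise its soft
    limit only up to the hard limit (`ulimit -Hn`); going beyond that
    requires root. Returns the (soft, hard) pair actually in effect.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if hard == resource.RLIM_INFINITY:
        new_soft = target
    else:
        new_soft = min(target, hard)
    if new_soft > soft:
        resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
    return resource.getrlimit(resource.RLIMIT_NOFILE)
```

Calling `raise_nofile_soft_limit()` at the top of a driver script gives the same effect as running `ulimit -Sn 100000` first, for that process and its children.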
## Reasoning
The merge algorithm opens all partial suffix arrays simultaneously to perform a multi-way merge sort. With 96 chunks (the default for 1-10GB files), this means 96+ open file handles just for suffix array inputs, plus output file handles for each thread. The default Linux soft limit of 1024 file descriptors is far too low. The Python script detects this failure mode specifically and prints a diagnostic message, but by the time the error occurs, the `make-part` phase (which can take hours) has already completed, making the failure costly.
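Why every chunk handle must stay open can be seen in a minimal k-way merge sketch (illustrative Python, not the repo's actual merge implementation): the merge pulls lazily from all k sorted streams, so all k files remain open for the entire duration of the merge.

```python
import heapq
import os
import tempfile

# Write a few sorted "chunk" files standing in for partial suffix arrays.
tmpdir = tempfile.mkdtemp()
chunks = []
for i in range(3):
    path = os.path.join(tmpdir, f"chunk{i}.txt")
    with open(path, "w") as f:
        f.write("\n".join(str(v) for v in range(i, 30, 3)) + "\n")
    chunks.append(path)

# Every chunk file is opened up front and stays open until the merge
# finishes -- this is why the FD limit must exceed the chunk count.
handles = [open(p) for p in chunks]
merged = [int(line) for line in heapq.merge(*handles, key=int)]
for h in handles:
    h.close()
```

Scaling this from 3 chunks to 96+ (plus per-thread outputs) is what pushes the process past the 1024-descriptor default.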
## Code Evidence
Merge failure check and diagnostic from `scripts/make_suffix_array.py:97-100`:
    output = pipe.read()
    if pipe.close() is not None:
        print("Something went wrong with merging.")
        print("Please check that you ran with ulimit -Sn 100000")
        exit(1)
README warning from `README.md:88`:
> (When running on larger files, if you get an error that you have too many open files,
> that's because this script opens lots of files. You should run `ulimit -Sn 1000000` to
> "fix" the error. You might want to do this preemptively before hitting this crash after
> hour ten of the job.)