# Heuristic: Raise the File Descriptor Limit (`ulimit`) for the Merge Step in google-research/deduplicate-text-datasets
| Knowledge Sources | |
|---|---|
| Domains | Debugging, Infrastructure |
| Last Updated | 2026-02-14 21:00 GMT |
## Overview
The suffix array merge step requires raising the OS file descriptor limit to at least 100,000 via `ulimit -Sn 100000` to avoid "too many open files" failures.
## Description
The parallel suffix array merge step opens many file descriptors simultaneously: one for each partial suffix array chunk, plus output files, plus the data file itself. With 96-100 chunks (typical for large datasets), the default Linux file descriptor limit (often 1024) is quickly exceeded. The `make_suffix_array.py` script explicitly checks for merge failure and instructs the user to run `ulimit -Sn 100000`. The README also warns about this: "if you get an error that you have too many open files... You should run `ulimit -Sn 1000000` to fix the error."
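As a rough sanity check, the current soft limit can be compared against the number of files the merge is expected to hold open at once. The sketch below uses the standard `resource` module; the per-thread output count and overhead allowance are illustrative assumptions, not values taken from the script:

```python
import resource


def fd_headroom(num_chunks, thread_outputs=96, overhead=64):
    """Compare the soft FD limit against an estimated merge requirement.

    num_chunks:     partial suffix array chunks opened as merge inputs
    thread_outputs: output files, one per merge thread (assumed figure)
    overhead:       stdio, the data file, shared libraries, etc. (assumed)
    Returns (soft_limit, estimated_need, enough).
    """
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    needed = num_chunks + thread_outputs + overhead
    return soft, needed, soft >= needed


soft, needed, ok = fd_headroom(num_chunks=96)
```

With the default Linux soft limit of 1024, even this conservative estimate (96 + 96 + 64 = 256) leaves little margin once the chunk count grows, and a limit below the chunk count fails outright.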
## Usage
Apply this heuristic before running suffix array construction on any medium-to-large file. If the file descriptor limit is too low, the merge step crashes with a "too many open files" error only after the expensive earlier phases have finished. The README advises doing this preemptively: "You might want to do this preemptively before hitting this crash after hour ten of the job."
## The Insight (Rule of Thumb)
- Action: Run `ulimit -Sn 100000` (or higher) in your shell before starting the pipeline.
- Value: At least 100,000 file descriptors. The README suggests 1,000,000 for extra safety.
- Trade-off: No performance cost. The only risk is that the OS may reject very high values if the hard limit (`ulimit -Hn`) is not also raised (requires root).
- Timing: Must be set before the merge step begins. Can be set at the start of the session for safety.
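For pipelines driven from Python, the soft limit can also be raised in-process before the merge starts, which removes the dependency on the user's shell configuration. This helper is a sketch using the standard `resource` module and is not part of the repo's scripts:

```python
import resource


def raise_nofile_soft_limit(target=100_000):
    """Raise the soft FD limit toward `target`, capped by the hard limit.

    Mirrors `ulimit -Sn`: an unprivileged process may raise its soft
    limit only up to the hard limit (`ulimit -Hn`); going beyond that
    requires root. Returns the (soft, hard) pair actually in effect.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if hard == resource.RLIM_INFINITY:
        new_soft = target
    else:
        new_soft = min(target, hard)
    if new_soft > soft:
        resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
    return resource.getrlimit(resource.RLIMIT_NOFILE)
```

Calling `raise_nofile_soft_limit()` at the top of a driver script gives the same effect as running `ulimit -Sn 100000` first, for that process and its children.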
## Reasoning
The merge algorithm opens all partial suffix arrays simultaneously to perform a multi-way merge sort. With 96 chunks (the default for 1-10GB files), this means 96+ open file handles just for suffix array inputs, plus output file handles for each thread. The default Linux soft limit of 1024 file descriptors is far too low. The Python script detects this failure mode specifically and prints a diagnostic message, but by the time the error occurs, the `make-part` phase (which can take hours) has already completed, making the failure costly.
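Why every chunk handle must stay open can be seen in a minimal k-way merge sketch (illustrative Python, not the repo's actual merge implementation): the merge pulls lazily from all k sorted streams, so all k files remain open for the entire duration of the merge.

```python
import heapq
import os
import tempfile

# Write a few sorted "chunk" files standing in for partial suffix arrays.
tmpdir = tempfile.mkdtemp()
chunks = []
for i in range(3):
    path = os.path.join(tmpdir, f"chunk{i}.txt")
    with open(path, "w") as f:
        f.write("\n".join(str(v) for v in range(i, 30, 3)) + "\n")
    chunks.append(path)

# Every chunk file is opened up front and stays open until the merge
# finishes -- this is why the FD limit must exceed the chunk count.
handles = [open(p) for p in chunks]
merged = [int(line) for line in heapq.merge(*handles, key=int)]
for h in handles:
    h.close()
```

Scaling this from 3 chunks to 96+ (plus per-thread outputs) is what pushes the process past the 1024-descriptor default.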
## Code Evidence
Merge failure check and diagnostic from `scripts/make_suffix_array.py:97-100`:
    output = pipe.read()
    if pipe.close() is not None:
        print("Something went wrong with merging.")
        print("Please check that you ran with ulimit -Sn 100000")
        exit(1)
README warning from `README.md:88`:
> (When running on larger files, if you get an error that you have too many open files,
> that's because this script opens lots of files. You should run `ulimit -Sn 1000000` to
> "fix" the error. You might want to do this preemptively before hitting this crash after
> hour ten of the job.)