# Heuristic: Cleanlab Multiprocessing Platform Strategy
| Knowledge Sources | |
|---|---|
| Domains | Performance, Infrastructure |
| Last Updated | 2026-02-09 19:30 GMT |
## Overview
Platform-aware multiprocessing configuration that adjusts `n_jobs`, data sharing strategy, and process start method based on OS, Python version, and dataset size.
## Description
Cleanlab's `find_label_issues` function parallelizes per-class pruning across CPU cores. The optimal multiprocessing strategy depends on the operating system (fork vs spawn process start methods), the Python version (3.14+ changes fork behavior on Linux), and the dataset size. This heuristic encodes the tribal knowledge about when multiprocessing helps vs hurts performance.
## Usage
This heuristic is automatically applied in find_label_issues when `n_jobs` is not explicitly set. Understanding it helps when:
- Debugging multiprocessing errors (e.g., "pred_probs is not defined")
- Optimizing runtime for large datasets
- Running on Windows/macOS where defaults differ from Linux
## The Insight (Rule of Thumb)
- Action 1 — n_jobs selection:
- On Linux: default to physical CPU core count (via psutil) or logical core count as fallback
- On Windows/macOS with multi_label=True: force `n_jobs=1` because spawn-based multiprocessing is much slower
- On Windows/macOS with multi_label=False: use physical or logical cores
- Action 2 — Data sharing strategy:
- When `n_jobs=1`: use global variables (no multiprocessing overhead)
- On Linux with Python < 3.14: use global variables with copy-on-write (fork semantics)
- On Windows/macOS or Python >= 3.14: pickle data to subprocesses via input args
- Action 3 — Big dataset detection:
- A dataset is "big" when `K * len(labels) > 1e8` (100 million elements)
- Big datasets show tqdm progress bars (if installed) during multiprocessing
- Trade-off: Multiprocessing adds overhead from process creation and data serialization. For small datasets or on Windows/macOS with multi-label data, single-threaded execution is faster.
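Taken together, the three actions above can be sketched as a single decision function. This is a minimal illustration of the rules, not cleanlab's API: `choose_strategy` is a hypothetical helper, and the hard-coded core count stands in for the psutil/multiprocessing lookup.

```python
def choose_strategy(os_name, py_version, n_jobs, multi_label, K, n_labels):
    """Return (n_jobs, use_global_vars, big_dataset) per the heuristic above."""
    # Action 3: "big" means more than 100 million matrix elements.
    big_dataset = K * n_labels > 1e8
    # Action 1: pick n_jobs only when the caller did not set it.
    if n_jobs is None:
        if multi_label and os_name != "Linux":
            n_jobs = 1  # spawn-based multiprocessing is too slow here
        else:
            n_jobs = 4  # stand-in for the physical/logical core count
    # Action 2: global variables are safe single-process, or with fork on
    # Linux before Python 3.14; otherwise data must be pickled to workers.
    use_global_vars = n_jobs == 1 or (os_name == "Linux" and py_version < (3, 14))
    return n_jobs, use_global_vars, big_dataset

# Windows + multi-label: single process, global variables, small data.
print(choose_strategy("Windows", (3, 12), None, True, 10, 1000))
# -> (1, True, False)
# Linux + Python 3.14: pickle-based sharing even though fork exists.
print(choose_strategy("Linux", (3, 14), 8, False, 100, 2_000_000))
# -> (8, False, True)
```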
## Reasoning
Linux uses `fork()` to create child processes, which gives copy-on-write access to parent memory without serialization. This makes data sharing via global variables efficient. Windows and macOS use `spawn()`, which requires pickling all data to child processes, adding significant overhead especially for large arrays.
Python 3.14 changes the default multiprocessing start method on Linux from fork to forkserver, so fork-based global variable sharing is no longer reliable even there; the pickle-based approach is therefore required everywhere.
The `psutil.cpu_count(logical=False)` preference for physical cores avoids oversubscription on hyperthreaded CPUs, where logical core count can be 2x physical cores but CPU-bound tasks do not benefit from hyperthreading.
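The physical-core preference with a logical-core fallback can be sketched as follows (stdlib `os.cpu_count` standing in for `multiprocessing.cpu_count`; psutil may be absent, and `psutil.cpu_count(logical=False)` can itself return `None` on some platforms):

```python
import os

try:
    import psutil
    n_jobs = psutil.cpu_count(logical=False)  # physical cores; may be None
except ImportError:
    n_jobs = None

if not n_jobs:
    n_jobs = os.cpu_count() or 1  # logical cores as fallback
```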
## Code Evidence
Platform-specific n_jobs selection from `cleanlab/filter.py:260-281`:

```python
# On Windows/macOS, when multi_label is True, multiprocessing is much slower
# even for fairly large input arrays, so we default to n_jobs=1 in this case
os_name = platform.system()
if n_jobs is None:
    if multi_label and os_name != "Linux":
        n_jobs = 1
    else:
        if psutil_exists:
            n_jobs = psutil.cpu_count(logical=False)  # physical cores
        elif big_dataset:
            print(
                "To default `n_jobs` to the number of physical cores..."
            )
        if not n_jobs:
            n_jobs = multiprocessing.cpu_count()
```
Data sharing strategy from `cleanlab/filter.py:358-365`:

```python
# On Linux with Python <3.14, multiprocessing is started with fork,
# so data can be shared with global variables + COW
# On Windows/macOS, processes are started with spawn,
# so data will need to be pickled to the subprocesses through input args
# In Python 3.14+, global variable sharing is no longer reliable even on Linux
chunksize = max(1, K // n_jobs)
use_global_vars = n_jobs == 1 or (os_name == "Linux" and sys.version_info < (3, 14))
```
Big dataset threshold from `cleanlab/filter.py:257-258`:

```python
# Boolean set to true if dataset is large
big_dataset = K * len(labels) > 1e8
```