
Heuristic:Cleanlab Multiprocessing Platform Strategy

From Leeroopedia



Knowledge Sources
Domains Performance, Infrastructure
Last Updated 2026-02-09 19:30 GMT

Overview

Platform-aware multiprocessing configuration that adjusts `n_jobs`, data sharing strategy, and process start method based on OS, Python version, and dataset size.

Description

Cleanlab's `find_label_issues` function parallelizes per-class pruning across CPU cores. The optimal multiprocessing strategy depends on the operating system (fork vs spawn process start methods), the Python version (3.14+ changes fork behavior on Linux), and the dataset size. This heuristic encodes the tribal knowledge about when multiprocessing helps vs hurts performance.

Usage

This heuristic is automatically applied in find_label_issues when `n_jobs` is not explicitly set. Understanding it helps when:

  • Debugging multiprocessing errors (e.g., "pred_probs is not defined")
  • Optimizing runtime for large datasets
  • Running on Windows/macOS where defaults differ from Linux

The Insight (Rule of Thumb)

  • Action 1 — n_jobs selection:
    • On Linux: default to physical CPU core count (via psutil) or logical core count as fallback
    • On Windows/macOS with multi_label=True: force `n_jobs=1` because spawn-based multiprocessing is much slower
    • On Windows/macOS with multi_label=False: use physical or logical cores
  • Action 2 — Data sharing strategy:
    • When `n_jobs=1`: use global variables (no multiprocessing overhead)
    • On Linux with Python < 3.14: use global variables with copy-on-write (fork semantics)
    • On Windows/macOS or Python >= 3.14: pickle data to subprocesses via input args
  • Action 3 — Big dataset detection:
    • A dataset is "big" when `K * len(labels) > 1e8` (100 million elements)
    • Big datasets show tqdm progress bars (if installed) during multiprocessing
  • Trade-off: Multiprocessing adds overhead from process creation and data serialization. For small datasets or on Windows/macOS with multi-label data, single-threaded execution is faster.
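Taken together, the `n_jobs` selection rules above can be sketched as a small helper. The function name `choose_n_jobs` is illustrative only and is not part of Cleanlab's API:

```python
import multiprocessing
import platform


def choose_n_jobs(multi_label: bool) -> int:
    """Pick a worker count following the platform-aware rules above.

    Illustrative sketch, not Cleanlab's actual implementation.
    """
    if multi_label and platform.system() != "Linux":
        # spawn-based multiprocessing is much slower for multi-label data
        return 1
    try:
        import psutil

        n_jobs = psutil.cpu_count(logical=False)  # physical cores
    except ImportError:
        n_jobs = None
    # psutil may be missing or return None; fall back to logical cores
    return n_jobs or multiprocessing.cpu_count()
```

With `multi_label=True` on Windows or macOS this returns 1; otherwise it prefers the physical core count when psutil is available and falls back to the logical count.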

Reasoning

Linux uses `fork()` to create child processes, which gives copy-on-write access to parent memory without serialization. This makes data sharing via global variables efficient. Windows and macOS use `spawn()`, which requires pickling all data to child processes, adding significant overhead especially for large arrays.
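The difference can be checked directly from the standard library; the result depends on the OS and Python version:

```python
import multiprocessing
import platform

# Linux defaults to "fork" before Python 3.14 (then "forkserver");
# Windows and macOS default to "spawn", which pickles data to children.
method = multiprocessing.get_start_method()
print(platform.system(), "->", method)
```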

Python 3.14 changes the default start method on Linux away from `fork` (to `forkserver`), so fork-based global-variable sharing is no longer reliable even on Linux and the pickle-based approach is required everywhere.

The `psutil.cpu_count(logical=False)` preference for physical cores avoids oversubscription on hyperthreaded CPUs, where the logical core count can be twice the physical count even though CPU-bound pruning work gains little from hyperthreading.
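The two counts can be compared directly when psutil is installed (a sketch; psutil is an optional dependency of Cleanlab):

```python
import os

try:
    import psutil

    physical = psutil.cpu_count(logical=False)
    logical = psutil.cpu_count(logical=True)
    # On a hyperthreaded CPU, logical is typically 2 * physical.
except ImportError:
    # Without psutil, only the logical count is available from the stdlib.
    physical, logical = None, os.cpu_count()
```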

Code Evidence:

Platform-specific n_jobs selection from `cleanlab/filter.py:260-281`:

# On Windows/macOS, when multi_label is True, multiprocessing is much slower
# even for fairly large input arrays, so we default to n_jobs=1 in this case
os_name = platform.system()
if n_jobs is None:
    if multi_label and os_name != "Linux":
        n_jobs = 1
    else:
        if psutil_exists:
            n_jobs = psutil.cpu_count(logical=False)  # physical cores
        elif big_dataset:
            print(
                "To default `n_jobs` to the number of physical cores..."
            )
        if not n_jobs:
            n_jobs = multiprocessing.cpu_count()

Data sharing strategy from `cleanlab/filter.py:358-365`:

# On Linux with Python <3.14, multiprocessing is started with fork,
# so data can be shared with global variables + COW
# On Windows/macOS, processes are started with spawn,
# so data will need to be pickled to the subprocesses through input args
# In Python 3.14+, global variable sharing is no longer reliable even on Linux
chunksize = max(1, K // n_jobs)
use_global_vars = n_jobs == 1 or (os_name == "Linux" and sys.version_info < (3, 14))
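The same predicate can be evaluated for the current interpreter (a sketch; the worker count and class count below are illustrative):

```python
import platform
import sys

os_name = platform.system()
n_jobs = 4  # illustrative worker count
K = 10      # illustrative number of classes

# One chunk of classes per worker, never less than 1
chunksize = max(1, K // n_jobs)
# Globals are safe with a single worker, or with fork semantics on
# Linux before Python 3.14
use_global_vars = n_jobs == 1 or (os_name == "Linux" and sys.version_info < (3, 14))
```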

Big dataset threshold from `cleanlab/filter.py:257-258`:

# Boolean set to true if dataset is large
big_dataset = K * len(labels) > 1e8
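As a worked example of the threshold (the class count and dataset size here are illustrative):

```python
K = 1_000             # number of classes
n_examples = 200_000  # len(labels)

# 1_000 * 200_000 = 2e8 elements, which exceeds the 1e8 cutoff
big_dataset = K * n_examples > 1e8
```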
