Heuristic:Scikit_learn_Scikit_learn_N_Jobs_Parallelism_Tips
| Knowledge Sources | |
|---|---|
| Domains | Optimization, Parallelism |
| Last Updated | 2026-02-08 15:00 GMT |
Overview
Best practices for configuring `n_jobs` parallelism to avoid thread oversubscription between joblib, OpenMP, and BLAS layers.
Description
Scikit-learn has three independent layers of parallelism: joblib (Python multiprocessing for cross-validation folds, grid search candidates, and ensemble tree building), OpenMP (Cython-level threading for pairwise distances and tree node splitting), and BLAS (linear algebra operations via OpenBLAS/MKL). Each layer independently spawns threads or processes. When all three run simultaneously, threads can exceed available CPU cores (oversubscription), causing dramatic performance degradation due to context switching overhead. The `n_jobs` parameter controls only the joblib layer.
Usage
Apply this heuristic when using the `n_jobs` parameter in Cross_Validate, GridSearchCV_Init, BaseForest_Fit, and Permutation_Importance. It is particularly important in containerized environments (Docker, Kubernetes), where CPU limits may not match the number of visible cores.
The Insight (Rule of Thumb)
- Action: Set `OMP_NUM_THREADS=1` when using `n_jobs=-1` to avoid oversubscription between joblib and OpenMP.
- Value: `n_jobs=-1` uses all available cores via joblib; combine with `OMP_NUM_THREADS=1` for best performance.
- Trade-off: `n_jobs=-1` increases memory usage linearly (each worker gets a copy of data). For large datasets, use fewer jobs.
- Anti-pattern: Using `joblib.delayed` instead of `sklearn.utils.parallel.delayed` — this loses scikit-learn config propagation to workers.
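A minimal sketch of the recommended setup: pin the OpenMP and BLAS thread pools via environment variables *before* importing numpy/scikit-learn (the pools are sized at import time), then let joblib use all cores. `OMP_NUM_THREADS`, `OPENBLAS_NUM_THREADS`, and `MKL_NUM_THREADS` are the standard variables for each backend; the commented estimator call is illustrative:

```python
import os

# Pin each native thread pool to one thread per process.
os.environ["OMP_NUM_THREADS"] = "1"       # OpenMP (Cython loops)
os.environ["OPENBLAS_NUM_THREADS"] = "1"  # OpenBLAS
os.environ["MKL_NUM_THREADS"] = "1"       # Intel MKL

# Now joblib-level parallelism owns all the cores:
# from sklearn.ensemble import RandomForestClassifier
# clf = RandomForestClassifier(n_estimators=200, n_jobs=-1)
```

Setting the variables in the shell (`OMP_NUM_THREADS=1 python train.py`) is equivalent and avoids any import-order concerns.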
Reasoning
When `n_jobs=4` and `OMP_NUM_THREADS=4`, a RandomForest fit can spawn 4 joblib processes, each running 4 OpenMP threads, for 16 total threads on potentially 4 cores. This causes massive context switching overhead. Setting `OMP_NUM_THREADS=1` when using joblib parallelism ensures one thread per process, matching the intended parallelism level. Scikit-learn's custom `Parallel` class (in `sklearn.utils.parallel`) wraps joblib.Parallel to propagate thread-local configuration to workers, which is lost if you use raw `joblib.delayed`.
Code Evidence
Sklearn's custom Parallel class from `sklearn/utils/parallel.py:41-52`:
```python
class Parallel(joblib.Parallel):
    """Tweak of :class:`joblib.Parallel` that propagates the scikit-learn
    configuration.

    This subclass of :class:`joblib.Parallel` ensures that the active
    configuration (thread-local) of scikit-learn is propagated to the
    parallel workers for the duration of the execution of the parallel
    tasks.
    """
```
Warning for wrong delayed usage from `sklearn/utils/parallel.py:29-37`:
```python
warnings.warn(
    (
        "`sklearn.utils.parallel.Parallel` needs to be used in "
        "conjunction with `sklearn.utils.parallel.delayed` instead of "
        "`joblib.delayed` to correctly propagate the scikit-learn "
        "configuration to the joblib workers."
    ),
    UserWarning,
)
```
Thread oversubscription prevention in test runner from `sklearn/conftest.py:334`:
```python
xdist_worker_count = environ.get("PYTEST_XDIST_WORKER_COUNT")
# When set, OpenMP and BLAS thread limits are adjusted to
# cpu_count // worker_count to prevent oversubscription
```
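The same budget split can be applied in your own test or CI setup. The helper below is a hypothetical sketch, not sklearn's code: it divides the visible cores among pytest-xdist workers (`PYTEST_XDIST_WORKER_COUNT` is set by pytest-xdist itself) so the per-worker OpenMP/BLAS pools do not oversubscribe:

```python
import os

# Hypothetical helper mirroring the cpu_count // worker_count split.
def threads_per_worker(environ=os.environ) -> int:
    workers = int(environ.get("PYTEST_XDIST_WORKER_COUNT", "1"))
    cores = os.cpu_count() or 1
    return max(1, cores // workers)

# With no xdist workers, each process may use every core.
print(threads_per_worker({}))
# With many workers, each is clamped to at least one thread.
print(threads_per_worker({"PYTEST_XDIST_WORKER_COUNT": "1000000"}))  # 1
```

The resulting value would typically be passed to `threadpoolctl.threadpool_limits` or exported as `OMP_NUM_THREADS` for each worker.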
Related Pages
- Implementation:Scikit_learn_Scikit_learn_BaseForest_Fit
- Implementation:Scikit_learn_Scikit_learn_Cross_Validate
- Implementation:Scikit_learn_Scikit_learn_BaseSearchCV_Fit
- Implementation:Scikit_learn_Scikit_learn_Permutation_Importance
- Principle:Scikit_learn_Scikit_learn_Ensemble_Training
- Principle:Scikit_learn_Scikit_learn_Cross_Validation