
Environment: Cleanlab Python Core Environment

From Leeroopedia


Knowledge Sources
Domains: Data_Centric_AI, Machine_Learning
Last Updated: 2026-02-09 19:30 GMT

Overview

Python 3.10+ environment with NumPy, scikit-learn, pandas, tqdm, and termcolor as core dependencies for the cleanlab data-centric AI library.

Description

This environment provides the base runtime for all cleanlab functionality. It is a pure CPU-based Python environment with no GPU requirements. The core dependencies handle numerical computation (NumPy), machine learning models (scikit-learn), data manipulation (pandas), progress display (tqdm), and colored terminal output (termcolor). Optional packages extend cleanlab with Datalab dataset auditing (requires the `datasets` package) and image quality checks (requires CleanVision).

Usage

Use this environment for any cleanlab workflow: classification label issue detection, dataset health analysis, CleanLearning robust training, object detection quality scoring, token classification, multiannotator consensus, and regression label quality. This is the mandatory prerequisite for all Implementation pages in the cleanlab wiki.

System Requirements

  • OS: Linux, macOS, or Windows (all platforms supported; multiprocessing is fastest on Linux)
  • Hardware: CPU only; no GPU is required for core functionality
  • Python: >= 3.10 (3.10, 3.11, 3.12, 3.13, and 3.14 are supported)
  • Disk: minimal; depends on dataset size

Dependencies

System Packages

No system-level packages required beyond a standard Python installation.

Python Packages (Core)

  • `numpy` >= 1.22
  • `scikit-learn` >= 1.1
  • `tqdm` >= 4.53.0
  • `pandas` >= 1.4.0
  • `termcolor` >= 2.4.0

Python Packages (Optional)

  • `psutil` — Enables detection of physical CPU cores for optimal multiprocessing (falls back to logical cores if absent)
  • `matplotlib` >= 3.5.1 — Required for visualization functions in object detection, segmentation, and the `all` extras group
  • `torch` — Required only for experimental PyTorch models (cifar_cnn, mnist_pytorch, coteaching)
  • `scipy` — Used internally by neighbor search and outlier detection
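The psutil fallback described above can be sketched as follows; this mirrors the behavior, not cleanlab's internal code:

```python
import multiprocessing

def default_n_jobs():
    """Prefer physical core count via psutil; fall back to logical cores."""
    try:
        import psutil
        n = psutil.cpu_count(logical=False)  # physical cores; may be None
        if n:
            return n
    except ImportError:
        pass  # psutil absent: fall back to logical core count
    return multiprocessing.cpu_count()

print(default_n_jobs())
```

On machines with hyper-threading, the physical count is typically half the logical count, which is why installing psutil can improve multiprocessing throughput.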

Credentials

No credentials or environment variables are required for core cleanlab functionality.

Quick Install

# Core install
pip install cleanlab

# Install with all optional dependencies
pip install 'cleanlab[all]'
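After installing, a quick sanity check can confirm the core dependencies resolve. This is a sketch: the package list mirrors the core dependencies above, and any missing optional package simply reports as absent rather than crashing:

```python
import importlib

def check_environment(packages=("numpy", "sklearn", "pandas", "tqdm", "termcolor", "cleanlab")):
    """Report which packages are importable and their versions."""
    report = {}
    for pkg in packages:
        try:
            mod = importlib.import_module(pkg)
            report[pkg] = getattr(mod, "__version__", "unknown")
        except ImportError:
            report[pkg] = None  # not installed
    return report

print(check_environment())
```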

Code Evidence

Python version requirement from `pyproject.toml:47`:

requires-python = ">=3.10"

Core dependencies from `pyproject.toml:11-17`:

dependencies = [
  "numpy>=1.22",
  "scikit-learn>=1.1",
  "tqdm>=4.53.0",
  "pandas>=1.4.0",
  "termcolor>=2.4.0",
]

Optional psutil import with fallback from `cleanlab/filter.py:43-50`:

# psutil is a package used to count physical cores for multiprocessing
# This package is not necessary, because we can always fall back to logical cores as the default
try:
    import psutil
    psutil_exists = True
except ImportError as e:
    psutil_exists = False

Optional tqdm import with warning from `cleanlab/filter.py:33-41`:

try:
    import tqdm.auto as tqdm
    tqdm_exists = True
except ImportError as e:
    tqdm_exists = False
    w = """To see estimated completion times for methods in cleanlab.filter, "pip install tqdm"."""
    warnings.warn(w)

Common Errors

  • `To default n_jobs to the number of physical cores... pip install psutil`: psutil is not installed, so cleanlab falls back to logical cores. Fix with `pip install psutil` (optional; affects runtime only).
  • `To see estimated completion times... pip install tqdm`: tqdm is not installed. Fix with `pip install tqdm` (optional; enables progress bars).
  • `try "pip install matplotlib"`: a visualization function was called without matplotlib installed. Fix with `pip install matplotlib` or `pip install 'cleanlab[all]'`.

Compatibility Notes

  • Python 3.14+: multiprocessing behavior on Linux changes. Global-variable sharing via fork is no longer reliable, so data is pickled to subprocesses instead (see `cleanlab/filter.py:363`).
  • Windows/macOS: multi-label multiprocessing defaults to `n_jobs=1` because spawn-based multiprocessing is much slower for these cases.
  • sklearn >= 1.8.0: `confusion_matrix` with empty inputs raises `ValueError` instead of returning a zeros matrix. Cleanlab handles this gracefully (see `cleanlab/count.py:600-604`).
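The `confusion_matrix` behavior change in the last note can be handled with a version-agnostic wrapper; this is a sketch of the pattern, not cleanlab's internal code, and `safe_confusion_matrix` is a hypothetical helper name:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def safe_confusion_matrix(y_true, y_pred, labels):
    """Return a confusion matrix that tolerates empty inputs across
    sklearn versions: older versions return a zeros matrix, while
    sklearn >= 1.8.0 raises ValueError."""
    try:
        return confusion_matrix(y_true, y_pred, labels=labels)
    except ValueError:
        n = len(labels)
        return np.zeros((n, n), dtype=int)

# Empty inputs yield an all-zeros matrix on any sklearn version.
cm = safe_confusion_matrix([], [], labels=[0, 1])
print(cm)
```

Either branch produces the same result for empty inputs, so downstream code that sums or normalizes the matrix keeps working across sklearn versions.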
