Environment:Cleanlab Cleanlab Datalab Dependencies

From Leeroopedia


Knowledge Sources
Domains: Data_Centric_AI, Dataset_Auditing
Last Updated: 2026-02-09 19:30 GMT

Overview

An optional dependency environment that extends the core cleanlab install with the HuggingFace `datasets` package, which Datalab's automated dataset auditing requires.

Description

The Datalab module provides automated multi-issue detection across a dataset (label errors, outliers, duplicates, class imbalance, non-IID violations, null values, underperforming groups, data valuation). It requires the HuggingFace `datasets` library (>= 2.7.0) for its internal data storage class and for some type hints. When the `datasets` package is not installed, `Datalab` is replaced with a `DatalabUnavailable` stub that raises an `ImportError` carrying installation instructions on any access attempt.

Usage

Use this environment when running the Datalab automated dataset audit workflow. Install via the `[datalab]` extras group. This is the prerequisite for the Datalab_Init, Datalab_Find_Issues, Datalab_Report, Datalab_Get_Issues, and Datalab_Get_Issue_Summary implementations.
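The audit workflow listed above can be sketched roughly as follows. This is a minimal illustration rather than the project's reference example: `datalab_available` and `audit` are hypothetical helper names, and the sketch assumes a classification dataset passed as a dict of columns together with a matrix of predicted class probabilities.

```python
import importlib.util


def datalab_available() -> bool:
    """Return True if both cleanlab and datasets can be located (hypothetical helper)."""
    return all(
        importlib.util.find_spec(name) is not None
        for name in ("cleanlab", "datasets")
    )


def audit(data, label_name, pred_probs):
    """Run init -> find_issues -> report and return the two result tables.

    Assumption: `data` is a dict of columns (or another input type Datalab
    accepts) and `pred_probs` is an (n_examples, n_classes) array.
    """
    # Imported lazily so that defining this function never fails when the
    # optional dependencies are absent.
    from cleanlab import Datalab

    lab = Datalab(data=data, label_name=label_name)  # Datalab_Init
    lab.find_issues(pred_probs=pred_probs)           # Datalab_Find_Issues
    lab.report()                                     # Datalab_Report
    # Datalab_Get_Issues and Datalab_Get_Issue_Summary
    return lab.get_issues(), lab.get_issue_summary()
```

Checking `datalab_available()` before calling `audit(...)` lets a script skip the audit cleanly instead of failing with an `ImportError` when the `[datalab]` extras are not installed.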

System Requirements

| Category | Requirement | Notes |
|----------|-------------|-------|
| OS | Linux, macOS, or Windows | Same as core cleanlab |
| Hardware | CPU | No GPU required |
| Python | >= 3.10 | Same as core cleanlab |
| Disk | Moderate | HuggingFace datasets may cache data to disk |

Dependencies

System Packages

No additional system-level packages required beyond core cleanlab.

Python Packages

  • `datasets` >= 2.7.0 (installed by the `[datalab]` extras group)

Credentials

No credentials required. HuggingFace datasets can load local or public datasets without authentication.

Quick Install

pip install 'cleanlab[datalab]'

Code Evidence

Datalab import gating from `cleanlab/__init__.py:32-41`:

def _datalab_import_factory():
    try:
        from .datalab.datalab import Datalab as _Datalab
        return _Datalab
    except ImportError:
        return DatalabUnavailable(
            "Datalab is not available due to missing dependencies. "
            "To install Datalab, run `pip install 'cleanlab[datalab]'`."
        )
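The gating pattern above can be illustrated with a small, self-contained sketch. It is not cleanlab's actual `DatalabUnavailable` implementation, which this page does not show. `UnavailableStub` and `import_or_stub` are hypothetical names, and the stub is assumed to raise on both calls and attribute access.

```python
import importlib


class UnavailableStub:
    """Placeholder object that raises ImportError whenever it is used."""

    def __init__(self, message: str) -> None:
        self._message = message

    def __getattr__(self, name: str):
        # Only invoked for attributes that normal lookup does not find,
        # so reading self._message here does not recurse.
        raise ImportError(f"{self._message} (while accessing {name!r})")

    def __call__(self, *args, **kwargs):
        raise ImportError(f"{self._message} (while calling the stub)")


def import_or_stub(module_name: str, attr: str, message: str):
    """Return `module_name.attr`, or a stub if the module cannot be imported."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return UnavailableStub(message)
    return getattr(module, attr)
```

Deferring the error to first use keeps `import cleanlab` working for users who never touch Datalab, while anyone who does gets the installation hint at the point of use.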

Datasets package import check from `cleanlab/datalab/internal/data.py:8-15`:

try:
    import datasets
except ImportError as error:
    raise ImportError(
        "Cannot import datasets package. "
        "Please install it and try again, or just install cleanlab with "
        "all optional dependencies via: `pip install 'cleanlab[all]'`"
    ) from error

Datasets 4.0.0+ compatibility handling from `cleanlab/datalab/internal/data.py:22-29`:

# Import Column types for compatibility with datasets 4.0.0+
try:
    from datasets.arrow_dataset import Column
    from datasets.iterable_dataset import IterableColumn
except ImportError:
    # For backwards compatibility with older datasets versions
    Column = None
    IterableColumn = None

Optional dependency definition from `setup.py:24-28`:

DATALAB_REQUIRE = [
    # Mainly for Datalab's data storage class.
    # Still some type hints that require datasets
    "datasets>=2.7.0",
]

Common Errors

| Error Message | Cause | Solution |
|---------------|-------|----------|
| `Datalab is not available due to missing dependencies` | `datasets` package not installed | `pip install 'cleanlab[datalab]'` |
| `IssueManager is not available due to missing dependencies for Datalab` | `datasets` package not installed | `pip install 'cleanlab[datalab]'` |
| `Cannot import datasets package` | Attempting to use Datalab internals without `datasets` | `pip install 'cleanlab[all]'` |

Compatibility Notes

  • datasets >= 4.0.0: Introduced new `Column` and `IterableColumn` types. Cleanlab handles both old and new versions gracefully via try/except import.
  • datasets < 2.7.0: Not supported. May cause subtle type or API errors.
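The version boundaries in these notes can be checked at runtime with the standard library alone. The following is a sketch under simplifying assumptions: `parse_release` and `datasets_status` are hypothetical helpers, only the leading numeric part of the version string is compared, and pre-release suffixes such as `.dev0` or `rc1` are ignored.

```python
import re
from importlib import metadata

MIN_DATASETS = (2, 7, 0)   # lower bound from the [datalab] extras
COLUMN_TYPES = (4, 0, 0)   # first release with Column / IterableColumn


def parse_release(version: str) -> tuple:
    """Leading numeric components of a version string, padded to three parts."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    parts = [int(p) for p in match.group().split(".")]
    parts += [0] * (3 - len(parts))  # so that "2.7" compares equal to "2.7.0"
    return tuple(parts)


def datasets_status() -> str:
    """Classify the installed datasets package against the notes above."""
    try:
        installed = metadata.version("datasets")
    except metadata.PackageNotFoundError:
        return "missing"
    release = parse_release(installed)
    if release < MIN_DATASETS:
        return "unsupported"
    if release >= COLUMN_TYPES:
        return "supported-with-column-types"
    return "supported"
```

A result of `"missing"` or `"unsupported"` corresponds to the situations where reinstalling with `pip install 'cleanlab[datalab]'` is the suggested fix.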
