
Environment: Snorkel Dask Distributed

From Leeroopedia
Knowledge Sources
Domains Infrastructure, Distributed_Computing
Last Updated 2026-02-14 21:00 GMT

Overview

An optional Dask environment providing `dask[dataframe]` >= 2020.12.0 and `distributed` >= 2023.7.0 for parallel application of labeling functions (LFs) and slicing functions (SFs) across multiple processes.

Description

This environment enables parallel execution of labeling functions and slicing functions using Dask. The DaskLFApplier and DaskSFApplier partition a pandas DataFrame into `n_parallel` partitions, apply LFs/SFs in parallel across processes, and reassemble the results.
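The partition-apply-reassemble flow can be illustrated with a plain-Python sketch. The labeling functions, label values, and data below are hypothetical, and sequential chunk processing stands in for Dask partitions running on worker processes:

```python
# Toy sketch of the DaskLFApplier partition -> apply -> reassemble pattern.
# ABSTAIN/SPAM and both LFs are made up; the real applier runs each
# partition in a separate process via dask.distributed.
ABSTAIN, SPAM = -1, 1

def lf_contains_link(text):
    return SPAM if "http" in text else ABSTAIN

def lf_is_short(text):
    return SPAM if len(text) < 10 else ABSTAIN

lfs = [lf_contains_link, lf_is_short]
data = ["buy now http://x", "hello", "a perfectly normal sentence"]

n_parallel = 2
chunk = -(-len(data) // n_parallel)  # ceil division
# Partition into contiguous chunks, like the dd.from_pandas step.
parts = [data[i:i + chunk] for i in range(0, len(data), chunk)]
# Apply every LF to every row of each partition.
applied = [[[lf(x) for lf in lfs] for x in part] for part in parts]
# Reassemble the per-partition label matrices in order.
L = [row for part in applied for row in part]
print(L)  # one row per document, one column per LF
```

Each row of `L` holds one label per LF, which is the same label-matrix shape the Pandas appliers produce.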

Dask support also requires `dill` for serialization of user-defined functions across process boundaries.
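The dill requirement exists because process-based execution must ship user functions to worker processes, and the standard library's pickle serializes functions by reference, so lambdas and closures fail. A small guarded illustration (the sample lambda is hypothetical, and the dill import is guarded since it is optional):

```python
import pickle

# User-defined LFs are often lambdas or closures, which the standard
# pickle module cannot serialize by reference:
lf = lambda text: "http" in text

try:
    pickle.dumps(lf)
    pickle_ok = True
except Exception:
    pickle_ok = False     # pickle refuses lambdas

# dill serializes the function body itself (optional dependency):
try:
    import dill
    restored = dill.loads(dill.dumps(lf))
    dill_ok = restored("see http://example.com")
except ImportError:
    dill_ok = None        # dill not installed in this environment

print("pickle:", pickle_ok, "dill:", dill_ok)
```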

Usage

Use this environment when applying labeling or slicing functions to large datasets where single-process execution is too slow. The default parallelism is 2 processes. For single-process execution, use PandasLFApplier/PandasSFApplier instead.

Important: Dask is NOT guarded by try/except ImportError. Importing `DaskLFApplier` will fail immediately if Dask is not installed.
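Callers that want Dask to remain truly optional can add their own guard around the import. A sketch (the `HAS_DASK` flag is our own convention, not part of snorkel):

```python
# Caller-side guard for the unguarded Dask import in snorkel.
try:
    from snorkel.labeling.apply.dask import DaskLFApplier
    HAS_DASK = True
except ImportError:       # raised when snorkel or dask is missing
    DaskLFApplier = None
    HAS_DASK = False

print("Dask appliers available:", HAS_DASK)
```

Code can then branch on `HAS_DASK` and fall back to `PandasLFApplier` when Dask is absent.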

System Requirements

Category | Requirement | Notes
Python | >= 3.11 | Inherited from core Snorkel requirement
CPU | Multi-core recommended | `n_parallel` should not exceed available cores

Dependencies

Python Packages

  • `dask[dataframe]` >= 2020.12.0
  • `distributed` >= 2023.7.0
  • `dill` >= 0.3.0

Credentials

No credentials required.

Quick Install

pip install "dask[dataframe]>=2020.12.0" "distributed>=2023.7.0" "dill>=0.3.0"
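After installing, the pinned minimums can be checked without importing the packages themselves, using the standard library's `importlib.metadata` (a sketch; distribution names as published on PyPI):

```python
from importlib import metadata

# Minimum versions from the Dependencies section above.
MINIMUMS = {"dask": "2020.12.0", "distributed": "2023.7.0", "dill": "0.3.0"}

installed = {}
for dist in MINIMUMS:
    try:
        installed[dist] = metadata.version(dist)
    except metadata.PackageNotFoundError:
        installed[dist] = None   # not installed; see Quick Install above

print(installed)
```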

Code Evidence

Direct import without guard from `labeling/apply/dask.py:6-7`:

from dask import dataframe as dd
from dask.distributed import Client

Minimum parallelism enforcement from `labeling/apply/dask.py:76-82`:

        if n_parallel < 2:
            raise ValueError(
                "n_parallel should be >= 2. "
                "For single process Pandas, use PandasLFApplier."
            )
        df = dd.from_pandas(df, npartitions=n_parallel)
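The guard can be exercised in isolation. This is a toy reimplementation mirroring the excerpt above, not the library source:

```python
def check_n_parallel(n_parallel: int) -> None:
    # Mirror of the guard in labeling/apply/dask.py:76-82.
    if n_parallel < 2:
        raise ValueError(
            "n_parallel should be >= 2. "
            "For single process Pandas, use PandasLFApplier."
        )

check_n_parallel(2)       # ok: minimum accepted value
try:
    check_n_parallel(1)   # too low: single process belongs to Pandas appliers
    raised = False
except ValueError:
    raised = True
```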

Common Errors

Error Message | Cause | Solution
`ModuleNotFoundError: No module named 'dask'` | Dask not installed | `pip install "dask[dataframe]>=2020.12.0"`
`ValueError: n_parallel should be >= 2` | `n_parallel` set to 1 or 0 | Use `PandasLFApplier` for single-process execution, or set `n_parallel >= 2`
`ModuleNotFoundError: No module named 'dill'` | dill not installed for serialization | `pip install "dill>=0.3.0"`

Compatibility Notes

  • No ImportError guard: Importing `from snorkel.labeling.apply.dask import DaskLFApplier` will crash if Dask is not installed.
  • Default scheduler is "processes": Uses multiprocessing by default. Can be changed to "threads" or "synchronous" for debugging.
  • n_parallel recommendation: Should be no more than the number of cores on the running machine.
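One way to honor the core-count recommendation while satisfying the applier's minimum of 2 (a sketch; sizing `n_parallel` from `os.cpu_count()` is our own heuristic, not a snorkel API):

```python
import os

cores = os.cpu_count() or 2   # cpu_count() can return None
n_parallel = max(2, cores)    # at least 2 (applier requirement); equals the
                              # core count except on single-core machines
print("n_parallel =", n_parallel)
```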
