Environment: Snorkel Dask Distributed
| Knowledge Sources | |
|---|---|
| Domains | Infrastructure, Distributed_Computing |
| Last Updated | 2026-02-14 21:00 GMT |
Overview
Optional Dask environment with `dask[dataframe]` >= 2020.12.0 and `distributed` >= 2023.7.0 for parallel labeling function (LF) and slicing function (SF) application across multiple processes.
Description
This environment enables parallel execution of labeling functions and slicing functions using Dask. The `DaskLFApplier` and `DaskSFApplier` partition a pandas DataFrame into `n_parallel` partitions, apply LFs/SFs in parallel across processes, and reassemble the results.
Dask support also requires `dill` for serialization of user-defined functions across process boundaries.
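The dill requirement can be illustrated directly: the standard `pickle` module refuses some user-defined callables (such as lambdas), while `dill` serializes the function body itself. A minimal sketch, guarded so it degrades gracefully when `dill` is absent:

```python
# Why dill: pickle cannot serialize some user-defined callables (e.g. lambdas),
# which is exactly what must cross process boundaries when LFs run in parallel.
import pickle

lf = lambda x: 1 if "urgent" in x else -1

try:
    pickle.dumps(lf)  # fails: pickle serializes functions by name lookup
    pickled_ok = True
except (pickle.PicklingError, AttributeError):
    pickled_ok = False

try:
    import dill
    # dill serializes the function body, so the round trip works
    dilled_ok = dill.loads(dill.dumps(lf))("urgent!") == 1
except ImportError:
    dilled_ok = None  # dill not installed in this interpreter
```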
Usage
Use this environment when applying labeling or slicing functions to large datasets where single-process execution is too slow. The default parallelism is 2 processes. For single-process execution, use `PandasLFApplier`/`PandasSFApplier` instead.
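What the applier does internally can be sketched without Dask at all: partition the DataFrame, apply each LF per partition, and stack the per-partition label matrices. The `apply_lfs` helper and `lf_keyword` below are illustrative stand-ins, not Snorkel API; Dask's contribution is running the per-partition work in parallel processes rather than this serial loop.

```python
# Conceptual sketch of partition -> apply -> reassemble (serial stand-in for Dask).
import numpy as np
import pandas as pd

def lf_keyword(x):
    # Toy labeling function: 1 if "urgent" appears in the text, else -1.
    return 1 if "urgent" in x.text else -1

def apply_lfs(df, lfs, n_parallel=2):
    # Partition row ranges, like dd.from_pandas(df, npartitions=n_parallel).
    bounds = np.linspace(0, len(df), n_parallel + 1, dtype=int)
    mats = [
        np.array([[lf(row) for lf in lfs] for row in df.iloc[s:e].itertuples()])
        for s, e in zip(bounds[:-1], bounds[1:])  # Dask runs these in parallel
    ]
    return np.vstack(mats)  # reassemble into one (n_rows, n_lfs) label matrix

df = pd.DataFrame({"text": ["urgent: reply now", "hello there", "urgent meeting"]})
L = apply_lfs(df, [lf_keyword])  # (3, 1) label matrix
```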
Important: Dask is NOT guarded by try/except ImportError. Importing `DaskLFApplier` will fail immediately if Dask is not installed.
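Because the module itself does no guarding, callers who want Dask to stay optional can add their own guard; a minimal sketch (the fallback strategy is up to the caller):

```python
# Guard the unguarded import yourself: snorkel raises ImportError at import
# time if dask/distributed are missing.
try:
    from snorkel.labeling.apply.dask import DaskLFApplier
except ImportError:
    DaskLFApplier = None  # signal callers to fall back to PandasLFApplier

use_dask = DaskLFApplier is not None
```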
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| Python | >= 3.11 | Inherited from core Snorkel requirement |
| CPU | Multi-core recommended | n_parallel should not exceed available cores |
Dependencies
Python Packages
- `dask[dataframe]` >= 2020.12.0
- `distributed` >= 2023.7.0
- `dill` >= 0.3.0
Credentials
No credentials required.
Quick Install
pip install "dask[dataframe]>=2020.12.0" "distributed>=2023.7.0" "dill>=0.3.0"
Code Evidence
Direct import without guard from `labeling/apply/dask.py:6-7`:
from dask import dataframe as dd
from dask.distributed import Client
Minimum parallelism enforcement from `labeling/apply/dask.py:76-82`:
if n_parallel < 2:
    raise ValueError(
        "n_parallel should be >= 2. "
        "For single process Pandas, use PandasLFApplier."
    )
df = dd.from_pandas(df, npartitions=n_parallel)
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `ModuleNotFoundError: No module named 'dask'` | Dask not installed | `pip install "dask[dataframe]>=2020.12.0"` |
| `ValueError: n_parallel should be >= 2` | n_parallel set to 1 or 0 | Use `PandasLFApplier` for single-process, or set `n_parallel >= 2` |
| `ModuleNotFoundError: No module named 'dill'` | Dill not installed for serialization | `pip install "dill>=0.3.0"` |
Compatibility Notes
- No ImportError guard: Importing `from snorkel.labeling.apply.dask import DaskLFApplier` will crash if Dask is not installed.
- Default scheduler is "processes": Uses multiprocessing by default. Can be changed to "threads" or "synchronous" for debugging.
- n_parallel recommendation: Should be no more than the number of cores on the running machine.
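The scheduler note above can be exercised through Dask's standard config API; a sketch, guarded so it also runs where Dask is absent:

```python
# Switch Dask's scheduler globally; "synchronous" runs tasks in the calling
# thread, which makes debugging LF exceptions far easier than "processes".
try:
    import dask
    dask.config.set(scheduler="synchronous")
    scheduler = dask.config.get("scheduler")
except ImportError:
    scheduler = None  # dask not installed in this interpreter
```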