Environment: Snorkel Dask Distributed
| Knowledge Sources | |
|---|---|
| Domains | Infrastructure, Distributed_Computing |
| Last Updated | 2026-02-14 21:00 GMT |
Overview
Optional Dask environment with `dask[dataframe]` >= 2020.12.0 and `distributed` >= 2023.7.0 for parallel labeling function (LF) and slicing function (SF) application across multiple processes.
Description
This environment enables parallel execution of labeling functions and slicing functions using Dask. The `DaskLFApplier` and `DaskSFApplier` partition a pandas DataFrame into `n_parallel` partitions, apply LFs/SFs in parallel across processes, and reassemble the results.
Dask support also requires `dill` for serialization of user-defined functions across process boundaries.
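The dill requirement can be illustrated directly: the standard `pickle` module refuses some user-defined callables (such as lambdas), while `dill` serializes the function body itself. A minimal sketch, guarded so it degrades gracefully when `dill` is absent:

```python
# Why dill: pickle cannot serialize some user-defined callables (e.g. lambdas),
# which is exactly what must cross process boundaries when LFs run in parallel.
import pickle

lf = lambda x: 1 if "urgent" in x else -1

try:
    pickle.dumps(lf)  # fails: pickle serializes functions by name lookup
    pickled_ok = True
except (pickle.PicklingError, AttributeError):
    pickled_ok = False

try:
    import dill
    # dill serializes the function body, so the round trip works
    dilled_ok = dill.loads(dill.dumps(lf))("urgent!") == 1
except ImportError:
    dilled_ok = None  # dill not installed in this interpreter
```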
Usage
Use this environment when applying labeling or slicing functions to large datasets where single-process execution is too slow. The default parallelism is 2 processes. For single-process execution, use `PandasLFApplier`/`PandasSFApplier` instead.
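What the applier does internally can be sketched without Dask at all: partition the DataFrame, apply each LF per partition, and stack the per-partition label matrices. The `apply_lfs` helper and `lf_keyword` below are illustrative stand-ins, not Snorkel API; Dask's contribution is running the per-partition work in parallel processes rather than this serial loop.

```python
# Conceptual sketch of partition -> apply -> reassemble (serial stand-in for Dask).
import numpy as np
import pandas as pd

def lf_keyword(x):
    # Toy labeling function: 1 if "urgent" appears in the text, else -1.
    return 1 if "urgent" in x.text else -1

def apply_lfs(df, lfs, n_parallel=2):
    # Partition row ranges, like dd.from_pandas(df, npartitions=n_parallel).
    bounds = np.linspace(0, len(df), n_parallel + 1, dtype=int)
    mats = [
        np.array([[lf(row) for lf in lfs] for row in df.iloc[s:e].itertuples()])
        for s, e in zip(bounds[:-1], bounds[1:])  # Dask runs these in parallel
    ]
    return np.vstack(mats)  # reassemble into one (n_rows, n_lfs) label matrix

df = pd.DataFrame({"text": ["urgent: reply now", "hello there", "urgent meeting"]})
L = apply_lfs(df, [lf_keyword])  # (3, 1) label matrix
```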
Important: Dask is NOT guarded by try/except ImportError. Importing `DaskLFApplier` will fail immediately if Dask is not installed.
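Because the module itself does no guarding, callers who want Dask to stay optional can add their own guard; a minimal sketch (the fallback strategy is up to the caller):

```python
# Guard the unguarded import yourself: snorkel raises ImportError at import
# time if dask/distributed are missing.
try:
    from snorkel.labeling.apply.dask import DaskLFApplier
except ImportError:
    DaskLFApplier = None  # signal callers to fall back to PandasLFApplier

use_dask = DaskLFApplier is not None
```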
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| Python | >= 3.11 | Inherited from core Snorkel requirement |
| CPU | Multi-core recommended | n_parallel should not exceed available cores |
Dependencies
Python Packages
- `dask[dataframe]` >= 2020.12.0
- `distributed` >= 2023.7.0
- `dill` >= 0.3.0
Credentials
No credentials required.
Quick Install
pip install "dask[dataframe]>=2020.12.0" "distributed>=2023.7.0" "dill>=0.3.0"
Code Evidence
Direct import without guard from `labeling/apply/dask.py:6-7`:
from dask import dataframe as dd
from dask.distributed import Client
Minimum parallelism enforcement from `labeling/apply/dask.py:76-82`:
if n_parallel < 2:
    raise ValueError(
        "n_parallel should be >= 2. "
        "For single process Pandas, use PandasLFApplier."
    )
df = dd.from_pandas(df, npartitions=n_parallel)
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `ModuleNotFoundError: No module named 'dask'` | Dask not installed | `pip install "dask[dataframe]>=2020.12.0"` |
| `ValueError: n_parallel should be >= 2` | n_parallel set to 1 or 0 | Use `PandasLFApplier` for single-process, or set `n_parallel >= 2` |
| `ModuleNotFoundError: No module named 'dill'` | Dill not installed for serialization | `pip install "dill>=0.3.0"` |
Compatibility Notes
- No ImportError guard: Importing `from snorkel.labeling.apply.dask import DaskLFApplier` will crash if Dask is not installed.
- Default scheduler is "processes": Uses multiprocessing by default. Can be changed to "threads" or "synchronous" for debugging.
- n_parallel recommendation: Should be no more than the number of cores on the running machine.
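The scheduler note above can be exercised through Dask's standard config API; a sketch, guarded so it also runs where Dask is absent:

```python
# Switch Dask's scheduler globally; "synchronous" runs tasks in the calling
# thread, which makes debugging LF exceptions far easier than "processes".
try:
    import dask
    dask.config.set(scheduler="synchronous")
    scheduler = dask.config.get("scheduler")
except ImportError:
    scheduler = None  # dask not installed in this interpreter
```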