Environment: Snorkel PySpark
| Knowledge Sources | |
|---|---|
| Domains | Infrastructure, Distributed_Computing |
| Last Updated | 2026-02-14 21:00 GMT |
Overview
Optional PySpark == 3.4.1 environment for distributed labeling function application on Spark RDDs and DataFrames.
Description
This environment enables distributed execution of Snorkel labeling functions on Apache Spark clusters. The SparkLFApplier works with PySpark RDD objects, and SparkMapper wraps preprocessing for Spark DataFrames. PySpark is intentionally excluded from Snorkel's main dependencies because installing a new version may overwrite an existing system Spark installation.
Usage
Use this environment when applying labeling functions to datasets that are too large to fit in memory on a single machine, or when your data is already stored in a Spark cluster. For smaller datasets, use PandasLFApplier or DaskLFApplier instead.
Important: PySpark is NOT guarded by try/except ImportError. Importing modules that use PySpark will fail immediately if it is not installed.
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| Python | >= 3.11 | Inherited from core Snorkel requirement |
| Java | JDK 8 or 11 | Required by Apache Spark |
| Hardware | Spark cluster or local mode | Tested with Spark standalone |
Dependencies
Python Packages
- `pyspark` == 3.4.1
Credentials
No Snorkel-specific credentials required. Spark cluster configuration is handled externally via Spark configuration files.
Quick Install
```shell
# WARNING: This may overwrite your existing system Spark installation
pip install pyspark==3.4.1
```
Code Evidence
Direct import without guard from `labeling/apply/spark.py:4`:
```python
from pyspark import RDD
```
Direct import from `map/spark.py:1`:
```python
from pyspark.sql import Row
```
Version pinning from `requirements-pyspark.txt:1-3`:
```
# Note: we don't include PySpark in the normal required installs.
# Installing a new version may overwrite your existing system install.
pyspark==3.4.1
```
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `ModuleNotFoundError: No module named 'pyspark'` | PySpark not installed | `pip install pyspark==3.4.1` |
| `Exception: Java gateway process exited before sending its port number` | Java not installed or misconfigured | Install JDK 8 or 11 and set JAVA_HOME |
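A quick diagnostic for the Java gateway error above; the fallback message and the check itself are illustrative, and the JDK location varies by system:

```shell
# Hedged check: confirm a JDK is reachable before starting Spark.
if [ -z "$JAVA_HOME" ]; then
  echo "JAVA_HOME is not set; Spark's Java gateway will likely fail"
else
  echo "JAVA_HOME=$JAVA_HOME"
  "$JAVA_HOME/bin/java" -version
fi
```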
Compatibility Notes
- Exact version pin: PySpark is pinned to exactly 3.4.1 (not `>=`). This is intentional to avoid overwriting system Spark installations.
- Separate requirements file: PySpark has its own `requirements-pyspark.txt` file, separate from the main `requirements.txt`.
- No ImportError guard: Importing `from snorkel.labeling.apply.spark import SparkLFApplier` will crash if PySpark is not installed.