
Environment: Snorkel PySpark

From Leeroopedia
Domains: Infrastructure, Distributed_Computing
Last Updated: 2026-02-14 21:00 GMT

Overview

An optional `pyspark == 3.4.1` environment for distributed labeling function application on Spark RDDs and DataFrames.

Description

This environment enables distributed execution of Snorkel labeling functions on Apache Spark clusters. The SparkLFApplier works with PySpark RDD objects, and SparkMapper wraps preprocessing for Spark DataFrames. PySpark is intentionally excluded from Snorkel's main dependencies because installing a new version may overwrite an existing system Spark installation.
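The label-matrix semantics can be sketched in plain Python. This is an illustrative sketch only, not Snorkel's implementation, and the labeling functions `lf_contains_link` and `lf_all_caps` are hypothetical examples; `SparkLFApplier` performs the equivalent per-row map over an RDD and returns an `(n_points, n_lfs)` label matrix:

```python
# Pure-Python sketch of the label matrix SparkLFApplier produces.
# Each labeling function maps a data point to an integer label or -1 (abstain).
ABSTAIN = -1
SPAM = 1

def lf_contains_link(text):
    # Label as SPAM if the text contains a URL-like substring.
    return SPAM if "http" in text else ABSTAIN

def lf_all_caps(text):
    # Label as SPAM if the text is shouted in all caps.
    return SPAM if text.isupper() else ABSTAIN

def apply_lfs(lfs, data):
    # SparkLFApplier does the distributed equivalent of this nested map,
    # producing an (n_points, n_lfs) label matrix.
    return [[lf(x) for lf in lfs] for x in data]

data = ["check http://spam.example", "FREE MONEY", "see you at lunch"]
L = apply_lfs([lf_contains_link, lf_all_caps], data)
# L == [[1, -1], [-1, 1], [-1, -1]]
```

In the real API, the functions would be decorated with `@labeling_function()` and passed to `SparkLFApplier`, whose `apply` method runs the same per-row evaluation across the cluster on a PySpark RDD.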

Usage

Use this environment when applying labeling functions to datasets that are too large to fit in memory on a single machine, or when your data is already stored in a Spark cluster. For smaller datasets, use PandasLFApplier or DaskLFApplier instead.

Important: the PySpark imports are NOT guarded by try/except ImportError. Importing any Snorkel module that uses PySpark fails immediately if PySpark is not installed.
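Because Snorkel does not guard these imports itself, a caller can check availability up front and fail with a clearer message. A minimal sketch using only the standard library; the helper name `require_pyspark` is ours, not Snorkel's:

```python
import importlib.util

def require_pyspark():
    # Raise a descriptive error before any `snorkel.*.spark` import is attempted.
    if importlib.util.find_spec("pyspark") is None:
        raise ImportError(
            "This code path needs PySpark; install it with "
            "`pip install pyspark==3.4.1`"
        )

# Call require_pyspark() before, e.g.,
# `from snorkel.labeling.apply.spark import SparkLFApplier`.
```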

System Requirements

| Category | Requirement | Notes |
|---|---|---|
| Python | >= 3.11 | Inherited from core Snorkel requirement |
| Java | JDK 8 or 11 | Required by Apache Spark |
| Hardware | Spark cluster or local mode | Tested with Spark standalone |

Dependencies

Python Packages

  • `pyspark` == 3.4.1

Credentials

No Snorkel-specific credentials required. Spark cluster configuration is handled externally via Spark configuration files.
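For reference, cluster settings live in Spark's own configuration files rather than anywhere in Snorkel; a `spark-defaults.conf` fragment might look like the following (the hostname and memory sizes are illustrative, not recommendations):

```
# spark-defaults.conf (example values only)
spark.master            spark://master.example:7077
spark.executor.memory   4g
spark.driver.memory     2g
```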

Quick Install

# WARNING: This may overwrite your existing system Spark installation
pip install pyspark==3.4.1

Code Evidence

Direct import without guard from `labeling/apply/spark.py:4`:

from pyspark import RDD

Direct import from `map/spark.py:1`:

from pyspark.sql import Row
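Conceptually, the Spark mapper wraps a per-row preprocessing step so it can run over DataFrame Rows. A pure-Python sketch with dicts standing in for `pyspark.sql.Row`; the field names and the `add_text_length` helper are hypothetical:

```python
def add_text_length(row):
    # Enrich a row with a derived field before labeling functions run.
    # In PySpark this transformation would receive and return Row objects.
    out = dict(row)
    out["num_chars"] = len(row["text"])
    return out

rows = [{"text": "hello spark"}]
mapped = [add_text_length(r) for r in rows]
# mapped[0]["num_chars"] == 11
```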

Version pinning from `requirements-pyspark.txt:1-3`:

# Note: we don't include PySpark in the normal required installs.
# Installing a new version may overwrite your existing system install.
pyspark==3.4.1

Common Errors

| Error Message | Cause | Solution |
|---|---|---|
| `ModuleNotFoundError: No module named 'pyspark'` | PySpark not installed | `pip install pyspark==3.4.1` |
| `Exception: Java gateway process exited before sending its port number` | Java not installed or misconfigured | Install JDK 8 or 11 and set `JAVA_HOME` |
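For the Java gateway error, a typical fix is to export `JAVA_HOME` before launching Spark. The JDK path below is an example; adjust it to wherever your JDK 8 or 11 is installed:

```shell
# Point Spark at a JDK 11 installation (example path; adjust for your system)
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
export PATH="$JAVA_HOME/bin:$PATH"
```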

Compatibility Notes

  • Exact version pin: PySpark is pinned to exactly 3.4.1 (not `>=`). This is intentional to avoid overwriting system Spark installations.
  • Separate requirements file: PySpark has its own `requirements-pyspark.txt` file, separate from the main `requirements.txt`.
  • No ImportError guard: Importing `from snorkel.labeling.apply.spark import SparkLFApplier` will crash if PySpark is not installed.
