Environment:Snorkel team Snorkel PyTorch

Knowledge Sources	Snorkel PyTorch
Domains	Infrastructure, Deep_Learning
Last Updated	2026-02-14 21:00 GMT

Overview

Python 3.11+ environment with PyTorch >= 1.2.0, NumPy, SciPy, and the rest of the core scientific computing stack, required for all Snorkel label modeling and classification workflows.

Description

This environment provides the core runtime for Snorkel's PyTorch-dependent modules: the LabelModel (a generative model trained via gradient descent), the MultitaskClassifier (a multi-task neural network), and all classification/training infrastructure. PyTorch is used for tensor operations, autograd, nn.Module subclassing, DataParallel, and optimizer/scheduler registries.

The environment includes the full scientific Python stack: NumPy for matrix operations, SciPy for sparse matrix handling, pandas for data manipulation, scikit-learn for metrics, and TensorBoard for visualization.

Usage

Use this environment for any Snorkel workflow that involves the LabelModel (weak supervision pipeline) or MultitaskClassifier/SliceAwareClassifier (slice-aware training, multitask classification). The core labeling and augmentation modules (LabelingFunction, TransformationFunction, and the appliers) use only CPU-based operations and can run without PyTorch; the label model and classification pipelines cannot.
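Because only part of Snorkel depends on PyTorch, a script can check for it before entering a label model or classifier code path. The following is a minimal sketch using only the standard library; `module_available` and `require_torch_for` are hypothetical helper names, not part of Snorkel.

```python
import importlib.util


def module_available(name: str) -> bool:
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(name) is not None


def require_torch_for(workflow: str) -> None:
    """Fail early with a readable message if PyTorch is missing (hypothetical helper)."""
    if not module_available("torch"):
        raise RuntimeError(
            f"{workflow} needs PyTorch; install it with: pip install 'torch>=1.2.0'"
        )


if __name__ == "__main__":
    # Labeling-only scripts can skip this check entirely.
    print("torch available:", module_available("torch"))
```

Calling `require_torch_for("LabelModel training")` at the top of a training script turns a late `ModuleNotFoundError` into an explicit message about which workflow needs the dependency.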

System Requirements

Category	Requirement	Notes
OS	Linux, macOS, Windows	Linux recommended for CUDA support
Python	>= 3.11	Enforced in setup.py: `python_requires=">=3.11"`
Hardware	CPU (default) or NVIDIA GPU	LabelModel defaults to CPU; MultitaskClassifier defaults to GPU 0
Disk	2GB+	For package installation and model checkpoints

Dependencies

System Packages

`python` >= 3.11

Python Packages (Essential)

`torch` >= 1.2.0
`numpy` >= 1.24.0
`scipy` >= 1.2.0
`pandas` >= 1.0.0
`scikit-learn` >= 0.20.2
`tqdm` >= 4.33.0
`munkres` >= 1.0.6
`networkx` >= 2.2
`tensorboard` >= 2.13.0
`protobuf` >= 3.19.6
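The minimums listed above can be checked against an existing installation. This is a rough sketch, assuming a simplified version parser that compares only the leading numeric components (it is not a full PEP 440 implementation); `parse_version`, `meets_minimum`, and `check_environment` are illustrative names.

```python
from importlib import metadata

# Minimum versions copied from the list above.
MINIMUMS = {
    "torch": "1.2.0",
    "numpy": "1.24.0",
    "scipy": "1.2.0",
    "pandas": "1.0.0",
    "scikit-learn": "0.20.2",
    "tqdm": "4.33.0",
    "munkres": "1.0.6",
    "networkx": "2.2",
    "tensorboard": "2.13.0",
    "protobuf": "3.19.6",
}


def parse_version(text):
    """Turn '2.13.0' or '2.1.0+cu118' into a tuple of leading integers."""
    parts = []
    for piece in text.split("+")[0].split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(found, minimum):
    """Compare two version strings after zero-padding to equal length."""
    a, b = parse_version(found), parse_version(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_environment(minimums=MINIMUMS):
    """Return {package: status} for each required distribution."""
    report = {}
    for name, minimum in minimums.items():
        try:
            found = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        if meets_minimum(found, minimum):
            report[name] = "ok"
        else:
            report[name] = f"too old (found {found}, need >= {minimum})"
    return report


if __name__ == "__main__":
    for name, status in check_environment().items():
        print(f"{name}: {status}")
```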

Credentials

No environment variables or credentials are required. All configuration is done through Python config objects (NamedTuples/Config classes). The codebase contains zero `os.environ` calls.

Quick Install

# Install Snorkel with all essential dependencies
pip install snorkel

# Or install dependencies manually
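# (version specifiers are quoted so the shell does not treat ">" as a redirect)
pip install "torch>=1.2.0" "numpy>=1.24.0" "scipy>=1.2.0" "pandas>=1.0.0" "scikit-learn>=0.20.2" "tqdm>=4.33.0" "munkres>=1.0.6" "networkx>=2.2" "tensorboard>=2.13.0" "protobuf>=3.19.6"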

Code Evidence

CUDA validation from `label_model.py:141-143`:

        # Confirm that cuda is available if config is using CUDA
        if self.config.device != "cpu" and not torch.cuda.is_available():
            raise ValueError("device=cuda but CUDA not available.")

MultitaskClassifier soft fallback from `multitask_classifier.py:481-489`:

    def _move_to_device(self) -> None:
        device = self.config.device
        if device >= 0:
            if torch.cuda.is_available():
                logging.info(f"Moving model to GPU (cuda:{device}).")
                self.to(torch.device(f"cuda:{device}"))
            else:
                logging.info("No cuda device available. Switch to cpu instead.")
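The two snippets above behave differently when CUDA is missing: one raises, the other logs and continues. The contrast can be sketched as two pure functions that take the CUDA availability as an argument, so they run without PyTorch installed. The function names are hypothetical; in real use `cuda_available` would be the result of `torch.cuda.is_available()`.

```python
import logging


def resolve_label_model_device(device: str, cuda_available: bool) -> str:
    """Mirror the hard-fail check: any device other than "cpu" requires CUDA."""
    if device != "cpu" and not cuda_available:
        raise ValueError("device=cuda but CUDA not available.")
    return device


def resolve_classifier_device(device: int, cuda_available: bool) -> str:
    """Mirror the soft fallback: log at INFO level and stay on CPU."""
    if device >= 0 and cuda_available:
        return f"cuda:{device}"
    if device >= 0:
        logging.info("No cuda device available. Switch to cpu instead.")
    return "cpu"
```

With `cuda_available=False`, `resolve_label_model_device("cuda", False)` raises `ValueError`, while `resolve_classifier_device(0, False)` returns `"cpu"` and leaves only a log record behind.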

Python version requirement from `setup.py:49`:

    python_requires=">=3.11",

Common Errors

Error Message	Cause	Solution
`ValueError: device=cuda but CUDA not available.`	LabelModel configured with a device other than `"cpu"` (e.g. `device="cuda"`) while `torch.cuda.is_available()` returns False	Use the default `device="cpu"`, or install a CUDA-enabled PyTorch build and a working NVIDIA driver
`No cuda device available. Switch to cpu instead.`	MultitaskClassifier configured with a GPU device but CUDA is unavailable	Soft fallback: the message is logged at INFO level and no exception is raised; set `device=-1` to select CPU explicitly
`ModuleNotFoundError: No module named 'torch'`	PyTorch not installed	`pip install "torch>=1.2.0"`

Compatibility Notes

Device specification is inconsistent: LabelModel uses a string-based device (`"cpu"`, `"cuda:0"`), while MultitaskClassifier uses an integer-based one (`-1`=CPU, `0`=GPU 0). LabelModel hard-fails on missing CUDA; MultitaskClassifier falls back to CPU and only logs an INFO-level message, which is easy to miss because Python's default logging level is WARNING.
DataParallel ON by default: The MultitaskClassifier wraps all module pool entries in `nn.DataParallel` by default (`dataparallel: bool = True`). Set `dataparallel=False` on single-GPU systems to avoid overhead.
No upper version bounds: All dependencies use `>=` constraints with no upper bounds, which could lead to breakage with future major versions of PyTorch or NumPy.
Build system: Requires `setuptools >= 40.6.2` and `wheel >= 0.30.0` (from pyproject.toml).
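Where one setting has to drive both models, the two device conventions described above can be translated into each other. This is a minimal sketch with hypothetical helper names; treating a bare `"cuda"` as GPU 0 is an assumption made here for illustration.

```python
def to_classifier_device(device: str) -> int:
    """Map a LabelModel-style string to a MultitaskClassifier-style integer.

    "cpu" -> -1, "cuda" -> 0 (assumed), "cuda:N" -> N.
    """
    if device == "cpu":
        return -1
    if device == "cuda":
        return 0
    prefix, sep, index = device.partition(":")
    if prefix == "cuda" and sep and index.isdigit():
        return int(index)
    raise ValueError(f"Unrecognized device string: {device!r}")


def to_label_model_device(device: int) -> str:
    """Map a MultitaskClassifier-style integer to a LabelModel-style string."""
    return "cpu" if device < 0 else f"cuda:{device}"


if __name__ == "__main__":
    for name in ("cpu", "cuda", "cuda:1"):
        print(name, "->", to_classifier_device(name))
```

Keeping the conversion in one place avoids passing `"cuda:0"` where an integer is expected, or `0` where a string is expected.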
