Environment:Snorkel team Snorkel PyTorch

From Leeroopedia
Knowledge Sources
Domains Infrastructure, Deep_Learning
Last Updated 2026-02-14 21:00 GMT

Overview

A Python 3.11+ environment with PyTorch >= 1.2.0, NumPy, SciPy, and the rest of the core scientific computing stack. It is required for all Snorkel label modeling and classification workflows.

Description

This environment provides the core runtime for Snorkel's PyTorch-dependent modules: the LabelModel (a generative model trained via gradient descent), the MultitaskClassifier (a multi-task neural network), and all classification/training infrastructure. PyTorch is used for tensor operations, autograd, nn.Module subclassing, DataParallel, and optimizer/scheduler registries.

The environment includes the full scientific Python stack: NumPy for matrix operations, SciPy for sparse matrix handling, pandas for data manipulation, scikit-learn for metrics, and TensorBoard for visualization.

Usage

Use this environment for any Snorkel workflow that involves the LabelModel (weak supervision pipeline) or the MultitaskClassifier/SliceAwareClassifier (slice-aware training, multitask classification). The core labeling and augmentation modules (LabelingFunction, TransformationFunction, appliers) do not call PyTorch themselves and work without this environment, but the label model and classification pipelines require PyTorch.
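Whether the PyTorch-dependent pipelines are usable in a given interpreter can be probed without importing anything heavy. The following is a minimal sketch using the standard-library `importlib.util.find_spec`; the helper name `has_module` and the flag `TORCH_AVAILABLE` are hypothetical and not part of Snorkel.

```python
from importlib.util import find_spec


def has_module(name: str) -> bool:
    """Return True if a top-level module can be located without importing it."""
    return find_spec(name) is not None


# Hypothetical gate: per the usage note above, only the label model and
# classification pipelines need PyTorch.
TORCH_AVAILABLE = has_module("torch")

if __name__ == "__main__":
    if TORCH_AVAILABLE:
        print("PyTorch found: label model and classification pipelines can run.")
    else:
        print("PyTorch missing: install it before using those pipelines.")
```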

System Requirements

Category | Requirement | Notes
OS | Linux, macOS, Windows | Linux recommended for CUDA support
Python | >= 3.11 | Enforced in setup.py: `python_requires=">=3.11"`
Hardware | CPU (default) or NVIDIA GPU | LabelModel defaults to CPU; MultitaskClassifier defaults to GPU 0
Disk | 2GB+ | For package installation and model checkpoints

Dependencies

System Packages

  • `python` >= 3.11

Python Packages (Essential)

  • `torch` >= 1.2.0
  • `numpy` >= 1.24.0
  • `scipy` >= 1.2.0
  • `pandas` >= 1.0.0
  • `scikit-learn` >= 0.20.2
  • `tqdm` >= 4.33.0
  • `munkres` >= 1.0.6
  • `networkx` >= 2.2
  • `tensorboard` >= 2.13.0
  • `protobuf` >= 3.19.6

Credentials

No environment variables or credentials are required. All configuration is done through Python config objects (NamedTuples/Config classes). The codebase contains zero `os.environ` calls.

Quick Install

# Install Snorkel with all essential dependencies
pip install snorkel

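# Or install dependencies manually (quote each specifier so the shell
# does not interpret ">" as an output redirect)
pip install "torch>=1.2.0" "numpy>=1.24.0" "scipy>=1.2.0" "pandas>=1.0.0" "scikit-learn>=0.20.2" "tqdm>=4.33.0" "munkres>=1.0.6" "networkx>=2.2" "tensorboard>=2.13.0" "protobuf>=3.19.6"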
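After installation, the minimum versions listed under Dependencies can be checked from Python. This is a rough sketch built on the standard-library `importlib.metadata`; the names `MINIMUMS`, `parse_version`, `is_older`, and `find_problems` are illustrative assumptions, and the version parsing is deliberately simplistic (it ignores pre-release and local suffixes such as `rc1` or `+cu118`).

```python
import re
from importlib.metadata import PackageNotFoundError, version
from typing import Callable, Dict, List, Tuple

# Minimum versions copied from the dependency list above.
MINIMUMS: Dict[str, str] = {
    "torch": "1.2.0", "numpy": "1.24.0", "scipy": "1.2.0",
    "pandas": "1.0.0", "scikit-learn": "0.20.2", "tqdm": "4.33.0",
    "munkres": "1.0.6", "networkx": "2.2", "tensorboard": "2.13.0",
    "protobuf": "3.19.6",
}


def parse_version(text: str) -> Tuple[int, ...]:
    """Keep only the leading dotted integers of a version string."""
    match = re.match(r"\d+(?:\.\d+)*", text)
    if match is None:
        return ()
    return tuple(int(part) for part in match.group(0).split("."))


def is_older(installed: str, minimum: str) -> bool:
    """Compare two versions after zero-padding them to the same length."""
    a, b = parse_version(installed), parse_version(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a < b


def find_problems(
    minimums: Dict[str, str], lookup: Callable[[str], str] = version
) -> List[str]:
    """Return one message per package that is missing or too old."""
    problems = []
    for name, minimum in minimums.items():
        try:
            installed = lookup(name)
        except PackageNotFoundError:
            problems.append(f"{name}: not installed (need >= {minimum})")
            continue
        if is_older(installed, minimum):
            problems.append(f"{name}: {installed} < {minimum}")
    return problems


if __name__ == "__main__":
    for line in find_problems(MINIMUMS) or ["All minimum versions satisfied."]:
        print(line)
```

The `lookup` parameter exists only so the check can be exercised without touching the real environment; by default it queries the installed distributions.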

Code Evidence

CUDA validation from `label_model.py:141-143`:

        # Confirm that cuda is available if config is using CUDA
        if self.config.device != "cpu" and not torch.cuda.is_available():
            raise ValueError("device=cuda but CUDA not available.")

MultitaskClassifier soft fallback from `multitask_classifier.py:481-489`:

    def _move_to_device(self) -> None:
        device = self.config.device
        if device >= 0:
            if torch.cuda.is_available():
                logging.info(f"Moving model to GPU (cuda:{device}).")
                self.to(torch.device(f"cuda:{device}"))
            else:
                logging.info("No cuda device available. Switch to cpu instead.")
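The two excerpts above handle a missing GPU differently: one raises, the other logs and stays on CPU. The sketch below restates both patterns as plain functions so they can be compared side by side. It is not Snorkel code: the function names are hypothetical, and CUDA availability is passed in as a boolean (in practice, the result of `torch.cuda.is_available()`) so the example runs without PyTorch.

```python
import logging


def check_label_model_device(device: str, cuda_available: bool) -> str:
    """Hard-fail pattern: reject any non-CPU device when CUDA is missing."""
    if device != "cpu" and not cuda_available:
        raise ValueError("device=cuda but CUDA not available.")
    return device


def resolve_classifier_device(device: int, cuda_available: bool) -> str:
    """Soft-fallback pattern: log at INFO level and stay on CPU."""
    if device >= 0 and cuda_available:
        return f"cuda:{device}"
    if device >= 0:
        logging.info("No cuda device available. Switch to cpu instead.")
    return "cpu"


if __name__ == "__main__":
    print(resolve_classifier_device(0, cuda_available=False))
    print(check_label_model_device("cpu", cuda_available=False))
```

Because the fallback message is emitted at INFO level, it is easy to miss under Python's default logging configuration, which only shows WARNING and above.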

Python version requirement from `setup.py:49`:

    python_requires=">=3.11",

Common Errors

Error Message | Cause | Solution
`ValueError: device=cuda but CUDA not available.` | LabelModel configured with a non-CPU device (e.g. `device="cuda"`) while `torch.cuda.is_available()` returns False | Use the default `device="cpu"`, or install a CUDA-enabled PyTorch build with a working NVIDIA driver
`No cuda device available. Switch to cpu instead.` | MultitaskClassifier configured with a GPU index but CUDA is unavailable | Not an exception: the message is logged at INFO level and the model stays on CPU; set `device=-1` to select CPU explicitly
`ModuleNotFoundError: No module named 'torch'` | PyTorch not installed | `pip install "torch>=1.2.0"`

Compatibility Notes

  • Device specification is inconsistent: LabelModel uses a string-based device (`"cpu"`, `"cuda:0"`), while MultitaskClassifier uses an integer-based one (`-1`=CPU, `0`=GPU 0). LabelModel hard-fails with a `ValueError` on missing CUDA; MultitaskClassifier falls back to CPU and only logs an INFO message.
  • DataParallel ON by default: The MultitaskClassifier wraps all module pool entries in `nn.DataParallel` by default (`dataparallel: bool = True`). Set `dataparallel=False` on single-GPU systems to avoid overhead.
  • No upper version bounds: All dependencies use `>=` constraints with no upper bounds, which could lead to breakage with future major versions of PyTorch or NumPy.
  • Build system: Requires `setuptools >= 40.6.2` and `wheel >= 0.30.0` (from pyproject.toml).
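Because the two device conventions differ, a project that uses both models may want to derive them from a single setting. The helper below is one possible sketch; `DeviceSettings` and `device_settings` are hypothetical names, and only the value conventions (`"cpu"` / `"cuda:N"` for the label model, `-1` / `N` for the classifier) come from the notes above.

```python
from typing import NamedTuple, Optional


class DeviceSettings(NamedTuple):
    label_model: str  # string convention, e.g. "cpu" or "cuda:0"
    classifier: int   # integer convention, -1 for CPU or a GPU index


def device_settings(gpu_index: Optional[int]) -> DeviceSettings:
    """Translate one GPU choice (None means CPU) into both conventions."""
    if gpu_index is None:
        return DeviceSettings(label_model="cpu", classifier=-1)
    if gpu_index < 0:
        raise ValueError("gpu_index must be None or a non-negative integer")
    return DeviceSettings(label_model=f"cuda:{gpu_index}", classifier=gpu_index)


if __name__ == "__main__":
    print(device_settings(None))
    print(device_settings(0))
```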
