
Environment: Snorkel SpaCy NLP

From Leeroopedia
Knowledge Sources
Domains Infrastructure, NLP
Last Updated 2026-02-14 21:00 GMT

Overview

An optional environment providing spaCy >= 2.1.0 and an English language model, required for NLP labeling functions, slicing functions, and text preprocessing.

Description

This environment provides NLP preprocessing capabilities via spaCy. It is required when using NLPLabelingFunction, NLPSlicingFunction, or SpacyPreprocessor. These components use spaCy to tokenize text and extract named entities, part-of-speech tags, and dependency parses, making them available to user-defined labeling and slicing functions.

The default language model is `en_core_web_sm` (small English model). Optional GPU acceleration is available via `spacy.prefer_gpu()`.

Usage

Use this environment when writing labeling or slicing functions that need NLP features (tokens, entities, POS tags). If your labeling functions only use string operations or regex, this environment is not required.
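As an illustration, the core of such a labeling function can be sketched as a plain Python function over the preprocessed spaCy Doc. This is a minimal sketch: the names `lf_has_person`, `PERSON_MENTIONED`, and the `x.doc` attribute are illustrative assumptions, not taken from the source.

```python
# Sketch of the logic an NLP labeling function would run, assuming the
# preprocessor attaches the parsed spaCy Doc to `x.doc`.
ABSTAIN = -1
PERSON_MENTIONED = 1

def lf_has_person(x) -> int:
    """Vote PERSON_MENTIONED if the Doc contains a PERSON entity, else abstain."""
    if any(ent.label_ == "PERSON" for ent in x.doc.ents):
        return PERSON_MENTIONED
    return ABSTAIN
```

With Snorkel and spaCy installed, a function body like this would typically be registered through the NLP labeling-function machinery mentioned above rather than called directly.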

Important: spaCy is NOT guarded by try/except ImportError. Importing any module that depends on spaCy (e.g., `from snorkel.labeling import NLPLabelingFunction`) will fail immediately if spaCy is not installed.
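Because the library itself does not guard the import, callers that want to degrade gracefully can add their own guard. A minimal sketch (the `HAS_NLP_LFS` flag is an illustrative name):

```python
# Sketch: guard the spaCy-backed import yourself, since snorkel does not.
# Without spaCy installed, the import below raises ImportError immediately.
try:
    from snorkel.labeling import NLPLabelingFunction  # requires spacy
    HAS_NLP_LFS = True
except ImportError:
    NLPLabelingFunction = None  # fall back to regex/string labeling functions
    HAS_NLP_LFS = False
```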

System Requirements

Category | Requirement   | Notes
Python   | >= 3.11       | Inherited from core Snorkel requirement
Hardware | CPU (default) | GPU optional via `gpu=True` parameter

Dependencies

Python Packages

  • `spacy` >= 2.1.0
  • `blis` >= 0.3.0

Language Models

  • `en_core_web_sm` (default, must be downloaded separately)

Credentials

No credentials required.

Quick Install

# Install spaCy
pip install "spacy>=2.1.0" "blis>=0.3.0"

# Download the default English model
python -m spacy download en_core_web_sm
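After installing, you can probe the environment before importing any spaCy-backed Snorkel module. This is a sketch; the `check_nlp_environment` helper is an illustrative name, relying on the fact that downloaded spaCy models are installed as importable packages:

```python
# Sketch: report what is missing from the optional NLP environment
# without triggering an ImportError.
import importlib.util

def check_nlp_environment(model: str = "en_core_web_sm") -> list:
    """Return a list of human-readable problems; an empty list means ready."""
    problems = []
    if importlib.util.find_spec("spacy") is None:
        problems.append('spaCy not installed: pip install "spacy>=2.1.0"')
    elif importlib.util.find_spec(model) is None:
        # Models fetched via `spacy download` install as importable packages.
        problems.append("model missing: python -m spacy download " + model)
    return problems
```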

Code Evidence

Direct import without guard from `preprocess/nlp.py:3`:

import spacy

Default language model from `preprocess/nlp.py:9`:

EN_CORE_WEB_SM = "en_core_web_sm"

GPU preference from `preprocess/nlp.py:69-72`:

        self.gpu = gpu
        if self.gpu:
            spacy.prefer_gpu()
        self._nlp = spacy.load(language, disable=disable or [])

Common Errors

Error Message | Cause | Solution
`ModuleNotFoundError: No module named 'spacy'` | spaCy not installed | `pip install "spacy>=2.1.0"`
`OSError: Can't find model 'en_core_web_sm'` | Language model not downloaded | `python -m spacy download en_core_web_sm`
`ModuleNotFoundError: No module named 'blis'` | blis not installed | `pip install "blis>=0.3.0"`

Compatibility Notes

  • No ImportError guard: Unlike many optional dependencies in Python libraries, spaCy is imported directly at the top of the module. This means `from snorkel.labeling import NLPLabelingFunction` will crash if spaCy is not installed, even if you never call the class.
  • GPU is opt-in: Unlike the MultitaskClassifier, which defaults to GPU, spaCy GPU usage requires explicitly passing `gpu=True` to the NLPLabelingFunction or SpacyPreprocessor constructor.
  • Memoization enabled by default: NLPLabelingFunction instances share a cached spaCy Doc object across all instances using the same preprocessor, reducing repeated NLP processing.
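The memoization behavior can be pictured with a small cache sketch. Here `lru_cache` stands in for Snorkel's own Doc cache, and `parse` is an illustrative placeholder for `nlp(text)`:

```python
# Minimal sketch of per-text memoization: a repeated text is parsed once,
# so labeling functions sharing a preprocessor reuse the same result.
from functools import lru_cache

PARSE_CALLS = 0

@lru_cache(maxsize=None)
def parse(text: str):
    """Placeholder for spaCy Doc construction; counts actual parses."""
    global PARSE_CALLS
    PARSE_CALLS += 1
    return tuple(text.split())  # stand-in for nlp(text)

first = parse("Snorkel labels data")
second = parse("Snorkel labels data")  # served from cache; no second parse
```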
