Environment:Snorkel team Snorkel SpaCy NLP
| Knowledge Sources | |
|---|---|
| Domains | Infrastructure, NLP |
| Last Updated | 2026-02-14 21:00 GMT |
Overview
An optional environment providing spaCy >= 2.1.0 and an English language model. It is required only for NLP labeling functions, NLP slicing functions, and spaCy-based text preprocessing.
Description
This environment provides NLP preprocessing capabilities via spaCy. It is required when using `NLPLabelingFunction`, `NLPSlicingFunction`, or `SpacyPreprocessor`. These components run spaCy over the input text to tokenize it and to produce named entities, part-of-speech (POS) tags, and dependency parses, then expose the results to user-defined labeling and slicing functions.
The default language model is `en_core_web_sm` (small English model). Optional GPU acceleration is available via `spacy.prefer_gpu()`.
Usage
Use this environment when writing labeling or slicing functions that need NLP features (tokens, entities, POS tags). If your labeling functions only use string operations or regex, this environment is not required.
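As a point of contrast, the following minimal sketch shows the kind of labeling logic that does not need this environment: it relies only on the standard-library `re` module. The function is written as a plain Python function so it runs without Snorkel installed; the names `lf_contains_url`, `SPAM`, and `URL_PATTERN` are hypothetical, and `ABSTAIN = -1` is assumed to follow Snorkel's usual abstain convention.

```python
import re

# Label constants are assumptions for this sketch.
ABSTAIN = -1
SPAM = 1

# Matches http:// or https:// followed by any non-whitespace characters.
URL_PATTERN = re.compile(r"https?://\S+", re.IGNORECASE)


def lf_contains_url(text: str) -> int:
    """Vote SPAM when the text contains a URL, otherwise abstain."""
    return SPAM if URL_PATTERN.search(text) else ABSTAIN


print(lf_contains_url("Check out http://example.com now"))  # 1
print(lf_contains_url("Nice video"))  # -1
```

Logic like this can be wrapped in an ordinary labeling function; only functions that read tokens, entities, or POS tags from a spaCy `Doc` need the NLP variants.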
Important: the spaCy import is NOT guarded by try/except ImportError. Importing any module that depends on spaCy (e.g., `from snorkel.labeling.lf.nlp import NLPLabelingFunction`) fails immediately with `ModuleNotFoundError` if spaCy is not installed.
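Because the library itself does not guard the import, calling code can add its own guard. The sketch below is one way to do that, assuming `NLPLabelingFunction` lives at `snorkel.labeling.lf.nlp` (an assumption about the module path; adjust it to your installed version). The names `HAS_NLP` and `nlp_status` are hypothetical.

```python
# Guard the NLP import in calling code so a missing spaCy (or Snorkel)
# degrades gracefully instead of stopping the program at import time.
try:
    from snorkel.labeling.lf.nlp import NLPLabelingFunction  # assumed module path
    HAS_NLP = True
except ImportError:  # ModuleNotFoundError is a subclass of ImportError
    NLPLabelingFunction = None
    HAS_NLP = False


def nlp_status() -> str:
    """Describe whether NLP labeling functions can be used in this process."""
    if HAS_NLP:
        return "NLP labeling functions available"
    return "NLP components unavailable; use string or regex functions only"


print(nlp_status())
```

Code that builds labeling functions can then check `HAS_NLP` before defining any NLP-based ones.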
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| Python | >= 3.11 | Inherited from core Snorkel requirement |
| Hardware | CPU (default) | GPU optional via `gpu=True` parameter |
Dependencies
Python Packages
- `spacy` >= 2.1.0
- `blis` >= 0.3.0
Language Models
- `en_core_web_sm` (default, must be downloaded separately)
Credentials
No credentials required.
Quick Install
# Install spaCy
# Quote the specifiers so the shell does not treat ">" as output redirection
pip install "spacy>=2.1.0" "blis>=0.3.0"
# Download the default English model
python -m spacy download en_core_web_sm
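After installing, a short check along the following lines can confirm that the environment matches the requirements listed here. This is a sketch using only the standard library; it assumes the model was installed as a regular Python package named `en_core_web_sm` (which is how `spacy download` normally installs it), and the helper names `version_tuple` and `check_environment` are hypothetical.

```python
import re
from importlib import metadata, util

# Minimum versions taken from the dependency list in this document.
MIN_VERSIONS = {"spacy": (2, 1, 0), "blis": (0, 3, 0)}
MODEL_PACKAGE = "en_core_web_sm"


def version_tuple(version: str) -> tuple:
    """Convert '3.7.2' to (3, 7, 2), keeping only leading digits of each part."""
    parts = []
    for piece in version.split(".")[:3]:
        match = re.match(r"\d+", piece)
        parts.append(int(match.group()) if match else 0)
    while len(parts) < 3:  # pad '2.1' to (2, 1, 0) so comparisons are fair
        parts.append(0)
    return tuple(parts)


def check_environment() -> list:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    for package, minimum in MIN_VERSIONS.items():
        try:
            installed = metadata.version(package)
        except metadata.PackageNotFoundError:
            problems.append(f"{package} is not installed")
            continue
        if version_tuple(installed) < minimum:
            problems.append(f"{package} {installed} is older than required")
    if util.find_spec(MODEL_PACKAGE) is None:
        problems.append(f"model {MODEL_PACKAGE} is not installed")
    return problems


for problem in check_environment():
    print(problem)
```

The script prints nothing when every requirement is met.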
Code Evidence
Direct import without guard from `preprocess/nlp.py:3`:
import spacy
Default language model from `preprocess/nlp.py:9`:
EN_CORE_WEB_SM = "en_core_web_sm"
GPU preference from `preprocess/nlp.py:69-72`:
self.gpu = gpu
if self.gpu:
    spacy.prefer_gpu()
self._nlp = spacy.load(language, disable=disable or [])
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `ModuleNotFoundError: No module named 'spacy'` | spaCy not installed | `pip install "spacy>=2.1.0"` |
| `OSError: Can't find model 'en_core_web_sm'` | Language model not downloaded | `python -m spacy download en_core_web_sm` |
| `ModuleNotFoundError: No module named 'blis'` | blis not installed | `pip install "blis>=0.3.0"` |
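The table can also be expressed as a small lookup that maps a caught exception to its suggested command, which may be useful in setup scripts. This is an illustrative sketch, not part of Snorkel; the function name `suggest_fix` is hypothetical, and the matching relies on the `name` attribute that Python sets on `ModuleNotFoundError` and on the model name appearing in the `OSError` message.

```python
def suggest_fix(error: Exception) -> str:
    """Map an exception from the error table to its suggested command."""
    if isinstance(error, ModuleNotFoundError):
        # `name` holds the module that could not be imported.
        if error.name == "spacy":
            return 'pip install "spacy>=2.1.0"'
        if error.name == "blis":
            return 'pip install "blis>=0.3.0"'
    if isinstance(error, OSError) and "en_core_web_sm" in str(error):
        return "python -m spacy download en_core_web_sm"
    return "no known fix"


missing = ModuleNotFoundError("No module named 'spacy'", name="spacy")
print(suggest_fix(missing))  # pip install "spacy>=2.1.0"
```

In practice the exception would come from wrapping the failing import or the preprocessor construction in a `try` block and passing the caught error to the function.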
Compatibility Notes
- No ImportError guard: Unlike many optional dependencies in Python libraries, spaCy is imported directly at the top of the module. This means `from snorkel.labeling.lf.nlp import NLPLabelingFunction` raises `ModuleNotFoundError` if spaCy is not installed, even if you never instantiate the class.
- GPU is opt-in: Unlike the `MultitaskClassifier`, which defaults to GPU, spaCy GPU usage requires explicitly passing `gpu=True` to the `NLPLabelingFunction` or `SpacyPreprocessor` constructor.
- Memoization enabled by default: `NLPLabelingFunction` instances that use the same preprocessor share its cache of spaCy `Doc` objects, so the same text does not have to be re-parsed for each labeling function.