# Heuristic: Snorkel NLP Preprocessor Memoization
| Field | Value |
|---|---|
| Knowledge Sources | |
| Domains | NLP, Optimization |
| Last Updated | 2026-02-14 21:00 GMT |
## Overview
NLPLabelingFunction enables memoization by default, caching spaCy Doc objects across all NLP labeling function instances that share the same preprocessor to avoid redundant NLP processing.
## Description
The Mapper/Preprocessor system in Snorkel includes an optional memoization cache. When `memoize=True` (the default for NLP preprocessors), the result of applying a preprocessor to a data point is stored in a dictionary keyed by a hashable representation of the input. Subsequent calls with the same input return the cached result. This is particularly valuable for NLP labeling functions where spaCy parsing is expensive.
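The caching pattern can be sketched in isolation. The class below is an illustrative stand-in, not Snorkel's actual `Preprocessor` API; `parse_fn` plays the role of an expensive spaCy pipeline call:

```python
class MemoizedPreprocessor:
    """Illustrative preprocessor that caches results per input (not Snorkel's API)."""

    def __init__(self, parse_fn, memoize=True):
        self.parse_fn = parse_fn  # expensive function, e.g. a spaCy pipeline
        self.memoize = memoize
        self._cache = {}
        self.calls = 0  # counts actual invocations of parse_fn

    def __call__(self, text):
        # Serve from the cache when memoization is on and the input was seen before.
        if self.memoize and text in self._cache:
            return self._cache[text]
        self.calls += 1
        result = self.parse_fn(text)
        if self.memoize:
            self._cache[text] = result
        return result


pre = MemoizedPreprocessor(parse_fn=str.split)
tokens_a = pre("the quick brown fox")
tokens_b = pre("the quick brown fox")  # served from cache; parse_fn not called again
```

As in Snorkel, the cache key must be hashable, which is why unhashable data points require `memoize=False`.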
Additionally, the memoization system uses `pickle.loads(pickle.dumps(x))` instead of `copy.deepcopy()` for creating copies of data points, which is a deliberate workaround for known deepcopy issues with pandas Series and SimpleNamespace objects containing dictionary attributes.
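The effect of the roundtrip can be seen on a `SimpleNamespace` carrying a dictionary attribute. This is a standalone illustration of the copying technique, not Snorkel code:

```python
import pickle
from types import SimpleNamespace

x = SimpleNamespace(text="hello", meta={"lang": "en"})

# Pickle roundtrip: serialize and immediately deserialize to get an
# independent copy of the whole object graph, nested dict included.
x_copy = pickle.loads(pickle.dumps(x))

# Mutating the copy's dict leaves the original untouched.
x_copy.meta["lang"] = "de"
```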
## Usage
The default behavior is typically correct and should not need adjustment. Consider disabling memoization (`memoize=False`) only when:
- Memory is severely constrained and the cache grows too large
- Data points are not hashable
- You need to mutate data points in place
## The Insight (Rule of Thumb)
- Action: Keep `memoize=True` (default) for NLPLabelingFunction and SpacyPreprocessor when applying multiple NLP-based LFs to the same dataset.
- Value: Default is `True` for NLP preprocessors.
- Trade-off: Trades memory for speed. Each unique data point's spaCy Doc is cached, which can consume significant memory for large datasets but avoids re-parsing text for each NLP labeling function.
## Reasoning
When applying N NLP labeling functions to a dataset, without memoization each LF would independently parse every text with spaCy. With memoization, each text is parsed once, and the resulting Doc object is shared across all N LFs. For a dataset of D documents and N NLP LFs, this reduces spaCy calls from D*N to D.
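The reduction from D*N to D parses can be checked with a counted stand-in for spaCy (all names below are hypothetical):

```python
parse_calls = 0

def fake_parse(text):
    """Stand-in for an expensive spaCy parse; counts invocations."""
    global parse_calls
    parse_calls += 1
    return text.split()

cache = {}

def cached_parse(text):
    # Memoized wrapper: each unique text is parsed exactly once.
    if text not in cache:
        cache[text] = fake_parse(text)
    return cache[text]

docs = ["alpha beta", "gamma delta", "epsilon zeta"]  # D = 3 documents
lfs = [  # N = 3 labeling functions operating on parsed docs
    lambda doc: len(doc),
    lambda doc: doc[0],
    lambda doc: doc[-1],
]

for lf in lfs:
    for text in docs:
        lf(cached_parse(text))

# With the cache, fake_parse runs D times (3), not D * N times (9).
```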
The pickle roundtrip used for deep copying is explained in a code comment in `map/core.py:157-160`:
```python
# NB: using pickle roundtrip as a more robust deepcopy
# As an example, calling deepcopy on a pd.Series or SimpleNamespace
# with a dictionary attribute won't create a copy of the dictionary
x_mapped = pickle.loads(pickle.dumps(x))
```
Memoization cache from `map/core.py:152-166`:
```python
if self.memoize:
    x_hashable = self._memoize_key(x)
    if x_hashable in self._cache:
        return self._cache[x_hashable]
...
if self.memoize:
    self._cache[x_hashable] = x_mapped
```