# Heuristic: Snorkel NLP Preprocessor Memoization
| Field | Value |
|---|---|
| Knowledge Sources | |
| Domains | NLP, Optimization |
| Last Updated | 2026-02-14 21:00 GMT |
## Overview
NLPLabelingFunction enables memoization by default, caching spaCy Doc objects across all NLP labeling function instances that share the same preprocessor to avoid redundant NLP processing.
## Description
The Mapper/Preprocessor system in Snorkel includes an optional memoization cache. When `memoize=True` (the default for NLP preprocessors), the result of applying a preprocessor to a data point is stored in a dictionary keyed by a hashable representation of the input. Subsequent calls with the same input return the cached result. This is particularly valuable for NLP labeling functions where spaCy parsing is expensive.
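The caching pattern can be sketched in isolation. The class below is an illustrative stand-in, not Snorkel's actual `Preprocessor` API; `parse_fn` plays the role of an expensive spaCy pipeline call:

```python
class MemoizedPreprocessor:
    """Illustrative preprocessor that caches results per input (not Snorkel's API)."""

    def __init__(self, parse_fn, memoize=True):
        self.parse_fn = parse_fn  # expensive function, e.g. a spaCy pipeline
        self.memoize = memoize
        self._cache = {}
        self.calls = 0  # counts actual invocations of parse_fn

    def __call__(self, text):
        # Serve from the cache when memoization is on and the input was seen before.
        if self.memoize and text in self._cache:
            return self._cache[text]
        self.calls += 1
        result = self.parse_fn(text)
        if self.memoize:
            self._cache[text] = result
        return result


pre = MemoizedPreprocessor(parse_fn=str.split)
tokens_a = pre("the quick brown fox")
tokens_b = pre("the quick brown fox")  # served from cache; parse_fn not called again
```

As in Snorkel, the cache key must be hashable, which is why unhashable data points require `memoize=False`.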
Additionally, the memoization system uses `pickle.loads(pickle.dumps(x))` instead of `copy.deepcopy()` for creating copies of data points, which is a deliberate workaround for known deepcopy issues with pandas Series and SimpleNamespace objects containing dictionary attributes.
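The effect of the roundtrip can be seen on a `SimpleNamespace` carrying a dictionary attribute. This is a standalone illustration of the copying technique, not Snorkel code:

```python
import pickle
from types import SimpleNamespace

x = SimpleNamespace(text="hello", meta={"lang": "en"})

# Pickle roundtrip: serialize and immediately deserialize to get an
# independent copy of the whole object graph, nested dict included.
x_copy = pickle.loads(pickle.dumps(x))

# Mutating the copy's dict leaves the original untouched.
x_copy.meta["lang"] = "de"
```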
## Usage
The default behavior is typically correct and should not need adjustment. Consider disabling memoization (`memoize=False`) only when:
- Memory is severely constrained and the cache grows too large
- Data points are not hashable
- You need to mutate data points in place
## The Insight (Rule of Thumb)
- Action: Keep `memoize=True` (default) for NLPLabelingFunction and SpacyPreprocessor when applying multiple NLP-based LFs to the same dataset.
- Value: Default is `True` for NLP preprocessors.
- Trade-off: Trades memory for speed. Each unique data point's spaCy Doc is cached, which can consume significant memory for large datasets but avoids re-parsing text for each NLP labeling function.
## Reasoning
When applying N NLP labeling functions to a dataset, without memoization each LF would independently parse every text with spaCy. With memoization, each text is parsed once, and the resulting Doc object is shared across all N LFs. For a dataset of D documents and N NLP LFs, this reduces spaCy calls from D*N to D.
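The reduction from D*N to D parses can be checked with a counted stand-in for spaCy (all names below are hypothetical):

```python
parse_calls = 0

def fake_parse(text):
    """Stand-in for an expensive spaCy parse; counts invocations."""
    global parse_calls
    parse_calls += 1
    return text.split()

cache = {}

def cached_parse(text):
    # Memoized wrapper: each unique text is parsed exactly once.
    if text not in cache:
        cache[text] = fake_parse(text)
    return cache[text]

docs = ["alpha beta", "gamma delta", "epsilon zeta"]  # D = 3 documents
lfs = [  # N = 3 labeling functions operating on parsed docs
    lambda doc: len(doc),
    lambda doc: doc[0],
    lambda doc: doc[-1],
]

for lf in lfs:
    for text in docs:
        lf(cached_parse(text))

# With the cache, fake_parse runs D times (3), not D * N times (9).
```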
The pickle roundtrip used for deep copying is explained in a code comment in `map/core.py:157-160`:
```python
# NB: using pickle roundtrip as a more robust deepcopy
# As an example, calling deepcopy on a pd.Series or SimpleNamespace
# with a dictionary attribute won't create a copy of the dictionary
x_mapped = pickle.loads(pickle.dumps(x))
```
Memoization cache from `map/core.py:152-166`:
```python
if self.memoize:
    x_hashable = self._memoize_key(x)
    if x_hashable in self._cache:
        return self._cache[x_hashable]
...
if self.memoize:
    self._cache[x_hashable] = x_mapped
```