
Heuristic:ChenghaoMou Text dedup Bloom Filter Single Process

From Leeroopedia
Knowledge Sources
Domains Optimization, Text_Deduplication, Bloom_Filters
Last Updated 2026-02-14 21:00 GMT

Overview

Bloom filter deduplication must run with `num_proc=1` because the Bloom filter object is stateful and not pickleable across processes.

Description

The Bloom filter algorithm maintains a single in-memory Bloom filter object (from the `rbloom` library) that tracks which documents have been seen. Each document is checked against the filter and then added to it. This sequential, stateful nature means the filter cannot be distributed across multiple processes. The `rbloom.Bloom` object is not pickleable, which prevents Python's multiprocessing from serializing it for worker processes.
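The check-then-add pattern can be sketched in pure Python. Note that `TinyBloom` below is only an illustrative stand-in for the `rbloom.Bloom` Rust extension, written here so the sequential dependency is visible; the class name and its internals are not from the library.

```python
import hashlib


class TinyBloom:
    """Illustrative pure-Python Bloom filter (the library itself uses rbloom,
    a Rust extension; this is only a sketch of the data structure)."""

    def __init__(self, num_bits: int = 1 << 20, num_hashes: int = 7):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, item: str):
        # Derive k bit positions by salting a single hash function.
        for i in range(self.num_hashes):
            h = hashlib.blake2b(item.encode("utf-8"), salt=str(i).encode()).digest()
            yield int.from_bytes(h[:8], "little") % self.num_bits

    def add(self, item: str) -> None:
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item: str) -> bool:
        return all((self.bits[p // 8] >> (p % 8)) & 1 for p in self._positions(item))


def dedup_flags(texts, bloom):
    """Check-then-add must run sequentially on ONE filter instance:
    whether row i is a duplicate depends on every add() before it."""
    flags = []
    for t in texts:
        flags.append(t not in bloom)  # True = first occurrence, keep it
        bloom.add(t)
    return flags
```

Because each membership check reads state produced by every earlier `add()`, the loop cannot be split across workers without sharing the filter.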

Usage

Apply this heuristic whenever using the Bloom filter deduplication algorithm. If `num_proc > 1` is set in the configuration, the library emits a warning and overrides the setting to `num_proc=1`. Be aware that Bloom filter deduplication is inherently single-process for the indexing phase, making it slower on large datasets than MinHash or SimHash, which support parallel fingerprinting.
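The warn-and-override behavior amounts to a small guard like the following. The function name `effective_num_proc` is hypothetical; the actual library performs the warning and the hardcoded `num_proc=1` in separate places (see Code Evidence below).

```python
import logging

log = logging.getLogger("text_dedup")


def effective_num_proc(requested: int) -> int:
    """Hypothetical guard mirroring the documented behavior: warn and fall
    back to 1 whenever multiple processes are requested for Bloom indexing."""
    if requested > 1:
        log.warning(
            "Bloom filter does not support multi-processing due to state "
            "requirements. Using num_proc=1 instead."
        )
        return 1
    return requested
```

The practical takeaway is that tuning `num_proc` has no effect on the indexing phase, so benchmark accordingly.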

The Insight (Rule of Thumb)

  • Action: Accept that Bloom filter indexing runs with `num_proc=1` regardless of configuration.
  • Value: `num_proc=1` (hardcoded for the indexing map operation).
  • Trade-off: Single-process indexing is slower than parallel alternatives but ensures correctness. The filtering phase (removing duplicates) can still use multiprocessing.
  • Alternative: If parallelism is critical, consider MinHash LSH or SimHash algorithms instead.
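The two-phase split described in the trade-off bullet can be sketched as follows. A `set` stands in for the Bloom filter's shared mutable state; the phase names are illustrative, not the library's function names.

```python
def index_phase(texts):
    """Phase 1 -- must be single-process: each keep/drop decision
    depends on every add() made before it."""
    seen = set()  # stand-in for the Bloom filter's shared state
    flags = []
    for t in texts:
        flags.append(t not in seen)
        seen.add(t)
    return flags


def filter_phase(texts, flags):
    """Phase 2 -- embarrassingly parallel: each row is judged only by its
    own precomputed flag, so shards could go to independent workers."""
    return [t for t, keep in zip(texts, flags) if keep]
```

Only phase 1 is serialized; once the flags exist, filtering is a stateless per-row predicate and parallelizes freely.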

Reasoning

The Bloom filter is a probabilistic set membership data structure. Each `add()` operation mutates the internal bit array, and each `__contains__()` check reads from it. These operations must happen sequentially on the same object instance to maintain consistency. Python's `multiprocessing` module would need to pickle the Bloom object to send it to worker processes, but the `rbloom.Bloom` C extension object is not pickleable. Even if it were, each worker would get an independent copy, causing missed duplicates.
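The missed-duplicates failure mode is easy to demonstrate with sets standing in for per-worker Bloom copies. Both functions below are illustrative; the sharding scheme (round-robin slices) is an assumption about how a map operation might split rows.

```python
def dedup_shared_state(docs):
    """Correct: one filter instance sees every document in order."""
    seen, kept = set(), []
    for d in docs:
        if d not in seen:
            seen.add(d)
            kept.append(d)
    return kept


def dedup_per_worker_copies(docs, num_workers=2):
    """Broken: each 'worker' starts from its own empty copy of the filter,
    as would happen if multiprocessing pickled a fresh Bloom per process.
    Duplicates landing in different shards are never detected."""
    shards = [docs[i::num_workers] for i in range(num_workers)]
    kept = []
    for shard in shards:
        seen = set()  # independent copy -- no cross-shard visibility
        for d in shard:
            if d not in seen:
                seen.add(d)
                kept.append(d)
    return kept
```

With `["x", "x", "y", "y"]` and two workers, each duplicate pair is split across shards, so the per-worker version keeps all four rows while the shared-state version correctly keeps two.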

The code explicitly uses `new_fingerprint=str(uuid.uuid4())` to bypass HuggingFace Datasets' fingerprinting cache because the Bloom filter's internal state cannot be captured by the cache key.

Code Evidence

Warning and override from `src/text_dedup/bloom_filter.py:28-31`:

if config.algorithm.num_proc > 1:
    log.warning(
        "Bloom filter does not support multi-processing due to state requirements. Using num_proc=1 instead."
    )

Hardcoded `num_proc=1` in the map call from `src/text_dedup/bloom_filter.py:39-46`:

ds = ds.map(
    f,
    num_proc=1,
    desc="Indexing...",
    input_columns=[algo.text_column],
    # * Bloom object is not pickleable
    new_fingerprint=str(uuid.uuid4()),
)

Note the comment `# * Bloom object is not pickleable` confirming the constraint.
