

Workflow:ChenghaoMou Text dedup Bloom Filter Deduplication

From Leeroopedia
Knowledge Sources
Domains Data_Engineering, NLP, Deduplication
Last Updated 2026-02-14 21:00 GMT

Overview

End-to-end process for exact-duplicate text detection using a memory-efficient Bloom filter in a single streaming pass.

Description

This workflow detects exact duplicate documents using a Bloom filter, a probabilistic data structure that can definitively say a document has not been seen before, or probabilistically say it has. Documents are processed sequentially: each document's text is checked against the Bloom filter, and if not present, it is added. This streaming approach is memory-efficient and handles arbitrarily large datasets without storing all document texts. The tradeoff is a configurable false positive rate (documents incorrectly flagged as duplicates) but zero false negatives (true duplicates are never missed).

Goal: A deduplicated dataset with exact-match duplicate documents removed using minimal memory.

Scope: From raw text data through Bloom filter indexing and filtered output.

Strategy: Uses a Bloom filter for O(1) membership testing with configurable error rate, processing documents in a single sequential pass.
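The relationship between expected elements, error rate, and memory mentioned above follows the standard Bloom filter sizing formulas. The sketch below (stdlib-only, not the library's code) computes the bit-array size m and hash-function count k from the two configured values:

```python
import math

def bloom_parameters(expected_elements: int, error_rate: float):
    """Standard Bloom filter sizing:
    m = -n * ln(p) / (ln 2)^2   (bits in the array)
    k = (m / n) * ln 2          (number of hash functions)
    """
    n, p = expected_elements, error_rate
    m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))
    k = max(1, round((m / n) * math.log(2)))
    return m, k

# e.g. 1M expected documents at a 0.1% false positive rate:
m, k = bloom_parameters(1_000_000, 0.001)
# roughly 14.4 million bits (~1.8 MB) and 10 hash functions
```

Halving the error rate grows m only logarithmically, which is why the filter stays small even at very low false positive rates.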

Usage

Execute this workflow when you need to remove exact duplicate documents from a dataset and memory efficiency is a priority. The Bloom filter approach is ideal when near-duplicate detection is not required and the primary concern is removing verbatim copies. It is the simplest and fastest deduplication method in the library, suitable for initial cleaning passes before applying more sophisticated near-duplicate algorithms.

Execution Steps

Step 1: Configuration Loading

Parse the TOML configuration file into a typed Config object. The Bloom filter configuration specifies the expected number of elements and desired error rate, which together determine the filter's bit array size and number of hash functions. The text column name is also configured here.

Key considerations:

  • Expected elements and error rate control memory usage and false positive probability
  • Lower error rates require more memory (larger bit arrays)
  • The text column must contain the full document text for exact matching

Step 2: Data Loading

Load the dataset using the unified data I/O layer from local files or HuggingFace datasets. Each document receives an internal index column for tracking.

Key considerations:

  • Same data loading abstraction as all other algorithms
  • No preprocessing or filtering is applied before indexing
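A minimal stand-in for this step, assuming the records arrive as dicts; the `__index__` column name is illustrative, not necessarily what the I/O layer uses:

```python
def load_with_index(records):
    """Attach a sequential internal index to each record, as the
    workflow's data loading step does, with no other preprocessing."""
    return [{**rec, "__index__": i} for i, rec in enumerate(records)]

docs = load_with_index([{"text": "hello"}, {"text": "world"}, {"text": "hello"}])
```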

Step 3: Bloom Filter Indexing

Process documents sequentially through the Bloom filter. For each document, check if its text is already present in the filter. If present, mark the document as a duplicate. If not present, add the text to the filter and mark as non-duplicate. This step must run single-threaded because the Bloom filter is a stateful data structure that cannot be safely shared across processes.

Key considerations:

  • Single-threaded execution is mandatory (the Bloom filter is neither pickleable nor thread-safe)
  • A warning is logged if num_proc > 1 is configured
  • Each document gets a boolean "duplicate" flag
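The check-then-add loop can be sketched with a toy stdlib Bloom filter; this is an illustration of the technique, not the library's implementation:

```python
import hashlib

class TinyBloom:
    """Toy Bloom filter: k hash positions per item over a fixed bit array."""

    def __init__(self, num_bits: int, num_hashes: int):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, text: str):
        # Derive k positions by salting one cryptographic hash.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{text}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def check_and_add(self, text: str) -> bool:
        """Return True if text was (probably) seen before; add it if not."""
        positions = list(self._positions(text))
        seen = all(self.bits[p // 8] >> (p % 8) & 1 for p in positions)
        if not seen:
            for p in positions:
                self.bits[p // 8] |= 1 << (p % 8)
        return seen

bf = TinyBloom(num_bits=1 << 20, num_hashes=7)
docs = [{"text": t} for t in ["a", "b", "a", "c", "b"]]
for doc in docs:  # single-threaded pass: the filter is mutated in place
    doc["duplicate"] = bf.check_and_add(doc["text"])
```

Because the `bits` array is mutated on every non-duplicate document, sharing it across worker processes would give each worker a stale copy, which is why the step runs single-threaded.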

Step 4: Duplicate Removal and Output

Filter the dataset to keep only non-duplicate documents. Save the deduplicated dataset to disk in HuggingFace Dataset format. Clean up cache files if configured. No cluster metadata is generated for Bloom filter deduplication since it operates on exact matches without clustering.

Key considerations:

  • No cluster assignments are produced (empty dict passed to save)
  • Output is a filtered HuggingFace Dataset
  • The skip_filtering option preserves duplicate flags without removing documents
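The filtering logic reduces to a single pass over the duplicate flags; the sketch below mirrors the behavior described above, with `skip_filtering` named after the article's option but the code itself being illustrative:

```python
docs = [
    {"text": "a", "duplicate": False},
    {"text": "a", "duplicate": True},
    {"text": "b", "duplicate": False},
]

skip_filtering = False
# With skip_filtering, duplicate flags are preserved and nothing is removed;
# otherwise only non-duplicate documents are kept for the saved output.
deduped = docs if skip_filtering else [d for d in docs if not d["duplicate"]]
```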

Execution Diagram

GitHub URL

Workflow Repository