Principle: ChenghaoMou text-dedup Bloom Filter Deduplication
| Knowledge Sources | |
|---|---|
| Domains | Data_Structures, Deduplication |
| Last Updated | 2026-02-14 21:00 GMT |
Overview
A space-efficient probabilistic data structure that performs streaming exact-match deduplication by testing document membership in a set with a controlled false positive rate.
Description
Bloom Filter Deduplication uses a probabilistic set data structure to detect exact duplicate documents in a single pass. For each document, the filter tests whether the full text has been seen before (membership test); if so, the document is marked as a duplicate, otherwise the text is added to the filter. The Bloom filter provides: (1) constant-time O(1) membership testing and insertion, (2) zero false negatives (a previously seen document is always detected), and (3) a controlled false positive rate, determined by the filter's capacity and error-rate parameters, meaning a small fraction of unique documents may be wrongly flagged as duplicates.
The key constraint is that Bloom filters are stateful and not parallelizable: every insertion mutates the shared bit array, so documents must be processed sequentially (num_proc=1) to maintain correctness.
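The flow described above can be sketched in plain Python. This is a minimal illustrative implementation, not the text-dedup library's actual API; the class and function names, the SHA-256 double-hashing scheme, and the default parameters are all assumptions chosen for clarity.

```python
import hashlib
import math


class BloomFilter:
    """Minimal Bloom filter sketch (illustrative; not the text-dedup API)."""

    def __init__(self, capacity: int, error_rate: float):
        # Standard sizing: bit-array length m and hash count k for the
        # target error rate at the given capacity.
        self.m = math.ceil(-capacity * math.log(error_rate) / math.log(2) ** 2)
        self.k = max(1, round(self.m / capacity * math.log(2)))
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, text: str):
        # Derive k bit positions via double hashing of a single digest.
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def __contains__(self, text: str) -> bool:
        # Membership test: all k positions must be set.
        return all(self.bits[p // 8] >> (p % 8) & 1 for p in self._positions(text))

    def add(self, text: str) -> None:
        # Insertion: set all k positions.
        for p in self._positions(text):
            self.bits[p // 8] |= 1 << (p % 8)


def dedup(docs, capacity=10_000, error_rate=1e-4):
    """Single-pass exact dedup: keep a document only if its text is unseen."""
    bf = BloomFilter(capacity, error_rate)
    kept = []
    for doc in docs:  # sequential: the filter state is shared and mutable
        if doc in bf:      # possible duplicate (or rare false positive)
            continue
        bf.add(doc)        # record as seen
        kept.append(doc)
    return kept
```

Note that the loop drops a document on any positive membership test, so a false positive silently discards a unique document; this is the trade-off the error-rate parameter controls.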
Usage
Use this principle when exact-match deduplication is needed and memory efficiency is more important than supporting near-duplicate detection. Ideal for large corpora where exact copies need to be removed in a single streaming pass.
Theoretical Basis
A Bloom filter uses k independent hash functions over an m-bit array:

```
# Abstract Bloom filter operation (NOT real implementation)
for each document:
    if text in bloom_filter:       # all k hash positions are set
        mark_duplicate(document)
    else:
        bloom_filter.add(text)     # set k hash positions
```
The false positive rate after inserting n elements is approximately:

p ≈ (1 − e^(−kn/m))^k

Lowering p requires more bits per element (larger m) and a correspondingly tuned number of hash functions k.