Principle: ChenghaoMou text-dedup Bloom Filter Deduplication
| Knowledge Sources | |
|---|---|
| Domains | Data_Structures, Deduplication |
| Last Updated | 2026-02-14 21:00 GMT |
Overview
A space-efficient probabilistic data structure that performs streaming exact-match deduplication by testing document membership in a set with a controlled false positive rate.
Description
Bloom Filter Deduplication uses a probabilistic set data structure to detect exact duplicate documents in a single pass. For each document, the filter tests whether the full text has been seen before (membership test); if so, the document is marked as a duplicate, otherwise the text is added to the filter. The Bloom filter provides: (1) constant-time O(1) membership testing and insertion, (2) zero false negatives (a previously seen document is always detected), and (3) a controlled false positive rate, determined by the filter's capacity and error-rate parameters, meaning a small fraction of unique documents may be wrongly flagged as duplicates.
The key constraint is that Bloom filters are stateful and not parallelizable: every insertion mutates the shared bit array, so documents must be processed sequentially (num_proc=1) to maintain correctness.
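The flow described above can be sketched in plain Python. This is a minimal illustrative implementation, not the text-dedup library's actual API; the class and function names, the SHA-256 double-hashing scheme, and the default parameters are all assumptions chosen for clarity.

```python
import hashlib
import math


class BloomFilter:
    """Minimal Bloom filter sketch (illustrative; not the text-dedup API)."""

    def __init__(self, capacity: int, error_rate: float):
        # Standard sizing: bit-array length m and hash count k for the
        # target error rate at the given capacity.
        self.m = math.ceil(-capacity * math.log(error_rate) / math.log(2) ** 2)
        self.k = max(1, round(self.m / capacity * math.log(2)))
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, text: str):
        # Derive k bit positions via double hashing of a single digest.
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def __contains__(self, text: str) -> bool:
        # Membership test: all k positions must be set.
        return all(self.bits[p // 8] >> (p % 8) & 1 for p in self._positions(text))

    def add(self, text: str) -> None:
        # Insertion: set all k positions.
        for p in self._positions(text):
            self.bits[p // 8] |= 1 << (p % 8)


def dedup(docs, capacity=10_000, error_rate=1e-4):
    """Single-pass exact dedup: keep a document only if its text is unseen."""
    bf = BloomFilter(capacity, error_rate)
    kept = []
    for doc in docs:  # sequential: the filter state is shared and mutable
        if doc in bf:      # possible duplicate (or rare false positive)
            continue
        bf.add(doc)        # record as seen
        kept.append(doc)
    return kept
```

Note that the loop drops a document on any positive membership test, so a false positive silently discards a unique document; this is the trade-off the error-rate parameter controls.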
Usage
Use this principle when exact-match deduplication is needed and memory efficiency is more important than supporting near-duplicate detection. Ideal for large corpora where exact copies need to be removed in a single streaming pass.
Theoretical Basis
A Bloom filter uses k independent hash functions over an m-bit array:

```
# Abstract Bloom filter operation (NOT real implementation)
for each document:
    if text in bloom_filter:       # all k hash positions are set
        mark_duplicate(document)
    else:
        bloom_filter.add(text)     # set k hash positions
```
The false positive rate after inserting n elements is approximately:

p ≈ (1 − e^(−kn/m))^k

Lowering p requires more bits per element (larger m) and a correspondingly tuned number of hash functions k.